KOSHIK - A Large-scale Distributed Computing Framework for NLP

Peter Exner, Pierre Nugues

2014

Abstract

In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.
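
The layered annotation idea in the abstract can be pictured with a small stand-off annotation sketch. This is not the authors' implementation: the `Document` and `Annotation` classes, the `tokenize` helper, and the layer names are hypothetical, and plain Python objects stand in for the Avro records and Hadoop jobs that KOSHIK actually uses. It only illustrates how components might add layers that point into the text by character offsets while the original text stays untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """A labelled span that points into the text by character offsets."""
    begin: int   # inclusive start offset
    end: int     # exclusive end offset
    label: str


@dataclass
class Document:
    """Source text plus named layers of stand-off annotations (hypothetical)."""
    text: str
    layers: Dict[str, List[Annotation]] = field(default_factory=dict)

    def add_layer(self, name: str, annotations: List[Annotation]) -> None:
        # A new layer is stored next to the text; the text itself is never edited.
        self.layers[name] = list(annotations)

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.begin:annotation.end]


def tokenize(doc: Document) -> List[Annotation]:
    """Toy whitespace tokenizer, a stand-in for a real NLP component."""
    spans, pos = [], 0
    for word in doc.text.split():
        begin = doc.text.index(word, pos)
        pos = begin + len(word)
        spans.append(Annotation(begin, pos, "token"))
    return spans


doc = Document("KOSHIK parses Wikipedia")
doc.add_layer("tokens", tokenize(doc))
# A later component builds its own layer by reading the token layer.
doc.add_layer(
    "capitalized",
    [Annotation(t.begin, t.end, "capitalized")
     for t in doc.layers["tokens"]
     if doc.covered_text(t)[0].isupper()],
)
```

Because every layer refers to the text only through offsets, a downstream step can read any earlier layer, or ignore it, without the document content ever changing.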

Paper Citation


in Harvard Style

Exner P. and Nugues P. (2014). KOSHIK - A Large-scale Distributed Computing Framework for NLP. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 463-470. DOI: 10.5220/0004707704630470

in Bibtex Style

@conference{icpram14,
author={Peter Exner and Pierre Nugues},
title={KOSHIK - A Large-scale Distributed Computing Framework for NLP},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={463-470},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004707704630470},
isbn={978-989-758-018-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - KOSHIK - A Large-scale Distributed Computing Framework for NLP
SN - 978-989-758-018-5
AU - Exner P.
AU - Nugues P.
PY - 2014
SP - 463
EP - 470
DO - 10.5220/0004707704630470