Building TALAA, a Free General and Categorized Arabic Corpus

Essma Selab, Ahmed Guessoum

2015

Abstract

Arabic natural language processing (ANLP) has attracted increasing interest over the last decade. However, the development of ANLP tools depends on the availability of large corpora. Unfortunately, the scientific community lacks large and varied Arabic corpora, especially freely accessible ones. As the Internet continues its exponential growth, Arabic Internet content has followed the same trend, yielding large amounts of textual data available through various Arabic websites. This paper describes the TALAA corpus, a large general Arabic corpus built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words, with 15,891,729 tokens contained in 57,827 different articles. A part of the TALAA corpus has been tagged to construct an annotated Arabic corpus of about 7,000 tokens, using a POS tagger with a set of 58 detailed tags. The annotated corpus was manually checked by two human experts. The methodology used to construct TALAA is presented, and various metrics are applied to it to show the usefulness of the corpus. The corpus can be made available to the scientific community upon authorisation.
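As an illustration of the kind of size figures quoted above (articles, tokens, distinct word types), the following is a minimal sketch of how such counts could be computed over a list of article texts. It is not the authors' pipeline: the function name `corpus_stats`, the regular expression `TOKEN_RE`, and the tokenization rule (runs of Arabic letters or digits) are assumptions made for this example only.

```python
import re
from collections import Counter

# Assumed tokenizer (not from the paper): a token is a run of Arabic
# letters in the range U+0621..U+064A, or a run of digits.
TOKEN_RE = re.compile(r"[\u0621-\u064A]+|\d+")


def corpus_stats(articles):
    """Return article, token, and distinct-type counts for a list of texts."""
    counts = Counter()
    for text in articles:
        counts.update(TOKEN_RE.findall(text))
    n_tokens = sum(counts.values())
    return {
        "articles": len(articles),
        "tokens": n_tokens,
        "types": len(counts),
        # Guard against an empty corpus to avoid division by zero.
        "tokens_per_article": n_tokens / len(articles) if articles else 0.0,
    }


if __name__ == "__main__":
    sample = ["ذهب الولد إلى المدرسة", "ذهب الرجل إلى السوق 2015"]
    print(corpus_stats(sample))
```

Applying the same ratio to the figures in the abstract, 15,891,729 tokens over 57,827 articles gives roughly 275 tokens per article.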


Paper Citation


in Harvard Style

Selab E. and Guessoum A. (2015). Building TALAA, a Free General and Categorized Arabic Corpus. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ICAART 2015) ISBN 978-989-758-073-4, pages 284-291. DOI: 10.5220/0005352102840291

in Bibtex Style

@conference{puanlp15,
author={Essma Selab and Ahmed Guessoum},
title={Building TALAA, a Free General and Categorized Arabic Corpus},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ICAART 2015)},
year={2015},
pages={284-291},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005352102840291},
isbn={978-989-758-073-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ICAART 2015)
TI - Building TALAA, a Free General and Categorized Arabic Corpus
SN - 978-989-758-073-4
AU - Selab E.
AU - Guessoum A.
PY - 2015
SP - 284
EP - 291
DO - 10.5220/0005352102840291