Stopwords Identification by Means of Characteristic and Discriminant Analysis

Giuliano Armano, Francesca Fanni, Alessandro Giuliani

2015

Abstract

Stopwords are meaningless, non-significant terms that occur frequently in a document. They should be removed as noise. Traditionally, two different approaches to building a stoplist have been used: the first considers the most frequent terms of a language (e.g., an English stoplist), while the second includes the most frequently occurring terms in a document collection. In several tasks, e.g., text classification and clustering, documents are typically grouped into categories. We propose a novel approach aimed at automatically identifying specific stopwords for each category. The proposal relies on two unbiased metrics that make it possible to analyze the informative content of each term: one measures discriminant capability and the other measures characteristic capability. For each term, the discriminant metric is expected to be high in proportion to the term's ability to distinguish a category from the others, whereas the characteristic metric is expected to be high when the term is frequent and common across all categories. A preliminary study and experiments have been performed to support this insight. Results confirm that, for each domain, the metrics easily identify specific stoplists which include both classical and category-dependent stopwords.
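The abstract does not give the formulas behind the two metrics, so the following Python sketch is only an illustration under stated assumptions, not the authors' definitions. It assumes that, for a term and a category, the discriminant score is the difference between the fraction of in-category documents containing the term and the fraction of out-of-category documents containing it, and that the characteristic score is the sum of those two fractions minus one. The function names `term_scores` and `candidate_stopwords`, and both thresholds, are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def term_scores(
    docs: List[Tuple[str, Set[str]]], category: str
) -> Dict[str, Tuple[float, float]]:
    """Return {term: (discriminant, characteristic)} for one category.

    ASSUMPTION: these formulas are an illustrative choice, not taken
    from the abstract. Both scores lie in [-1, 1].
    """
    inside = [terms for label, terms in docs if label == category]
    outside = [terms for label, terms in docs if label != category]
    if not inside or not outside:
        raise ValueError("need documents both inside and outside the category")

    vocabulary = set().union(*inside, *outside)
    scores = {}
    for term in vocabulary:
        # Fraction of documents containing the term, inside and outside.
        p_in = sum(term in d for d in inside) / len(inside)
        p_out = sum(term in d for d in outside) / len(outside)
        discriminant = p_in - p_out          # high: typical of the category only
        characteristic = p_in + p_out - 1.0  # high: common everywhere
        scores[term] = (discriminant, characteristic)
    return scores


def candidate_stopwords(
    scores: Dict[str, Tuple[float, float]],
    min_characteristic: float = 0.5,    # hypothetical threshold
    max_abs_discriminant: float = 0.25,  # hypothetical threshold
) -> List[str]:
    """Terms that are common across categories and barely discriminant."""
    return sorted(
        term
        for term, (disc, char) in scores.items()
        if char >= min_characteristic and abs(disc) <= max_abs_discriminant
    )


docs = [
    ("sport", {"the", "match", "goal"}),
    ("sport", {"the", "team", "match"}),
    ("tech", {"the", "cpu"}),
    ("tech", {"the", "match", "code"}),
]
sport_scores = term_scores(docs, "sport")
stoplist = candidate_stopwords(sport_scores)
```

In this toy collection, "the" appears in every document, so under the assumed formulas it has no discriminant power and maximal characteristic score, and it is the only term selected. Running the same procedure on the categories of a single domain would, under these assumptions, also surface terms that are frequent throughout that domain.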


Paper Citation


in Harvard Style

Armano G., Fanni F. and Giuliani A. (2015). Stopwords Identification by Means of Characteristic and Discriminant Analysis. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-074-1, pages 353-360. DOI: 10.5220/0005194303530360

in Bibtex Style

@conference{icaart15,
author={Giuliano Armano and Francesca Fanni and Alessandro Giuliani},
title={Stopwords Identification by Means of Characteristic and Discriminant Analysis},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2015},
pages={353-360},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005194303530360},
isbn={978-989-758-074-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Stopwords Identification by Means of Characteristic and Discriminant Analysis
SN - 978-989-758-074-1
AU - Armano G.
AU - Fanni F.
AU - Giuliani A.
PY - 2015
SP - 353
EP - 360
DO - 10.5220/0005194303530360