Categorization of Very Short Documents

Mika Timonen

2012

Abstract

Categorization of very short documents has become an important research topic in the field of text mining. Twitter status updates and market research data form an interesting corpus of documents that are in most cases fewer than 20 words long. Short documents have one major characteristic that differentiates them from traditional longer documents: each word usually occurs only once per document. This is called the TF=1 challenge. In this paper we conduct a comprehensive performance comparison of current feature weighting and categorization approaches using corpora of very short documents. In addition, we propose a novel feature weighting approach called Fragment Length Weighted Category Distribution that takes the challenges of short documents into consideration. The proposed approach is based on previous work on Bi-Normal Separation and on short document categorization using a Naive Bayes classifier. We compare the performance of the proposed approach against several traditional approaches, including Chi-Squared, Mutual Information, Term Frequency-Inverse Document Frequency and Residual Inverse Document Frequency. We also compare the performance of a Support Vector Machine classifier against other classification approaches such as k-Nearest Neighbors and Naive Bayes classifiers.
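
The TF=1 challenge mentioned in the abstract can be illustrated with a minimal sketch. The three example documents, the whitespace tokenizer, and the helper names `term_frequencies` and `tf_idf` are assumptions made for illustration; they are not taken from the paper, and the weighting shown is plain TF-IDF, not the proposed Fragment Length Weighted Category Distribution.

```python
from collections import Counter
import math

def term_frequencies(doc):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return Counter(doc.lower().split())

def tf_idf(docs):
    # Plain TF-IDF: raw term count times log(N / document frequency).
    tokenized = [term_frequencies(d) for d in docs]
    n = len(docs)
    df = Counter()
    for tf in tokenized:
        df.update(tf.keys())  # each document counts a term at most once
    return [{t: c * math.log(n / df[t]) for t, c in tf.items()}
            for tf in tokenized]

# Made-up short documents in the style of product feedback.
docs = [
    "great battery life on this phone",
    "phone screen cracked after one week",
    "battery drains too fast",
]

tfs = [term_frequencies(d) for d in docs]
# True when every word occurs exactly once in its document.
all_tf_one = all(c == 1 for tf in tfs for c in tf.values())
weights = tf_idf(docs)
```

In this toy corpus every term frequency is 1, so the TF factor carries no information and each TF-IDF weight reduces to the IDF term alone. That is one way to read why the abstract treats short documents as needing a different weighting scheme.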

Paper Citation


in Harvard Style

Timonen M. (2012). Categorization of Very Short Documents. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 5-16. DOI: 10.5220/0004108300050016

in Bibtex Style

@conference{kdir12,
author={Mika Timonen},
title={Categorization of Very Short Documents},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={5-16},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004108300050016},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Categorization of Very Short Documents
SN - 978-989-8565-29-7
AU - Timonen M.
PY - 2012
SP - 5
EP - 16
DO - 10.5220/0004108300050016