A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS

Daniel Osuna-Ontiveros, Ivan Lopez-Arevalo, Victor Sosa-Sosa

2011

Abstract

Information retrieval (IR) models process documents to prepare them for search by humans or computers. Early models relied on lexico-syntactic processing of documents, in which the relevance of a document retrieved by a query is based on the frequency of the query terms in that document. Another approach is to return predefined documents based on the type of query the user makes. More recently, some researchers have incorporated text mining techniques to enhance document retrieval. This paper proposes a semantic clustering approach that improves traditional information retrieval models by representing the topics associated with documents. The proposal combines text mining algorithms and natural language processing. The approach does not use a priori queries; instead, it clusters terms, where each cluster is a set of related words according to the content of the documents. As a result, a document-topic matrix representation is obtained that denotes the importance of topics within documents. For query processing, each query is represented as a set of clusters based on its terms. Thus, a similarity measure (e.g. cosine similarity) can be applied between this query array and the document matrix to retrieve the most relevant documents.
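The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the term clusters are hard-coded here rather than learned from document content, topic importance is approximated by raw term counts, and all names (`CLUSTERS`, `topic_vector`, `cosine`, `rank`, `DOCS`) are hypothetical. It only shows how a query mapped onto clusters can be compared against a document-topic matrix with cosine similarity.

```python
import math
from collections import Counter

# Hypothetical term clusters (topics). In the paper these are derived from
# the content of the documents; here they are fixed by hand for illustration.
CLUSTERS = {
    "retrieval": {"search", "query", "index", "ranking"},
    "clustering": {"cluster", "centroid", "grouping"},
    "language": {"syntax", "grammar", "parsing"},
}
TOPICS = sorted(CLUSTERS)  # fixed column order for every vector


def topic_vector(text):
    """Map a text to one weight per topic (assumption: raw term counts)."""
    counts = Counter(text.lower().split())
    return [float(sum(counts[term] for term in CLUSTERS[topic]))
            for topic in TOPICS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (document name, similarity) pairs, most similar first."""
    query_vec = topic_vector(query)
    # Document-topic matrix: one row of topic weights per document.
    matrix = {name: topic_vector(text) for name, text in docs.items()}
    scored = ((name, cosine(query_vec, row)) for name, row in matrix.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus, invented for this example only.
DOCS = {
    "d1": "query ranking and index search",
    "d2": "cluster centroid grouping of a cluster",
    "d3": "grammar parsing and syntax with search",
}

if __name__ == "__main__":
    for name, score in rank("search index", DOCS):
        print(name, round(score, 3))
```

Because the query and the documents are compared in topic space rather than term space, a document can match a query without sharing its exact words, as long as both fall into the same clusters.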

Paper Citation


in Harvard Style

Osuna-Ontiveros D., Lopez-Arevalo I. and Sosa-Sosa V. (2011). A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011) ISBN 978-989-8425-79-9, pages 280-285. DOI: 10.5220/0003663802880293

in Bibtex Style

@conference{kdir11,
author={Daniel Osuna-Ontiveros and Ivan Lopez-Arevalo and Victor Sosa-Sosa},
title={A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={280-285},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003663802880293},
isbn={978-989-8425-79-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS
SN - 978-989-8425-79-9
AU - Osuna-Ontiveros D.
AU - Lopez-Arevalo I.
AU - Sosa-Sosa V.
PY - 2011
SP - 280
EP - 285
DO - 10.5220/0003663802880293