TOPIC EXTRACTION FROM DIVIDED DOCUMENT SETS

Takeru Yokoi, Hidekazu Yanagimoto

2009

Abstract

We propose a method for extracting topics from a large document set by combining the topics extracted from its divisions. Topics are extracted by applying to each document set a Sparse Non-negative Matrix Factorization that imposes a sparseness constraint only on the basis matrix, which we call SNMF/L. Because topic extraction with SNMF/L takes a long time when the number of documents is large, it is useful to combine topics extracted from several small document sets instead. In this paper, we shorten the processing time of topic extraction from a large document set by combining the topics extracted from each divided document set. We evaluate the proposed method in two ways: by the correspondence between the combined topics and the topics extracted directly from the large document set with SNMF/L, and by the processing times of SNMF/L.
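The divide-and-combine idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sparse_nmf` is a hypothetical stand-in that uses multiplicative updates with an L1 penalty on the basis matrix rather than the SNMF/L solver itself, and `combine_topics` assumes that the divisional topic vectors are stacked and factored again, a step the abstract does not specify. The function names, the parameter `lam`, and the synthetic data are all illustrative.

```python
import numpy as np

def sparse_nmf(A, k, lam=0.1, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative A (terms x docs) as W @ H with an L1 penalty on W.

    Stand-in for SNMF/L (assumption): multiplicative updates, not the paper's solver.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Update the coefficient matrix (no penalty on H in this sketch).
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        # Update the basis matrix; lam in the denominator shrinks W toward zero.
        W *= (A @ H.T) / (W @ (H @ H.T) + lam + eps)
        # Remove the scale ambiguity: unit-norm rows of H, scale moved into W.
        norms = np.linalg.norm(H, axis=1) + eps
        H /= norms[:, None]
        W *= norms[None, :]
    return W, H

def topics_from_divisions(A, n_div, k, **kw):
    # Split the document columns into n_div divisions; extract k topics from each.
    parts = np.array_split(np.arange(A.shape[1]), n_div)
    return [sparse_nmf(A[:, idx], k, **kw)[0] for idx in parts]

def combine_topics(bases, k, **kw):
    # Assumption: treat each divisional topic vector as a pseudo-document
    # and factor the stacked matrix again to obtain k combined topics.
    stacked = np.hstack(bases)  # terms x (n_div * k)
    return sparse_nmf(stacked, k, **kw)[0]

def match_topics(W_a, W_b, eps=1e-12):
    # Cosine similarity between topic columns; best match in W_b for each column of W_a.
    Na = W_a / (np.linalg.norm(W_a, axis=0) + eps)
    Nb = W_b / (np.linalg.norm(W_b, axis=0) + eps)
    sim = Na.T @ Nb
    return sim.argmax(axis=1), sim.max(axis=1)

# Synthetic sparse term-document matrix (illustrative data only).
rng = np.random.default_rng(1)
A = rng.random((50, 120))
A[A < 0.7] = 0.0

W_direct, _ = sparse_nmf(A, k=4)                       # topics from the whole set
W_combined = combine_topics(topics_from_divisions(A, n_div=3, k=4), k=4)
idx, sim = match_topics(W_combined, W_direct)          # correspondence between topics
```

Each divisional factorization works on a matrix with fewer columns than the full set, which is where the time saving described above would be expected to come from. The cosine matching is one simple way to examine correspondence between combined and directly extracted topics.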


Paper Citation


in Harvard Style

Yokoi T. and Yanagimoto H. (2009). TOPIC EXTRACTION FROM DIVIDED DOCUMENT SETS. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 654-659. DOI: 10.5220/0001822106540659

in Bibtex Style

@conference{webist09,
author={Takeru Yokoi and Hidekazu Yanagimoto},
title={TOPIC EXTRACTION FROM DIVIDED DOCUMENT SETS},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={654-659},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001822106540659},
isbn={978-989-8111-81-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - TOPIC EXTRACTION FROM DIVIDED DOCUMENT SETS
SN - 978-989-8111-81-4
AU - Yokoi T.
AU - Yanagimoto H.
PY - 2009
SP - 654
EP - 659
DO - 10.5220/0001822106540659