SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE

Chris Biemann, Uwe Quasthoff

2007

Abstract

In this paper, a unified framework for clustering documents based on vocabulary overlap and in-link similarity is presented. Each document is described by a small number of non-zero attributes taken from a very large set of possible attributes, which keeps comparisons efficient. We show that (A) low-frequency words are excellent attributes for textual documents and (B) the sources of in-links are excellent attributes for web documents. In the case of web documents, co-occurrence analysis is used to identify similarity. The documents are represented as nodes in a graph whose edges are weighted by similarity, and a graph clustering algorithm is applied to group similar documents together. An evaluation against a gold standard is provided for textual documents.
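
The pipeline outlined in the abstract (sparse attributes, a similarity graph, graph clustering) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the function names, the toy documents, the `max_doc_freq` and `min_shared` thresholds, whitespace tokenization, and the use of connected components as a stand-in for the graph clustering algorithm, which the abstract does not name.

```python
from collections import Counter, defaultdict
from itertools import combinations


def low_frequency_attributes(docs, max_doc_freq=2):
    """Keep, per document, only words found in 2..max_doc_freq documents.

    The thresholds are hypothetical; words seen in a single document are
    dropped because they can never contribute to an overlap.
    """
    tokenized = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(words)
    return {
        doc_id: {w for w in words if 2 <= doc_freq[w] <= max_doc_freq}
        for doc_id, words in tokenized.items()
    }


def similarity_graph(attributes, min_shared=1):
    """Return {(doc_a, doc_b): shared attribute count} as weighted edges.

    An inverted index restricts comparisons to documents that share at
    least one attribute, so sparse attribute sets stay cheap to compare.
    """
    index = defaultdict(set)
    for doc_id, attrs in attributes.items():
        for attr in attrs:
            index[attr].add(doc_id)
    weights = Counter()
    for doc_ids in index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}


def connected_components(nodes, edges):
    """Group nodes by connectivity (a simple stand-in for graph clustering)."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Toy collection (invented for this sketch).
docs = {
    "d1": "the quasar emits radiation near the pulsar",
    "d2": "a pulsar and a quasar in the same galaxy",
    "d3": "the sourdough starter needs rye flour",
    "d4": "rye sourdough bread is the best",
}
attributes = low_frequency_attributes(docs)
graph = similarity_graph(attributes)
clusters = connected_components(docs.keys(), graph.keys())
```

In this toy run the frequent word "the" is discarded as an attribute, while rarer shared words such as "quasar" and "rye" link the documents into two groups. For web documents, the same functions could be fed sets of in-link sources instead of words, again as an assumption about how the abstract's idea might be realized.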

Paper Citation


in Harvard Style

Biemann C. and Quasthoff U. (2007). SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE. In Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-972-8865-78-8, pages 130-135. DOI: 10.5220/0001260601300135

in Bibtex Style

@conference{webist07,
author={Chris Biemann and Uwe Quasthoff},
title={SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE},
booktitle={Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2007},
pages={130-135},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001260601300135},
isbn={978-972-8865-78-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - SIMILARITY OF DOCUMENTS AND DOCUMENT COLLECTIONS USING ATTRIBUTES WITH LOW NOISE
SN - 978-972-8865-78-8
AU - Biemann C.
AU - Quasthoff U.
PY - 2007
SP - 130
EP - 135
DO - 10.5220/0001260601300135