A Comparison of Document Clustering Algorithms
Yong Wang and Julia Hodges
Department of Computer Science & Engineering, Mississippi State University
Box 9637, Mississippi State, MS 39762
Abstract. Document clustering is a widely used strategy for information retrieval and text data mining. This paper describes preliminary work in our ongoing research on document clustering problems. A prototype document clustering system has been implemented and some basic aspects of the document clustering problem have been studied. Our experimental results demonstrate that the average-link inter-cluster distance measure and the TFIDF weighting function are good choices for the document clustering problem. Other investigators have indicated that the bisecting K-means method is the preferred method for document clustering. However, in our research we have found that, whereas the bisecting K-means method has advantages when working with large datasets, a traditional hierarchical clustering algorithm still achieves the best performance for small datasets.
1 Introduction
Data clustering partitions a set of unlabeled objects into groups, or clusters, which may be disjoint or overlapping. In a good clustering, the objects within a cluster are very similar, while objects in different clusters are very different. When the data being processed are documents, the task is called document clustering. Document clustering is very important and useful in the information retrieval area, where it can be applied to help users retrieve the documents they need. Generally, the output of an information retrieval system is a list of documents ranked by their estimated relevance to the query. When the volume of the information database is small and the query formulated by the user is well defined, this ranked-list approach is effective. But for an enormous information source, such as the World Wide Web, and a poorly specified query (just one or two keywords), it is difficult for the retrieval system to identify the items of interest to the user. Applying document clustering to the retrieved documents can make it easier for users to browse their results and locate what they want quickly. A successful example of this application is VIVISIMO (http://vivisimo.com/), a Web search engine that organizes search results with document clustering. Another application of document clustering is the automated or semi-automated creation of document taxonomies; a well-known example of such a taxonomy for Web documents is Yahoo.
This paper describes our preliminary work on document clustering problems. A prototype document clustering system has been implemented and some basic aspects of the document clustering problem have been studied. In Section 2, some document clustering algorithms are introduced. Section 3 presents our
experimental results and analysis for different cluster distance measures, different
weighting functions, and different clustering algorithms. Section 4 lists our final
conclusions.
2 Document Clustering Algorithms
Hierarchical clustering generates a hierarchical tree of clusters. Hierarchical methods can be further classified into agglomerative (HAC) methods and divisive methods. Partitioning clustering methods allocate the data into a fixed number of non-empty clusters, all at the same level. The most well-known partitioning methods following this principle are the K-means method and its variants. The buckshot method is a combination of the K-means and HAC methods: the HAC method is used to select the seeds and then the K-means method is applied. The buckshot method is used successfully in a well-known document clustering system, the Scatter/Gather (SG) system [2]. The K-means method can also be used to generate hierarchical clusters. Steinbach, Karypis, and Kumar proposed the bisecting K-means algorithm, which generates hierarchical clusters by applying the basic K-means method recursively [7].
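To make the buckshot idea concrete, the following is a minimal sketch, not the Scatter/Gather implementation itself: HAC is run on a random sample of roughly sqrt(kn) documents to produce k seed centroids, which then initialize K-means over the whole collection. The use of SciPy's average-link routine with cosine distance, and the sqrt(kn) sample size, are our assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def buckshot(X, k, seed=0):
    """Buckshot sketch: HAC on a random sample picks the seeds, K-means refines.

    X is an (n_docs, n_terms) matrix; k is the desired number of clusters.
    This is an illustrative sketch, not the Scatter/Gather system's code.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample_size = min(n, int(np.sqrt(k * n)))  # classic buckshot sample size
    sample = X[rng.choice(n, size=sample_size, replace=False)]

    # Average-link HAC on the sample, cut into k groups.
    labels = fcluster(linkage(sample, method="average", metric="cosine"),
                      t=k, criterion="maxclust")

    # Use the group centroids as K-means seeds for the full collection.
    seeds = np.vstack([sample[labels == c].mean(axis=0) for c in range(1, k + 1)])
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit_predict(X)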
Besides these basic clustering algorithms, several algorithms specific to document clustering have been proposed. Zamir has described the use of a suffix tree for document clustering [8]. Beil, Ester, and Xu proposed two clustering methods based on frequent term sets, FTC (Frequent Term-based Clustering) and HFTC (Hierarchical Frequent Term-based Clustering) [1]. Fung proposed another hierarchical document clustering method based on frequent term sets, FIHC (Frequent Itemset-based Hierarchical Clustering), to improve on the HFTC method [3].
3 Experiments
3.1 Experimental Data and Evaluation Method
We collected 10,000 abstracts from journals belonging to ten different areas. For each area, 1000 abstracts were collected. Table 1 lists the areas and the names of the journals. This full data set was divided evenly into 5 subsets, named FDS 1 - 5; each subset contains 200 abstracts from each category. Another, smaller data set of 1000 abstracts covering the 10 categories was selected from the full dataset. This mini dataset was also partitioned evenly into 5 groups, named MDS 1 - 5.
All of the abstracts were segmented into sentences with MXTERMINATOR [6]. The tokens were then identified in each sentence with the Penn Treebank tokenizer, and the lemmatizer in WordNet was used to convert each token into its lemma [4]. All stop words were filtered out. Finally, each document was converted into a list of lemmas, which were used to construct the feature vector for that document.
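The sketch below reproduces this pipeline with NLTK standing in for MXTERMINATOR and the Penn Treebank tokenizer (our substitution for illustration, not the tools used in our experiments); WordNet lemmatization and stop-word filtering follow the steps described above.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

for pkg in ("punkt", "wordnet", "stopwords"):
    nltk.download(pkg, quiet=True)  # fetch required NLTK data once

def abstract_to_lemmas(text):
    """Segment into sentences, tokenize, lemmatize, and drop stop words.

    NLTK replaces MXTERMINATOR and the Penn Treebank tokenizer here --
    an assumed substitution, not the paper's exact tools.
    """
    lemmatizer = WordNetLemmatizer()
    stops = set(stopwords.words("english"))
    lemmas = []
    for sentence in sent_tokenize(text):                 # sentence segmentation
        for token in word_tokenize(sentence):            # tokenization
            lemma = lemmatizer.lemmatize(token.lower())  # WordNet lemma
            if lemma.isalpha() and lemma not in stops:   # stop-word filter
                lemmas.append(lemma)
    return lemmas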
The evaluation methods F-measure and entropy, which have been used by a number of researchers, including Steinbach, Karypis, and Kumar [7], are used in our experiments. Both entropy and F-measure are external quality measures. The F-measure is a combination of precision and recall, which come from the information retrieval area; a higher F-measure indicates better performance. The entropy of each cluster reflects its homogeneity; a smaller value indicates higher homogeneity. The detailed formulas for F-measure and entropy are provided in [7].
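For the flat (non-hierarchical) case, the two measures can be computed as in the sketch below, which follows the standard definitions used in [7]: the best per-class F score weighted by class size, and the per-cluster label entropy weighted by cluster size. Variable names are our own.

import math
from collections import Counter

def f_measure_and_entropy(true_labels, cluster_labels):
    """External quality measures in the style of Steinbach et al. [7]."""
    n = len(true_labels)
    counts = Counter(zip(cluster_labels, true_labels))  # n_ij
    cluster_sizes = Counter(cluster_labels)             # n_j
    class_sizes = Counter(true_labels)                  # n_i

    # F-measure: for each class, take the best F over all clusters,
    # weighted by class size. Higher is better.
    total_f = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = counts.get((clu, cls), 0)
            if n_ij:
                p, r = n_ij / n_j, n_ij / n_i  # precision, recall
                best = max(best, 2 * p * r / (p + r))
        total_f += (n_i / n) * best

    # Entropy: per-cluster label entropy, weighted by cluster size.
    # Smaller is better (more homogeneous clusters).
    total_e = 0.0
    for clu, n_j in cluster_sizes.items():
        probs = [counts[(clu, cls)] / n_j
                 for cls in class_sizes if counts.get((clu, cls))]
        total_e += (n_j / n) * -sum(p * math.log(p) for p in probs)

    return total_f, total_e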
Table 1. Journal Abstracts Data Set

Area                      Journal Name
Artificial Intelligence   Artificial Intelligence
Ecology                   Journal of Ecology
Economy                   Economic Journal
History                   Historical Abstracts
Linguistics               Journal of Linguistics
Material                  Journal of Electronic Materials
Nuclear                   IEEE Transactions on Nuclear Science
Proteomics                PubMed
Sociology                 Journal of Sociology
Statistics                Journal of Applied Statistics, Regression Analysis
Table 2. Comparison of Inter-Cluster Distance Measures

             FDS 1  FDS 2  FDS 3  FDS 4  FDS 5  MDS 1  MDS 2  MDS 3  MDS 4  MDS 5
F-measure
  S-link      0.19   0.23   0.25   0.19   0.19   0.52   0.35   0.40   0.37   0.50
  C-link      0.18   0.18   0.18   0.18   0.18   0.18   0.18   0.18   0.18   0.19
  A-link      0.74   0.55   0.60   0.59   0.66   0.77   0.77   0.78   0.63   0.77
Entropy
  S-link      2.17   2.08   2.04   2.19   2.17   1.29   1.70   1.60   1.62   1.33
  C-link      2.29   2.29   2.29   2.29   2.29   2.20   2.20   2.20   2.20   2.17
  A-link      0.73   1.19   0.98   1.09   0.91   0.58   0.51   0.52   0.84   0.64
3.2 Experimental Results and Analysis
Comparison of Different Cluster Distance Measures in the HAC Method
There are three generally used distance measures between clusters: single link (minimum distance), complete link (maximum distance), and average link (average distance). The comparison results for these three measures are listed in Table 2. For both F-measure and entropy, the average-link method achieved the best performance on all full data sets and mini data sets. This result is consistent with Steinbach, Karypis, and Kumar [7]. Unlike the single-link and complete-link measures, the average-link measure considers every pairwise distance between the elements of the two clusters and averages them, so it reflects a global relatedness between the two clusters.
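The three measures differ only in how they reduce the matrix of pairwise distances between two clusters, as the following sketch shows (cosine distance is our assumption here, as is common for document vectors):

import numpy as np
from scipy.spatial.distance import cdist

def inter_cluster_distance(A, B, method="average"):
    """Single-, complete-, and average-link distances between two clusters.

    A and B are (n_a, d) and (n_b, d) arrays of document vectors.
    """
    d = cdist(A, B, metric="cosine")  # all pairwise distances
    if method == "single":            # minimum pairwise distance
        return d.min()
    if method == "complete":          # maximum pairwise distance
        return d.max()
    return d.mean()                   # average of all pairwise distances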
Comparison of Term Weighting Methods
Different term weighting methods may be used in the vector space model. The simplest is a binary weighting function, in which a value of 1 indicates the occurrence of a term and a value of 0 indicates its absence. Another term weighting method for w_ij is term frequency (tf), in which each feature vector is represented by a list of the occurrence frequencies of all terms. To avoid the effect of the varying lengths of the documents, all occurrence frequencies should be normalized before being used. Inverse document frequency (idf) downweights terms that occur in too many documents. The most widely used term weighting method in the vector space model is tf-idf, which is a combination of tf and idf.
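As an illustration, a minimal tf-idf computation over tokenized documents might look like the sketch below. Our text does not spell out exact formulas, so the length normalization for tf and the log(N/df) form of idf are assumptions; the binary weight would simply be 1 for any term present, and tf alone is the normalized count.

import math
from collections import Counter

def tfidf_weights(docs):
    """tf-idf weights for a list of token lists (one list per document).

    tf is the length-normalized term count; idf = log(N / df) is one
    common variant -- both are assumptions, not the paper's exact formulas.
    """
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(N / df[term])  # tf * idf
            for term, count in counts.items()
        })
    return weights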
Binary (Bi), term frequency (Tf), and term frequency-inverse document frequency (Tf-Idf) weighting were tested in this experiment, using HAC with the average-link cluster distance measure as the clustering method. The results are listed in Table 3. Notice that TFIDF had the best results for all data sets in both F-measure and entropy. From the descriptions of the three methods given above, we know that both the binary method and the TF method weight a feature based only on the individual document. They assume that a feature with a high frequency in a document may be a key word for that document and should be useful for distinguishing it from others. This assumption is untenable when the feature occurs in most or all documents in the collection. The IDF measure helps to evaluate the importance of a word to the whole collection. As a combination of the TF and IDF methods, the TFIDF weighting function tries to select the features that are important for both the individual document and the whole collection.
Table 3. Comparison of Term Weighting Methods

             FDS 1  FDS 2  FDS 3  FDS 4  FDS 5  MDS 1  MDS 2  MDS 3  MDS 4  MDS 5
F-measure
  Bi          0.27   0.23   0.30   0.36   0.19   0.51   0.65   0.44   0.43   0.31
  Tf          0.27   0.23   0.31   0.44   0.18   0.73   0.64   0.53   0.55   0.55
  Tf-Idf      0.74   0.55   0.60   0.59   0.66   0.77   0.78   0.78   0.63   0.77
Entropy
  Bi          2.00   2.16   1.96   1.75   2.24   1.12   0.81   1.40   1.39   1.79
  Tf          2.03   2.15   1.81   1.52   2.28   0.59   0.78   1.04   1.03   1.04
  Tf-Idf      0.73   1.19   0.98   1.09   0.91   0.58   0.51   0.52   0.84   0.64
Comparison of Clustering Algorithms
Four basic clustering algorithms, K-means, buckshot, HAC, and bisecting K-means, were selected for comparison. In this experiment, the K-means, buckshot, and bisecting K-means methods were each executed 20 times to alleviate the effect of random initialization; the F-measure and entropy values listed here are the averages of the 20 results. On the five full data sets, we found that the bisecting method outperforms all the other methods, while the K-means and buckshot methods achieve similar results. Only on the first dataset does the HAC method get a result similar to those of K-means and buckshot; for the other four datasets, the results of the HAC method are lower than those of the K-means and buckshot methods by about 10-15 percentage points. On the five mini data sets, HAC achieves the best performance, while the results of K-means, buckshot, and bisecting K-means are similar and low.
Our results for the full data sets are consistent with the results of Steinbach, Karypis, and Kumar [7]. Their analysis explained that the nature of the document clustering problem is the reason for the worse performance of hierarchical approaches and the good performance of the bisecting K-means method. In all of these clustering algorithms, two documents that share more common words are considered more similar to each other. The problem is that two documents sharing the same set of words may be about two totally different topics, so it is quite possible that the nearest neighbors of a document belong to different categories. In the HAC method, once two documents are assigned to the same group, they will always remain in the same group. This assignment may be optimal at that step, but from the view of the whole partition it may not be: the HAC method seeks only local optimality at each step, with no attempt at global optimality. The advantage of the K-means, buckshot, and bisecting K-means methods is that they adjust each cluster after each iteration; this reassignment is helpful for reaching a globally better partition. Compared with the K-means and buckshot methods, bisecting K-means generates more evenly partitioned clusters, and a balanced performance for each cluster is helpful for a higher global result.
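A minimal sketch of the bisecting K-means idea from [7] is given below. Splitting the largest cluster at each step is one of the selection criteria Steinbach et al. consider, and scikit-learn's KMeans stands in for the basic K-means routine; both choices are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    """Bisecting K-means sketch: repeatedly split the largest cluster in
    two with basic K-means until k clusters remain.

    X is an (n_docs, n_terms) matrix. This illustrates the idea in [7];
    the original also considers other cluster-selection criteria.
    """
    clusters = [np.arange(X.shape[0])]  # start with everything in one cluster
    while len(clusters) < k:
        # Pick the largest remaining cluster to bisect.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # Basic 2-means split of the chosen cluster.
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters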
Table 4. Comparison of Clustering Algorithms

             FDS 1  FDS 2  FDS 3  FDS 4  FDS 5  MDS 1  MDS 2  MDS 3  MDS 4  MDS 5
F-measure
  KM          0.77   0.72   0.79   0.72   0.74   0.41   0.48   0.42   0.39   0.42
  BS          0.73   0.73   0.75   0.74   0.72   0.51   0.51   0.47   0.48   0.48
  HAC         0.74   0.55   0.60   0.59   0.66   0.77   0.77   0.78   0.63   0.77
  BiKM        0.90   0.85   0.87   0.86   0.88   0.38   0.40   0.34   0.36   0.37
Entropy
  KM          0.60   0.73   0.60   0.74   0.67   1.50   1.35   1.50   1.61   1.50
  BS          0.67   0.70   0.65   0.72   0.71   1.28   1.24   1.38   1.37   1.34
  HAC         0.73   1.19   0.98   1.09   0.91   0.58   0.51   0.52   0.84   0.64
  BiKM        0.40   0.50   0.46   0.51   0.45   1.60   1.55   1.71   1.63   1.63
Why, then, is the HAC clustering method the best one for the mini datasets? We think the major reason is the size of the data set. There are 2000 abstracts in each full data set, which are treated as 2000 singleton clusters in the initial step of the HAC method. In each iteration of the HAC method, the two nearest clusters are merged to obtain local optimality. This locally optimal choice may not be helpful, and may even be harmful, for later iterations. The loop must be repeated 1990 times to reach the final 10 clusters, so the advantages of the individual steps can counteract one another. Since there is no global reassignment procedure in the HAC method, the final partition cannot be improved at later steps. In our mini data sets, there are only 200 abstracts in each set, so the iteration is repeated only 190 times; the benefit of the locally optimal merges still outweighs that of the reassignment step in the K-means, buckshot, and bisecting methods. When we checked the detailed debugging information for the K-means, buckshot, and bisecting methods, we found that on the mini data sets they ran for only 1 or 2 iterations. This means that, for those small datasets, the results of K-means, buckshot, and the bisecting method are similar to those of a random partitioning. Notice that on the full data sets, the performance of the buckshot method is similar to that of the K-means method, but on all five mini data sets the buckshot method is consistently better than K-means. This demonstrates that the use of HAC for seed selection is helpful for small data sets but not for large ones.
Eight data sets were used for evaluation in Steinbach, Karypis, and Kumar's experiments [7]. The largest data set contains about 3000 documents and the smallest about 1000 documents, sizes similar to those of our full data sets. This suggests that the good performance of the bisecting K-means method in Steinbach, Karypis, and Kumar's experiments was also obtained on large data sets.
4 Conclusions
In this paper, we described a general document clustering system prototype and discussed three ways to achieve better performance. A brief overview of different document clustering algorithms, vector weighting functions, and distance measures was provided. From our experimental results, we can draw the following conclusions.
For the HAC clustering method, the average-link inter-cluster distance measure is better than the single-link and complete-link measures. For weighting functions in the vector space model, the TFIDF method is better than the binary method and the TF method. The TFIDF method assigns a weight to a feature by combining its importance within a document and its distinguishing power over the whole document set; an important term may be a medium-frequency word rather than a high-frequency word (too common) or a low-frequency word (too particular). For large data sets, the bisecting K-means algorithm outperforms all the other methods, but for small data sets, the HAC method achieves the best performance. The K-means method performs similarly to the buckshot method for large data sets, but for small data sets, the buckshot method is better than the K-means method; the advantage of the HAC method on small data sets improves the performance of the buckshot method.
References
1. F. Beil, M. Ester, and X. Xu, “Frequent Term-Based Text Clustering,” Proc. of the 8th
International Conference on Knowledge Discovery and Data Mining, 2002.
2. D. Cutting, D. Karger, J. Pedersen, and J. Tukey, “Scatter/Gather: a Clusterbased Approach
to Browsing Large Document Collection,” Proc. of the 15th ACM SIGIR Conference,
Copenhagen, Denmark, 1992, pp. 318-329.
3. B. C. M. Fung, Hierarchical Document Clustering Using Frequent Itemsets, Master Thesis,
Dept. Computer Science, Simon Fraser University, Canada, 2002.
4. G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, “Introduction to
WordNet: An On-Line Lexical Database,” International Journal of Lexicography, vol. 3, no.
4, 1990, pp. 235-312.
5. A. Ratnaparkhi, “A Maximum Entropy Part-Of-Speech Tagger,Proc. of the Empirical
Methods in Natural Language Processing Conference, University of Pennsylvania, May
1996, pp. 17-18.
6. J. C. Reynar and A. Ratnaparkhi, “A Maximum Entropy Approach to Identifying Sentence
Boundaries,” Proc. of the Fifth Conference on Applied Natural Language Processing,
Washington, D.C., March 31-April 3, 1997.
7. M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering
Techniques,” KDD Workshop on Text Mining, 2000.
8. O. Zamir, Clustering Web Documents: A Phrase-Based Method for Group Search Engine
Results, Ph.D. dissertation, Dept. Computer Science & Engineering, Univ. of Washington,
1999.