2009 IEEE International Conference on Systems, Man, and Cybernetics
Abstract
Document clustering is the process of partitioning a set of unlabeled documents into clusters such that the documents in each cluster share some common concepts. To make the clusters easy to analyze, each concept is most conveniently represented by a few key terms. For a clustering algorithm, most of the cost lies in the classification phase. With words as features, text data are represented in a very high-dimensional vector space. We have studied a comparative advantage based algorithm for clustering sparse data; it uses a single "ruler" instead of k centers to identify the comparative advantage of each cluster and to assign a cluster label to each document. However, that algorithm considers only the relative strength between clusters and ignores the relationship between terms. In this paper, we propose a weighted comparative advantage based clustering algorithm. Experimental results on the SMART system databases show that the new algorithm outperforms the simple comparative advantage algorithm without any extra computation time. Compared with k-means, it not only achieves comparable results but also significantly accelerates the clustering procedure.
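To illustrate why word features give high-dimensional, sparse data, the following minimal sketch builds term-count vectors for a toy corpus. The three example documents, the whitespace tokenization, and the helper names `build_vocabulary` and `to_sparse_vector` are assumptions made for illustration only; they are not taken from the paper and do not implement the comparative advantage algorithm itself.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct term to a column index, in sorted order."""
    terms = sorted({t for d in docs for t in d.lower().split()})
    return {t: i for i, t in enumerate(terms)}

def to_sparse_vector(doc, vocab):
    """Return {column index: term count}, storing only nonzero entries."""
    counts = Counter(doc.lower().split())
    return {vocab[t]: c for t, c in counts.items() if t in vocab}

# Toy corpus (hypothetical, for illustration only).
docs = [
    "clustering groups similar documents",
    "documents share common terms",
    "sparse vectors store few terms",
]

vocab = build_vocabulary(docs)
vectors = [to_sparse_vector(d, vocab) for d in docs]

# One dimension per distinct term in the corpus.
dimension = len(vocab)
# Fraction of entries in the document-term matrix that are nonzero.
nonzero = sum(len(v) for v in vectors)
density = nonzero / (dimension * len(docs))
```

Even in this tiny corpus each document touches only a fraction of the vocabulary. In a real collection the vocabulary grows to many thousands of terms while each document still uses comparatively few, which is what makes algorithms designed for sparse data attractive.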