Simultaneous Clustering for Mining Texts / Balbi, S. - (2009). (IFCS@GfKl, Dresden University (D), 13-18 March).

Simultaneous Clustering for Mining Texts

BALBI, SIMONA
2009

Abstract

In text mining procedures, clustering techniques are fundamental tools for reducing the huge amount of textual data to be explored. From a statistical perspective, some preliminary questions have to be settled, the first being how to structure the data. Here we adopt Lebart and Salem's viewpoint: in analysing a corpus, the statistical unit is the occurrence of a word in a document. The data structure to be dealt with is therefore the peculiar contingency table cross-classifying words by documents. We then have to choose the proper clustering algorithm and the proper criterion. In this paper, we focus on clustering documents by means of a simultaneous algorithm. The literature has often underlined the advantages of this approach, even when the aim is one-dimensional: it eases the computational burden (word-document matrices are always sparse and high-dimensional), and it optimises the objective function on the basis of the association/dependence structure in the table. Additionally, with an exploratory aim, interpreting document clusters in terms of clusters of words is more informative than descriptions based on single words. Our proposal mainly concerns the proper measure for computing similarities. To obtain an optimal partition, we consider the discrimination power of words in identifying groups of documents sharing the same content. To this end, we propose the use of the TF/IDF index, which takes into account both the frequency and the discrimination power of each term in the corpus. Starting from van Mechelen et al.'s taxonomy of two-mode clustering methods, we place our proposal in the family of methods that imply row/column partitions, within a deterministic approach. We will show the effectiveness of our proposal on the basis of case studies, comparing our results with those obtained by a usual one-mode partitioning method.
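As a rough illustration of the weighting mentioned in the abstract, the following sketch builds a small words-by-documents contingency table and applies a TF/IDF weight to it. The abstract does not say which TF/IDF variant is used, so the formula here (raw frequency times log(N/df)) is an assumption, as are the toy documents and the helper names `term_document_table` and `tf_idf`. It is not the author's implementation.

```python
import math
from collections import Counter


def term_document_table(docs):
    """Build a words-by-documents table as {word: [count in each document]}."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return {word: [c[word] for c in counts] for word in vocab}


def tf_idf(table):
    """Weight each cell by tf * log(N / df); this variant is an assumption."""
    n_docs = len(next(iter(table.values())))
    weighted = {}
    for word, row in table.items():
        # df = number of documents in which the word occurs at least once
        df = sum(1 for freq in row if freq > 0)
        idf = math.log(n_docs / df)
        weighted[word] = [freq * idf for freq in row]
    return weighted


# Hypothetical toy corpus, for illustration only.
docs = [
    "clustering words and documents",
    "clustering documents by words",
    "sparse tables of words",
]
table = term_document_table(docs)
weights = tf_idf(table)
```

In this toy corpus a word occurring in every document, such as "words", gets weight zero in all columns, while a word confined to one document keeps a positive weight there. This is the sense in which the index combines frequency with discrimination power.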

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/350239