Clustering of documents from a two-way viewpoint

Balbi, Simona; Miele, Raffaele; Scepi, Germana

Methods for clustering high-dimensional data represent a prolific research area in data mining, one that has produced a large number of provisional solutions. In text mining and in the analysis of gene expression data, the idea of bidimensional clustering arose, in the sense of finding clusters of documents characterized by clusters of terms (and, analogously, clusters of genes and clusters of different experimental conditions). Although we are often more interested in clustering only one way of our data structure, co-clustering seems to be convenient from both an interpretative and a computational viewpoint. Here we try to frame the problem in a multidimensional data analysis perspective, referring to classic association and/or prediction indexes for contingency tables. Following previous works, we propose the use of a predictability index, Goodman and Kruskal's tau-b, for dealing with documents-by-terms tables. After a quick review of the wide literature on two-way clustering, mainly developed in microarray analysis, we propose a new algorithm belonging to the genetic family, based on the optimization of the predictability index tau-b. We present experimental results to show the effectiveness of our co-clustering algorithm in practice.
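
As an illustration of the kind of index mentioned in the abstract, the following minimal sketch computes Goodman and Kruskal's tau for a contingency table and evaluates it on a documents-by-terms table collapsed by a candidate co-clustering. It is not the authors' algorithm: the function names `goodman_kruskal_tau` and `collapse`, the toy table, the choice to predict term clusters from document clusters, and the use of the collapsed table as the objective are all assumptions made for this example, and the genetic search itself is not shown.

```python
import numpy as np


def goodman_kruskal_tau(table):
    """Goodman and Kruskal's tau for predicting the column category
    from the row category of a contingency table (assumed direction)."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                      # joint relative frequencies
    row = p.sum(axis=1)                  # row marginals
    col = p.sum(axis=0)                  # column marginals
    keep = row > 0                       # skip empty rows to avoid 0/0
    within = (p[keep] ** 2 / row[keep, None]).sum()
    baseline = (col ** 2).sum()
    return (within - baseline) / (1.0 - baseline)


def collapse(table, row_labels, col_labels):
    """Sum a documents-by-terms table into a
    (document clusters x term clusters) table."""
    table = np.asarray(table, dtype=float)
    r = np.asarray(row_labels)
    c = np.asarray(col_labels)
    out = np.zeros((r.max() + 1, c.max() + 1))
    # Accumulate every cell into the block given by its row/column cluster.
    np.add.at(out, (r[:, None], c[None, :]), table)
    return out


# Hypothetical toy data: 4 documents x 4 terms with two obvious blocks.
docs_by_terms = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
])

good = collapse(docs_by_terms, [0, 0, 1, 1], [0, 0, 1, 1])
poor = collapse(docs_by_terms, [0, 1, 0, 1], [0, 0, 1, 1])

tau_good = goodman_kruskal_tau(good)
tau_poor = goodman_kruskal_tau(poor)
print(tau_good, tau_poor)
```

In this toy setting, a partition that matches the block structure yields a higher tau than one that mixes the blocks, which is the sort of signal an optimization procedure could exploit when searching over row and column partitions.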

Clustering of documents from a two-way viewpoint / Balbi, Simona; Miele, Raffaele; Scepi, Germana. - PRINT. - 1:(2010), pp. 27-36. (JADT 2010. 10th International Conference on Statistical Analysis of Textual Data, Rome, 9-11 June 2010).