Clustering of documents from a two-way viewpoint

Balbi, Simona; Miele, Raffaele; Scepi, Germana
2010

Abstract

Methods for clustering high-dimensional data are a prolific research area in data mining and have produced a large number of tentative solutions. In text mining and in the analysis of gene expression data, the idea of two-way (bidimensional) clustering arose: finding clusters of documents characterized by clusters of terms (and, analogously, clusters of genes and clusters of experimental conditions). Although we are often more interested in clustering only one way of the data structure, co-clustering appears convenient from both an interpretative and a computational viewpoint. Here we frame the problem in a multidimensional data analysis perspective, referring to classic association and/or prediction indexes for contingency tables. Following previous work, we propose using a predictability index, the Goodman and Kruskal tau-b, on documents-by-terms tables. After a brief review of the extensive literature on two-way clustering, developed mainly in microarray analysis, we propose a new algorithm of the genetic family, based on optimizing the predictability index tau-b. We present experimental results showing the effectiveness of our co-clustering algorithm in practice.
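As an illustration of the predictability index named in the abstract, here is a minimal Python sketch of the Goodman and Kruskal tau for a contingency table. It uses the standard textbook definition (proportional reduction in error when predicting the column category from the row category). The abstract does not state which direction of prediction the authors use, nor how their genetic algorithm encodes or evolves candidate partitions, so the direction chosen here, the function names `goodman_kruskal_tau` and `aggregate_blocks`, and the toy data are all assumptions made for illustration.

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau for predicting the column category from the row category.

    `table` is a list of rows of non-negative counts (a contingency table).
    Assumption: columns are the predicted variable; the abstract does not say
    which direction the paper uses.
    """
    n = sum(sum(row) for row in table)
    if n <= 0:
        raise ValueError("table must contain at least one observation")

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Sum of squared column marginal proportions.
    col_sq = sum((c / n) ** 2 for c in col_totals)
    if col_sq >= 1.0:
        raise ValueError("column variable is constant; tau is undefined")

    # Sum over cells of p_ij^2 / p_i., skipping empty rows.
    conditional = sum(
        (x / n) ** 2 / (r / n)
        for row, r in zip(table, row_totals)
        if r > 0
        for x in row
    )
    return (conditional - col_sq) / (1.0 - col_sq)


def aggregate_blocks(doc_term, doc_labels, term_labels, n_doc_clusters, n_term_clusters):
    """Collapse a documents-by-terms count table into a cluster-by-cluster table.

    `doc_labels[i]` and `term_labels[j]` are hypothetical cluster indices for
    document i and term j (a candidate co-clustering).
    """
    blocks = [[0] * n_term_clusters for _ in range(n_doc_clusters)]
    for row, g in zip(doc_term, doc_labels):
        for count, h in zip(row, term_labels):
            blocks[g][h] += count
    return blocks


# Toy documents-by-terms table (invented counts, not data from the paper).
doc_term = [
    [3, 2, 0, 0],
    [4, 1, 0, 1],
    [0, 0, 5, 2],
    [0, 1, 3, 4],
]
candidate = aggregate_blocks(doc_term, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
score = goodman_kruskal_tau(candidate)
print(candidate, score)
```

Under these assumptions, a search procedure such as a genetic algorithm could treat `score` as the fitness of a candidate pair of partitions: tau is 0 when document clusters carry no information about term clusters and 1 when they predict them perfectly.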
ISBN: 9788879164504
Use this identifier to cite or link to this document: https://hdl.handle.net/11588/369927