Document Clustering is the application of cluster analysis methods to very large document databases. It aims at organizing a large quantity of unlabelled documents into a smaller number of meaningful and coherent clusters whose members are similar in content. One of the main unsolved problems in the clustering literature is the lack of a reliable methodology for evaluating results, although a wide variety of validation measures have been proposed. These measures are often unsatisfactory even on numerical databases, and they perform worse still in Document Clustering. This paper proposes a new validation measure. After introducing the most common approaches to Document Clustering, we focus on Spherical K-means, due to its close connection with the Vector Space Model typical of Information Retrieval. Since Spherical K-means adopts a cosine-based similarity measure, we propose a validation measure based on the same criterion. The effectiveness of the new measure is shown in a comparative study involving 13 different corpora (commonly used in the literature for comparing different proposals) and 15 validation measures.
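
As an illustration of the cosine criterion mentioned in the abstract, the following is a minimal sketch of Spherical K-means on a toy document-term matrix. It is not the validation measure proposed in the paper, whose definition is not given in the abstract. The function names (`normalize_rows`, `spherical_kmeans`), the toy matrix `X`, and the initialization strategy are assumptions made for this example only.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the rows of X by cosine similarity (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = normalize_rows(np.asarray(X, dtype=float))
    # Assumption: start from k distinct documents chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # For unit vectors, the dot product equals the cosine similarity.
        labels = (X @ centroids.T).argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                # New centroid: the members' sum, projected back onto the unit sphere.
                centroids[j] = normalize_rows(members.sum(axis=0, keepdims=True))[0]
    labels = (X @ centroids.T).argmax(axis=1)
    return labels, centroids

# Toy document-term matrix: rows are documents, columns are term weights.
X = np.array([
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 2, 5],
])
labels, centroids = spherical_kmeans(X, k=2)
print(labels)
```

Because every document and centroid is scaled to unit length, documents 0 and 1 (same direction, different length) are treated as identical, which is the property that ties the method to the Vector Space Model.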

A cosine based validation measure for Document Clustering / Balbi, Simona; Misuraca, Michelangelo; Spano, Maria. - 1:(2016), pp. 65-74. (Paper presented at the conference JADT2016 International Conference on Statistical Analysis of Textual Data, held in Nice, 7-10 June 2016).

A cosine based validation measure for Document Clustering

BALBI, SIMONA; SPANO, MARIA
2016

ISBN: 978-2-7466-9067-7
Files in this item:
jadt2016.pdf (open access)
Description: Article
Type: Post-print document
License: Public domain
Size: 224.78 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/661760