Choosing a proper metrics for textual analysis / Balbi, Simona. - (2007). (Paper presented at the CARME 2007 conference, held in Rotterdam (NL), 25-27 June 2007).

Choosing a proper metrics for textual analysis

BALBI, SIMONA
2007

Abstract

Defining good strategies for Text Retrieval has become increasingly important in recent years. In the literature, statistical approaches are frequently mentioned (perhaps more often than they are used), in an “exploratory data analysis state of mind” (e.g. Greiff, 1998). One crucial point is understanding and measuring the specificity of each term, in order to obtain a means of measuring the relevance of a document with respect to a query. Many different solutions have been proposed. In Vector Space Models (Salton, 1983) the weighting step considers the term frequency, the collection frequency and a length normalization factor. In the canonical form of Latent Semantic Indexing (Deerwester et al., 1990) the weights are given by raw term frequencies. In Latent Semantic Correspondence Indexing (Balbi & Di Meglio, 2004) a weight related to the chi-square metric has been suggested. The aim of this paper is to propose, with the so-called “general analysis in any distances and any criteria” in mind, a distance and a criterion related to term relevance and specificity, in order to reconsider the weighting system in a Lexical Correspondence Analysis scheme.
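As a rough illustration of two of the existing weighting ideas named in the abstract, the sketch below computes tf-idf weights with length normalization (in the spirit of Vector Space Models) and a chi-square distance between document profiles (the metric underlying correspondence analysis). It is not the distance or criterion proposed in the paper, which the abstract does not specify. The toy counts, the function names `tfidf_weights` and `chi_square_distance`, and the particular idf formula log(N / df) are assumptions made for this example only.

```python
import math

# Toy term-by-document count matrix (rows: terms, columns: documents).
# The counts are invented purely for illustration.
COUNTS = [
    [2, 0, 1],  # term "retrieval"
    [1, 1, 1],  # term "text"
    [0, 3, 0],  # term "metric"
]


def tfidf_weights(counts):
    """Return tf * idf weights with each document column scaled to unit length.

    Assumes every term occurs in at least one document.
    """
    n_terms, n_docs = len(counts), len(counts[0])
    # Collection frequency component (assumed form): idf = log(N / df).
    idf = [math.log(n_docs / sum(1 for c in row if c > 0)) for row in counts]
    weighted = [[c * w for c in row] for row, w in zip(counts, idf)]
    # Length normalization: divide each column by its Euclidean norm
    # (an all-zero column is left unchanged).
    norms = [
        math.sqrt(sum(weighted[i][j] ** 2 for i in range(n_terms))) or 1.0
        for j in range(n_docs)
    ]
    return [[weighted[i][j] / norms[j] for j in range(n_docs)]
            for i in range(n_terms)]


def chi_square_distance(counts, j, k):
    """Chi-square distance between the column profiles of documents j and k.

    Each squared difference between profiles is divided by the term's
    overall relative frequency (its row mass), so rare terms count more.
    Assumes no term and no document has a zero total.
    """
    total = sum(sum(row) for row in counts)
    col_j = sum(row[j] for row in counts)
    col_k = sum(row[k] for row in counts)
    dist2 = 0.0
    for row in counts:
        row_mass = sum(row) / total
        diff = row[j] / col_j - row[k] / col_k
        dist2 += diff ** 2 / row_mass
    return math.sqrt(dist2)


if __name__ == "__main__":
    print(tfidf_weights(COUNTS))
    print(chi_square_distance(COUNTS, 0, 1))
```

In this toy matrix the term "text" occurs in every document, so its idf (and hence its tf-idf weight) is zero, whereas the chi-square distance still uses it, scaled by its overall frequency. That contrast is one way to see why the choice of metric changes which terms count as specific.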

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/322374