The aim of this paper is to propose a Text Mining strategy based on statistical tools that make the extraction of information buried in massive quantities of documents more efficient. Usually, in Text Mining procedures (as in textual data analysis) we deal with a corpus consisting of a set of documents. To build the data structure to be processed, each document is encoded as a document vector according to the bag-of-words model, which associates words with their frequencies in the given document. Documents are thus considered as a whole. The proposed mining strategy instead identifies the interesting sentences in the corpus, on which knowledge extraction is then concentrated. Which sentences are of interest depends on the researcher's objective. The procedure is useful when we are interested in the local contexts of words. Prior information, i.e. expert knowledge, is included as an input to the procedure but, unlike in content analysis, the keyword system is built automatically. The strategy can be applied whenever information is available for partitioning documents into lower-order grammatical units (e.g. sentences, but also paragraphs). The mining procedure consists of two steps: first, Text Categorisation, i.e. the recognition of the interesting sentences by means of a statistical segmentation procedure; then, knowledge extraction from the identified sub-texts. The first step also produces association rules that are useful in filtering e-mail, chat, or Web access. The paper aims to contribute to the growing literature on Text Mining that goes beyond the "bag-of-words" model of structuring the data set as document vectors, enhancing the role of a statistical perspective. The paper ends with an application to Italian on-line job offers, which shows the effectiveness of the proposal.
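As a rough illustration of the contrast the abstract draws between whole-document vectors and sentence-level units, the sketch below builds a bag-of-words vector for one document and then for each of its sentences. It is not the authors' procedure: the sample text, the function names `split_sentences` and `bag_of_words`, and the naive punctuation-based sentence splitter are all assumptions made for this example.

```python
import re
from collections import Counter


def split_sentences(document):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def bag_of_words(text):
    """Map each lowercased word to its frequency in the given text."""
    return Counter(re.findall(r"\w+", text.lower()))


# Hypothetical toy document, invented for illustration only.
document = (
    "We seek a Java developer. "
    "Send your CV by e-mail. "
    "Java experience is required."
)

# Whole-document encoding: one vector, sentence boundaries are lost.
doc_vector = bag_of_words(document)

# Sentence-level encoding: one vector per lower-order unit,
# so each word keeps its local context.
sentence_vectors = [bag_of_words(s) for s in split_sentences(document)]
```

In the document vector the word "java" simply has frequency 2, while the sentence-level vectors show which sentences it occurs in. That is the kind of local information a sentence-oriented strategy could exploit.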

A text mining strategy based on local contexts of words / Balbi, Simona; Di Meglio, E. - Print. - 1:(2004), pp. 79-87. (JADT 2004, 7th International Conference on the Statistical Analysis of Textual Data, Louvain-la-Neuve, Belgium, 10-12 March 2004).

A text mining strategy based on local contexts of words

BALBI, SIMONA
2004

ISBN: 9782930344492
Files in this record:
File Size Format
a text mining strategy based on local context of words.pdf

not available

Type: Post-print document
License: Private/restricted access
Size: 133.25 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/116708