A text mining strategy based on local contexts of words

Balbi, Simona; Di Meglio, E.

The aim of this paper is to propose a Text Mining strategy based on statistical tools that makes the extraction of information buried in massive quantities of documents more efficient. Usually, in Text Mining procedures (as in textual data analysis), we deal with a corpus consisting of a set of documents. To build the data structure to be processed, each document is encoded as a document vector according to the bag-of-words model, which associates words with their frequencies in the given document. Documents are thus considered as a whole. The proposed mining strategy instead identifies the interesting sentences in the corpus, on which knowledge extraction is then concentrated. What makes a sentence interesting depends on the researcher’s objective. The procedure is useful when we are interested in the local contexts of words. Prior information, i.e. expert knowledge, is included as an input to the procedure but, unlike in content analysis, the keyword system is built automatically. The strategy can be applied whenever information is available for partitioning documents into lower-order grammatical units (e.g. sentences, but also paragraphs). The mining procedure consists of two steps: first, Text Categorisation, i.e. the recognition of the interesting sentences by means of a statistical segmentation procedure; then, knowledge extraction from the identified sub-texts. The first step also produces association rules that are useful for filtering e-mail, chat, or Web access. The paper aims to contribute to the steadily growing literature on Text Mining that goes beyond the "bag-of-words" model of structuring the data set as document vectors, enhancing the role of a statistical perspective. An application to Italian on-line job offers concludes the paper and shows the effectiveness of the proposal.
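The contrast between whole-document vectors and sentence-level units described in the abstract can be illustrated with a minimal sketch. It is not the authors' statistical segmentation procedure: the naive sentence splitter, the toy English corpus, the helper names (`tokenize`, `split_sentences`, `bag_of_words`, `sentence_vectors`) and the seed word `required` are all assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(text):
    """Map each word to its frequency in the given text."""
    return Counter(tokenize(text))


def sentence_vectors(corpus):
    """Return one (document index, sentence, word counts) triple per sentence."""
    rows = []
    for doc_id, doc in enumerate(corpus):
        for sentence in split_sentences(doc):
            rows.append((doc_id, sentence, bag_of_words(sentence)))
    return rows


# Toy corpus, invented for illustration (the paper's application uses Italian job offers).
corpus = [
    "We seek a junior programmer. Knowledge of Java is required. The office is in Naples.",
    "Sales manager wanted. Experience in retail is required.",
]

# Classical encoding: one bag-of-words vector per document.
doc_vectors = [bag_of_words(doc) for doc in corpus]

# Lower-order units: one bag-of-words vector per sentence.
rows = sentence_vectors(corpus)

# Hypothetical selection rule: keep sentences containing the seed word "required".
selected = [sentence for _, sentence, counts in rows if counts["required"] > 0]
```

Working at sentence level keeps the local context of each word, which the document-level vector discards; in the paper the selection of interesting sentences is made by a statistical segmentation procedure, not by a fixed seed word as in this sketch.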

A text mining strategy based on local contexts of words / Balbi, S.; Di Meglio, E. - Print. - 1:(2004), pp. 79-87. (JADT 2004, 7th International Conference on the Statistical Analysis of Textual Data, Louvain-la-Neuve, Belgium, 10-12 March 2004).