In Software Maintenance, defining effective approaches to partition a software system into meaningful subsystems is a longstanding and relevant research topic. Such techniques can significantly support maintainers in their tasks by grouping related entities of a large system into smaller, easier-to-comprehend subsystems. In this paper we investigate the effectiveness of combining information retrieval and machine learning techniques to exploit the lexical information provided by programmers for software clustering. In particular, unlike related work, we employ indexing techniques to explore the contribution of the combined use of six different dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information: class names, attribute names, method names, parameter names, comments, and source code statements. Moreover, the relevance of each dictionary is estimated on the basis of the project characteristics, by applying a machine learning approach based on a probabilistic model and on the Expectation-Maximization algorithm. To group source files accordingly, we compare two clustering algorithms, K-Medoids and Group Average Agglomerative Clustering, on a dataset of 9 open source Java software systems.
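The idea of combining several dictionaries before clustering can be illustrated with a minimal Python sketch. It is not the authors' implementation: the zone weights are fixed by hand here (the paper estimates them with Expectation-Maximization), raw term counts stand in for whatever indexing scheme the paper uses, the merge criterion is one common formulation of group-average clustering, and the file names and tokens are hypothetical.

```python
import math
from collections import Counter

# The six lexical zones named in the abstract (identifiers are illustrative).
ZONES = ("class", "attribute", "method", "parameter", "comment", "statement")

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def similarity(f1, f2, weights):
    """Weighted sum of per-zone cosine similarities between two files."""
    return sum(
        weights[z] * cosine(Counter(f1.get(z, [])), Counter(f2.get(z, [])))
        for z in ZONES
    )

def group_average_clustering(files, weights, k):
    """Merge clusters bottom-up until k remain.

    Assumption: each step merges the pair whose union has the highest
    average similarity over all pairs of distinct files it contains.
    """
    clusters = [[name] for name in sorted(files)]

    def merged_avg(c1, c2):
        merged = c1 + c2
        pairs = [(a, b) for i, a in enumerate(merged) for b in merged[i + 1:]]
        total = sum(similarity(files[a], files[b], weights) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        candidates = ((i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        i, j = max(candidates,
                   key=lambda p: merged_avg(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy input: tokens already extracted per zone for each file.
files = {
    "OrderDao.java": {"class": ["order", "dao"], "method": ["save", "order"],
                      "comment": ["persist", "order"]},
    "OrderRepo.java": {"class": ["order", "repo"], "method": ["find", "order"],
                       "comment": ["load", "order"]},
    "HttpServer.java": {"class": ["http", "server"],
                        "method": ["handle", "request"],
                        "comment": ["serve", "http", "request"]},
    "HttpClient.java": {"class": ["http", "client"],
                        "method": ["send", "request"],
                        "comment": ["http", "request"]},
}

# Hand-set uniform weights; the paper learns these per project instead.
weights = {z: 1.0 / len(ZONES) for z in ZONES}

clusters = group_average_clustering(files, weights, k=2)
print(clusters)
```

On this toy input the two `Http*` files and the two `Order*` files end up in separate clusters, because files in different groups share no tokens in any zone. Raising the weight of one zone, say comments, would let that dictionary dominate the similarity, which is the kind of effect the learned weights are meant to control.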

Combining Machine Learning and Information Retrieval Techniques for Software Clustering / Corazza, Anna; Di Martino, Sergio; Maggio, Valerio; Scanniello, Giuseppe. - 255:(2012), pp. 42-60. (Paper presented at the 1st International Workshop on Eternal Systems, EternalS 2011, held in Budapest, Hungary, on 3 May 2011) [10.1007/978-3-642-28033-7_5].

Combining Machine Learning and Information Retrieval Techniques for Software Clustering

Corazza, Anna; Di Martino, Sergio; Maggio, Valerio
2012

ISBN: 9783642280320

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/404476