Word Embeddings for Comment Coherence

Cimasa, Alfonso; Corazza, Anna; Coviello, Carmen; Scanniello, Giuseppe

Information in source code comments and identifier names represents a valuable resource for programmers who maintain and evolve software. As a software system evolves, the information in comments can become misaligned with the corresponding source code, which hampers software evolution and maintenance tasks. This kind of misalignment is known as lack of coherence and can arise for several reasons, e.g., programmers modify the intent of source code while executing a maintenance task without updating its comment accordingly. We study the problem of detecting lack of coherence between comments and source code by exploiting Word Embeddings (WEs), a tool that has been shown to be very effective in natural language processing. We introduce four models based on WEs and test them using six different WE variants. These models and WEs have been empirically assessed through an experiment conducted on a publicly available dataset and compared with a baseline approach. The results indicate that, while maintaining performance very close to the baseline, the considered models and WE variants are more efficient in terms of execution time. The explanation for such an improvement is that WEs are able to concentrate the important information in a much more compact representation of the input. This represents one of the most important take-away lessons from our experiment.

Word Embeddings for Comment Coherence / Cimasa, A., Corazza, A., Coviello, C., Scanniello, G. - (2019), pp. 244-251. (45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Kallithea-Chalkidiki, Greece, 28-30 August 2019).