Rotated canonical correlation analysis for multilingual corpora

Balbi, Simona; Misuraca, M.

This paper proposes the joint use of Canonical Correlation Analysis and Procrustes Rotations (RCA) for dealing with a text and its translation into another language. The basic idea is to represent the words of the two natural languages on a common reference space. The main characteristic of this space is that it is language independent, even though the Procrustes Rotation is asymmetric: it transforms the lexical table derived from the translation by minimizing its distance from the lexical table of the original corpus, whereas the subsequent Canonical Correlation Analysis treats the two word sets symmetrically. The most interesting feature of RCA is that it builds a unique reference space for representing the correlation structure in the data, inducing the two systems of canonical factors to lie on the same space. The resulting graphical representations enable us to read the distances between corresponding points in terms of different ways of translating the same word, in relation to the general context defined by the canonical variates. Understanding the distances between matched points could be a useful tool for enriching lexical resources in a translation procedure. In this paper we compare the most frequent content-bearing words in the two languages, analyzing one year (2003) of Le Monde Diplomatique and its Italian edition.
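The two steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes, purely for illustration, that the two lexical tables have the same shape (rows are text units, columns are matched word pairs), it uses synthetic counts in place of the Le Monde Diplomatique data, and the function names `procrustes_rotation` and `cca` are hypothetical.

```python
import numpy as np

def procrustes_rotation(Y, X):
    """Orthogonal matrix R minimizing ||Y @ R - X||_F (X and Y must share a shape)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

def cca(X, Y):
    """Canonical correlations and weights via QR + SVD.

    Assumes both centered tables have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    A = np.linalg.solve(Rx, U)      # weights for the original word set
    B = np.linalg.solve(Ry, Vt.T)   # weights for the translated word set
    return np.clip(s, 0.0, 1.0), A, B

# Synthetic stand-in for the two lexical tables (assumption: 40 text units, 5 matched words).
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(40, 5)).astype(float)        # "original" table
Q0, _ = np.linalg.qr(rng.normal(size=(5, 5)))           # arbitrary orthogonal distortion
Y = X @ Q0 + 0.1 * rng.normal(size=(40, 5))             # "translation" table

# Step 1: rotate the translation table towards the original one (asymmetric step).
R = procrustes_rotation(Y, X)
Y_rot = Y @ R
dist_before = np.linalg.norm(Y - X)
dist_after = np.linalg.norm(Y_rot - X)

# Step 2: symmetric canonical correlation analysis on the original and rotated tables.
rho, A, B = cca(X, Y_rot)
X_scores = (X - X.mean(axis=0)) @ A
Y_scores = (Y_rot - Y_rot.mean(axis=0)) @ B
```

In this sketch, `X_scores` and `Y_scores` hold the canonical variates of the text units for the two word sets. How the words themselves are projected onto the common space, and how the lexical tables are actually constructed, is not specified in the abstract and is left out here.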

Rotated canonical correlation analysis for multilingual corpora / Balbi, Simona; Misuraca, M. - PRINT. - 1:(2006), pp. 99-106. (JADT'06 8e Journées internationales d'analyse statistique des données textuelles, Besançon (F), 19-21 April 2006).