Rotated canonical correlation analysis for multilingual corpora

BALBI, SIMONA
2006

Abstract

This paper proposes the joint use of Canonical Correlation Analysis and Procrustes Rotations (RCA, rotated canonical correlation analysis) for a text and its translation into another language. The basic idea is to represent words from the two natural languages in a common reference space. This space is meant to be language independent, although the two steps treat the corpora differently: the Procrustes Rotation transforms the lexical table derived from the translation by minimizing its distance from the lexical table of the original corpus, whereas the subsequent Canonical Correlation Analysis treats the two word sets symmetrically. The most interesting feature of RCA is that it builds a single reference space for representing the correlation structure in the data, so that the two systems of canonical factors lie in the same space. The resulting graphical representations enable us to read the distances between corresponding points as different ways of translating the same word, relative to the general context defined by the canonical variates. Interpreting the distances between matched points could be a useful tool for enriching lexical resources in a translation procedure. We compare the most frequent content-bearing words in the two languages, analyzing one year (2003) of Le Monde Diplomatique and its Italian edition.
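The two steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `procrustes_rotation` and `canonical_analysis`, the equal table shapes, the restriction to an orthogonal rotation, and the toy data are all assumptions made for illustration. Rows stand for text fragments, columns for matched words, `X` for the lexical table of the original corpus and `Y` for the table derived from the translation.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal matrix R minimizing the Frobenius norm of X - Y @ R.

    Assumption: X and Y have the same shape (same fragments, same
    number of matched words).
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

def canonical_analysis(A, B, tol=1e-10):
    """Canonical correlations and canonical variates of two tables.

    Both tables are column-centered, an orthonormal basis of each
    column space is taken from an SVD, and the singular values of the
    cross-product of the two bases are the canonical correlations.
    """
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    Ua, sa, _ = np.linalg.svd(A, full_matrices=False)
    Ub, sb, _ = np.linalg.svd(B, full_matrices=False)
    Ua = Ua[:, sa > tol * sa[0]]  # drop numerically null directions
    Ub = Ub[:, sb > tol * sb[0]]
    P, rho, Qt = np.linalg.svd(Ua.T @ Ub)
    k = len(rho)
    return np.clip(rho, 0.0, 1.0), Ua @ P[:, :k], Ub @ Qt.T[:, :k]

# Toy data (hypothetical, not real word counts): X is an exact
# rotation of Y, so the rotation should be recovered.
rng = np.random.default_rng(0)
Y = rng.poisson(5.0, size=(12, 4)).astype(float)
Q0, _ = np.linalg.qr(rng.normal(size=(4, 4)))
X = Y @ Q0

R = procrustes_rotation(X, Y)               # step 1: rotate translation table
rho, FX, FY = canonical_analysis(X, Y @ R)  # step 2: symmetric CCA
```

Because an orthogonal rotation leaves the column space of `Y` unchanged, it does not alter the canonical correlations themselves. In this sketch its role is only to align the translation table with the original one before the two sets of variates are compared.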
ISBN: 9782848671307
File in this record: jadt2006.pdf (open access; type: post-print document; license: public domain; size: 224.42 kB; format: Adobe PDF)

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/359560