Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion

Published 20 Apr 2018 in cs.CL and cs.LG | (1804.07745v3)

Abstract: Continuous word representations learned separately on distinct languages can be aligned so that their words become comparable in a common space. Existing works typically solve a least-square regression problem to learn a rotation aligning a small bilingual lexicon, and use a retrieval criterion for inference. In this paper, we propose a unified formulation that directly optimizes a retrieval criterion in an end-to-end fashion. Our experiments on standard benchmarks show that our approach outperforms the state of the art on word translation, with the biggest improvements observed for distant language pairs such as English-Chinese.

Citations (298)

Summary

  • The paper introduces a retrieval-based loss (RCSLS) that aligns bilingual word vectors more effectively than traditional least-squares regression approaches.
  • It optimizes a convex relaxation of this loss with projected subgradient descent, yielding an average gain of 3 to 4 percentage points over the state of the art.
  • By training on the same criterion used at retrieval, the approach mitigates the hubness problem, exploits unpaired words in the vocabularies, and suggests that orthogonality constraints on the mapping can often be dropped.

An Analysis of Bilingual Word Mapping with a Retrieval Criterion

The paper "Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion" introduces an innovative approach to improving the alignment of continuous word vector representations across different languages. The primary focus is on addressing the drawbacks of previous methodologies, which typically rely on least-square regression for aligning bilingual word lexicons. The authors propose a model that optimizes a retrieval criterion directly, delivering enhanced performance on word translation tasks, especially for linguistically distant pairings such as English and Chinese.

Central Concepts and Methodology

Traditional approaches employ techniques such as orthogonal Procrustes for word mapping, solving a least-squares regression problem to learn a linear mapping between the vectors of a seed bilingual lexicon. The mapping is then applied to unobserved words, extending the lexicon. While effective to a degree, such approaches suffer from the hubness problem, where a few target vectors appear disproportionately often as nearest neighbors of many query words. This issue has prompted corrective similarity measures such as the inverted softmax (ISF) and cross-domain similarity local scaling (CSLS), applied only at the retrieval stage, which creates an inconsistency between the training objective and the inference criterion.
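To make that inconsistency concrete, the following is a minimal NumPy sketch of the conventional two-stage pipeline: an orthogonal Procrustes fit at training time, with the CSLS correction applied only at retrieval. Function and variable names are illustrative assumptions, not taken from any reference implementation.

```python
import numpy as np

def procrustes(Xs, Yt):
    """Orthogonal Procrustes: the rotation W minimizing sum_i ||W x_i - y_i||^2
    over orthogonal matrices, given a paired seed lexicon (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(Yt.T @ Xs)
    return U @ Vt

def csls_retrieve(q, Y, X, k=10):
    """CSLS retrieval for one mapped query vector q: rank targets while
    discounting 'hub' targets that are close to many mapped source vectors.

    q: (d,)   unit-normalized mapped source vector (q = W x).
    Y: (m, d) unit-normalized target embeddings.
    X: (n, d) unit-normalized mapped source embeddings (for the hub penalty).
    """
    sims = Y @ q                                    # cosine similarity of q to every target
    r_q = np.mean(np.sort(sims)[-k:])               # mean similarity of q to its k nearest targets
    r_y = np.mean(np.sort(X @ Y.T, axis=0)[-k:], axis=0)  # mean similarity of each target
                                                    # to its k nearest mapped sources
    return np.argmax(2 * sims - r_q - r_y)          # hubs (large r_y) are demoted
```

The hub penalty r_y influences only the ranking at inference; the Procrustes objective that produced W never sees it, which is precisely the train/test mismatch the paper targets.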

The authors propose to integrate the retrieval criterion into training itself, using CSLS as the loss function rather than as a posterior correction. This aligns the training and inference phases and also makes fuller use of the vocabularies, since the neighborhood terms of CSLS draw on unpaired words outside the seed lexicon. Because the nearest-neighbor sets make the criterion non-smooth, the method introduces a convex relaxation of the retrieval criterion, optimized via projected subgradient descent.
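As a schematic illustration of this idea, here is one projected subgradient step on a relaxed CSLS loss in NumPy; the full-batch handling, learning rate, and helper names are assumptions made for clarity, not the authors' reference implementation:

```python
import numpy as np

def rcsls_step(W, Xs, Yt, lr=1.0, k=10, constrain=True):
    """One subgradient step on a relaxed CSLS loss over a paired lexicon.

    W:  (d, d) current mapping; Xs, Yt: (n, d) unit-normalized paired vectors.
    With the nearest-neighbor sets held fixed at the current W, the loss is
    linear in W, so the gradient below is a valid subgradient.
    """
    n = Xs.shape[0]
    S = (Xs @ W.T) @ Yt.T                          # (n, n) mapped-source / target similarities
    nn_t = np.argsort(-S, axis=1)[:, :k]           # k nearest targets of each mapped source
    nn_s = np.argsort(-S, axis=0)[:k, :]           # k nearest mapped sources of each target

    grad = -2.0 * Yt.T @ Xs                        # from the -2 y_i^T W x_i alignment term
    for i in range(n):
        grad += np.outer(Yt[nn_t[i]].sum(0), Xs[i]) / k      # target-side neighborhood term
        grad += np.outer(Yt[i], Xs[nn_s[:, i]].sum(0)) / k   # source-side neighborhood term
    grad /= n

    W = W - lr * grad
    if constrain:
        # project onto the convex hull of orthogonal matrices (the unit
        # spectral-norm ball) by clipping singular values at 1
        U, s, Vt = np.linalg.svd(W)
        W = U @ np.diag(np.minimum(s, 1.0)) @ Vt
    return W
```

The `constrain` flag mirrors a finding discussed below: dropping the projection, i.e. leaving W unconstrained, often performs as well or better.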

Experimental Evaluation and Results

The experimental results demonstrate consistent improvements over state-of-the-art methods across numerous language pairs, with particular efficacy on challenging cases such as English-Chinese translation. Comparative analyses with other bilingual mapping methods show an average performance gain of 3 to 4 percentage points. Another notable finding is that while traditional models preserve vector distances through orthogonality constraints, the proposed RCSLS (relaxed CSLS) model often delivers superior results when left unconstrained. This challenges the conventional wisdom that the mapping matrix must preserve Euclidean properties.

The proposed model also remains robust when paired with less precise word representations or noisy lexicons, highlighting its adaptability and scalability. Furthermore, the framework captures semantic consistency across mappings, as evidenced by its performance on word analogy tasks across several languages.

Implications and Future Directions

The work presented by Joulin et al. contributes significantly to the field of machine translation and cross-lingual representation learning. By aligning the training and inference stages through a consistent retrieval-oriented loss, the model streamlines the alignment process, making it more effective and theoretically sound. The study paves the way for future exploration into non-orthogonal mappings and deeper integrations of unsupervised data in bilingual systems.

The introduction of convex relaxations in conjunction with retrieval criteria opens up promising avenues for developing translation models that are more resilient to variations in linguistic structure and vocabulary. As AI systems increasingly interact with diverse linguistic environments, the principles outlined in this research could inform the development of more robust, adaptable machine translation systems.

In conclusion, this paper offers a substantial advance in bilingual word mapping, not only improving existing methodologies but also setting a foundational framework for future explorations into efficient, scalable multilingual language processing.
