Multilingual Bi-Encoders

Updated 6 January 2026

Multilingual bi-encoders are neural architectures that embed inputs from various languages into a shared semantic space using dual encoders.
They leverage contrastive learning and Transformer models like mBERT/XLM-RoBERTa to align cross-language representations effectively.
This design decouples offline candidate encoding from runtime query processing, enabling efficient, scalable cross-modal and cross-lingual retrieval.

A multilingual bi-encoder is a neural architecture consisting of two independently operating encoders, each of which maps its input (e.g., query and candidate text, image and caption, or speech input) to a fixed-dimensional vector in a shared semantic space. Unlike cross-encoder designs, which process each query-candidate pair jointly at query time, bi-encoders precompute candidate representations offline and compare them to queries at runtime using a similarity metric such as dot product or cosine similarity. In the multilingual setting, a single bi-encoder is tasked with embedding inputs from multiple languages (and sometimes multiple modalities) so that semantically corresponding items, across languages and possibly modalities, occupy proximate locations in the latent space, enabling rapid and robust cross-lingual retrieval and matching.

1. Core Multilingual Bi-Encoder Architectures

The canonical architecture employs two towers (“Siamese structure”), each typically instantiated by a Transformer variant pretrained on multilingual corpora (e.g., mBERT, XLM-RoBERTa). Parameter sharing between towers is customary, though some approaches use independent encoders for source and target sides. The input is tokenized by a shared subword vocabulary and embedded, usually with positional and segment embeddings for textual inputs (Lavi, 2021, Bruyn et al., 2021, Yang et al., 2019, Hu et al., 2020).

Pooling strategies for final embeddings include using the [CLS] token, average pooling over token-level outputs, or more sophisticated projections via linear or multilayer heads. For multimodal retrieval (e.g., image-text, speech-text), domain-specific encoder architectures are employed—such as Vision Transformers for images or Whisper/MMS for audio—followed by adapter layers and task-specific fusion mechanisms (Xue et al., 2024, Guo et al., 2024).
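The two simplest pooling strategies above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the encoder output is a list of per-token vectors plus an attention mask marking padding; the function names (`mean_pool`, `cls_pool`, `l2_normalize`) and the toy numbers are hypothetical and not taken from any cited system.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [x / count for x in total]

def cls_pool(token_vectors):
    """Use the first position (the [CLS] token) as the sentence embedding."""
    return list(token_vectors[0])

def l2_normalize(vec):
    """Scale to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Toy encoder output: 4 positions x 3 dimensions; the last position is padding.
tokens = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0], [2.0, 2.0, 1.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vec = l2_normalize(mean_pool(tokens, mask))
```

Masking matters for mean pooling: without it, padding vectors would leak into the average and make embeddings depend on batch padding length.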

A representative text-only multilingual bi-encoder pipeline:

Stage	Example Component	Dimensions
Input	Subword tokenizer (WordPiece for mBERT, SentencePiece for XLM-R)	n token IDs
Encoding	12-Layer Transformer	ℝⁿ×768
Pooling	[CLS], avg/mean	ℝ⁷⁶⁸ or ℝ⁵⁰⁰
Projection	Linear/MLP (optional)	ℝᵈ
Similarity Scoring	Dot/cosine	Scalar

In multimodal cases, pools of encoder outputs for different input types are gated and fused according to language- or modality-specific selectors (Xue et al., 2024, Guo et al., 2024).

2. Training Objectives and Contrastive Losses

Bi-encoders are nearly always trained via contrastive loss functions that drive the embeddings for positive pairs (translation pairs, FAQ pairs, CV-vacancy matches, etc.) to be more similar than those for negative pairs. The dominant formulation is the in-batch softmax or InfoNCE loss (Lavi, 2021, Bruyn et al., 2021, Yang et al., 2019, Wu et al., 2020, Hu et al., 2020):

$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(\text{sim}(e_i^A, e_i^B)/\tau)}{\sum_{j=1}^N \exp(\text{sim}(e_i^A, e_j^B)/\tau)}$

where $e_i^A$, $e_i^B$ are the encoded vectors of pair $i$, $\tau$ is a temperature, and negatives are drawn from all other batch candidates. In addition, additive margin softmax (Yang et al., 2019) and explicit alignment losses at both sentence and token granularity (AMBER: $\mathcal{L}_{\text{sent}}$, $\mathcal{L}_{\text{tok}}$) have improved stability and discriminative power in cross-lingual settings (Hu et al., 2020).
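The in-batch loss can be made concrete with a small sketch. The version below uses plain Python lists and a dot-product similarity purely for readability; a real training loop would use a tensor library with automatic differentiation. The optional `margin` argument is an assumption about how an additive margin could be combined with the temperature (subtracted from the positive pair's similarity before scaling), not a detail confirmed by the cited papers.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(emb_a, emb_b, tau=0.05, margin=0.0):
    """In-batch softmax loss: row i of emb_a should match row i of emb_b.

    All other rows of emb_b act as negatives for row i of emb_a.
    """
    n = len(emb_a)
    loss = 0.0
    for i in range(n):
        logits = []
        for j in range(n):
            sim = dot(emb_a[i], emb_b[j])
            if j == i:
                sim -= margin  # additive margin on the positive pair only
            logits.append(sim / tau)
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss -= logits[i] - log_denominator
    return loss / n

# Toy batch of two pairs with unit-length embeddings.
aligned_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0]]
swapped_b = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned_a, aligned_b)   # close to zero
loss_swapped = info_nce(aligned_a, swapped_b)   # large: positives look like negatives
```

When all embeddings are identical the loss equals $\log N$, which is a useful sanity check for an untrained model.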

Negative sampling strategies include in-batch negatives (computationally efficient), hard negatives sampled from top BM25/FAISS candidates, and batching semantically similar items (e.g., all FAQ pairs from one page) to stress fine distinctions (Bruyn et al., 2021, Zhang et al., 2022). In multimodal settings, grouped InfoNCE with distributed aggregation is used to scale training to billions of examples (Guo et al., 2024).
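As a rough illustration of hard-negative mining, the sketch below scores every candidate against a query and keeps the highest-scoring ones that are not known positives. The helper name `mine_hard_negatives` and the toy vectors are hypothetical; in practice the candidate scores would come from a first-stage retriever such as BM25 or a dense index rather than an exhaustive loop.

```python
def mine_hard_negatives(query_vec, candidate_vecs, positive_ids, k=2):
    """Return indices of the k highest-scoring candidates that are not positives."""
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(candidate_vecs)
        if idx not in positive_ids
    ]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],   # index 0: the true match (positive)
    [0.9, 0.1],   # index 1: near miss, a useful hard negative
    [0.0, 1.0],   # index 2: easy negative
    [0.5, 0.5],   # index 3: moderately hard negative
]
hard_negatives = mine_hard_negatives(query, candidates, positive_ids={0}, k=2)
```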

3. Multilingual Alignment, Data Handling, and Cross-Lingual Transfer

Multilingual bi-encoders derive cross-lingual generalization primarily from Transformer backbones pretrained on joint multilingual corpora with shared subword vocabularies (Lavi, 2021, Zhang et al., 2022, Yang et al., 2019). This produces anchor points for entity names and structural tokens, facilitating transfer even for low-resource languages or differing scripts. Fine-tuning on multilingual parallel data further aligns embeddings across languages; explicit alignment losses (e.g., AMBER—sentence-level and token-level) result in additional gains for sequence tagging and retrieval, most pronounced in low-resource languages (Hu et al., 2020, Vulić et al., 2022).

Cross-lingual transfer is empirically robust: fine-tuning on any language improves target-language retrieval, and joint multilingual fine-tuning reliably outperforms monolingual approaches except in certain high-resource cases (Bruyn et al., 2021, Zhang et al., 2022). Multilingual bi-encoders also permit efficient handling of code-switched input and of entity-rich retrieval, where the same representation must serve many scripts or domains.

The table below compares monolingual and multilingual bi-encoder models on FAQ retrieval (Bruyn et al., 2021):

Language	Monolingual MRR	Multilingual MRR
English	82.9	82.5
German	81.1	81.3
Spanish	78.0	81.7
French	71.0	80.7
Dutch	70.4	81.2
Russian	71.6	82.1
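For readers unfamiliar with the metric, mean reciprocal rank (MRR) averages the reciprocal of the rank at which the correct answer appears for each query. The sketch below is a generic implementation with made-up rankings; it assumes the table reports MRR scaled to 0–100, and it does not reproduce the evaluation code behind the table.

```python
def mean_reciprocal_rank(ranked_lists, gold_ids):
    """Average of 1/rank of the gold item per query; contributes 0 if missing."""
    total = 0.0
    for ranking, gold in zip(ranked_lists, gold_ids):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(gold_ids)

# Three toy queries whose correct answer is "a", ranked 1st, 2nd, and 3rd.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, gold)  # (1 + 1/2 + 1/3) / 3
```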

4. Scalability, Retrieval, and Practical Deployment

The bi-encoder design decouples query-time inference from candidate encoding: all candidates (documents, answers, jobs, images) are embedded and indexed offline. At runtime only the query is encoded, and nearest-neighbor search over a vector index (e.g., FAISS with HNSW or IVF-PQ indexes) retrieves the top-k matches (Lavi, 2021, Guo et al., 2024, Yang et al., 2019). With approximate search, this pipeline can achieve sub-millisecond per-query retrieval at scale (10⁶–10⁷ candidates), with candidate encoding batched offline and queries served on CPU.
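The split between offline indexing and runtime querying can be sketched with an exact, brute-force index. This is only a stand-in for an approximate nearest-neighbor library: the class `BruteForceIndex`, the item identifiers, and the two-dimensional vectors are all hypothetical, and a linear scan like this would not reach the latencies quoted above at millions of candidates.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

class BruteForceIndex:
    """Exact nearest-neighbor search over stored candidate embeddings."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vec):
        # Offline step: each candidate is embedded once and stored.
        self.ids.append(item_id)
        self.vectors.append(normalize(vec))

    def search(self, query_vec, k=2):
        # Runtime step: only the query is encoded, then scored against the index.
        q = normalize(query_vec)
        scored = (
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in zip(self.ids, self.vectors)
        )
        return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]

index = BruteForceIndex()
index.add("faq_en", [1.0, 0.1])   # pretend embeddings of candidates
index.add("faq_de", [0.9, 0.2])   # in different languages
index.add("faq_fr", [0.0, 1.0])
top = index.search([1.0, 0.0], k=2)
```

Swapping the linear scan for an approximate index changes only the `search` internals; the offline/online split stays the same.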

Grouped aggregation and gradient accumulation allow training on billions of pairs by limiting GPU memory use and communication overhead (Guo et al., 2024). For document-level retrieval, averaging sentence embeddings is highly effective and matches prior heavily engineered solutions (Yang et al., 2019).

Multilingual bi-encoder pipelines are maintainable: a single model serves all covered languages, avoids per-language forks, and can be periodically retrained on fresh matched pairs or new language data (Lavi, 2021, Zhang et al., 2022).

5. Empirical Results and Cross-Task Performance

Multilingual bi-encoders surpass lexical and monolingual dense baselines across retrieval, classification, and mining tasks. Notable empirical results:

CV–vacancy matching: recall@50 improved from ~25% (TF-IDF baseline) to ~65%; cross-language matching saw a relative lift of about 40% compared to monolingual models (Lavi, 2021).
FAQ retrieval on MFAQ: trained on ~6M FAQ pairs, multilingual XLM-RoBERTa roughly matched monolingual per-language baselines on English and German and outperformed them by about 4 to 11 MRR points on Spanish, French, Dutch, and Russian (Bruyn et al., 2021).
UN parallel corpus: a bidirectional dual encoder with additive margin softmax yielded P@1 ≥ 86% for sentence retrieval and ~97% for document retrieval (Yang et al., 2019).
Image-text retrieval: M²-Encoder-10B reached 88.5% top-1 accuracy on ImageNet (English), 80.7% on ImageNet-CN (Chinese), and ≥91.2% recall@1 across major retrieval benchmarks (Guo et al., 2024).
Speech-to-text: Ideal-LLM reduced ASR WER by 32.6% over the baseline and reached BLEU scores of up to 36.78 on AST; the gains are attributed to dual-encoder fusion and language-adapted gating (Xue et al., 2024).

Multilingual bi-encoder models are also reported to reduce nationality-based disparity in hiring applications, because language becomes less usable as a proxy feature (Lavi, 2021).

6. Limitations, Alignment Controversies, and Analysis

Explicit alignment objectives (e.g., word-level L2 distance, linear mapping) yield only marginal or statistically insignificant improvements over strong multilingual pretraining and larger model capacity (Wu et al., 2020). Newly proposed contrastive alignment objectives are more robust to alignment noise, but their observed gains fall within a standard deviation of the no-alignment baseline, especially for XLM-R-large models. This analysis recommends prioritizing larger multilingual pretrained models and robust multi-task evaluation over alignment engineering.

A residual issue is model brittleness—multilingual bi-encoders are sometimes overly surface-keyword dependent, exhibiting suboptimal robustness to paraphrase and adversarial rewording (Bruyn et al., 2021). Extensions such as additive margin softmax and bidirectionality improve discrimination and recall, but further research is warranted on phrase-level and context-adaptive retrieval.

7. Recommended Practices and Future Directions

Recommended practices for multilingual dense retrieval include starting from large multilingual pre-trained encoders (mBERT, XLM-R), leveraging multi-stage fine-tuning (large English collection followed by in-language or cross-language finetuning), using shared subword vocabularies, and relying on in-batch negative sampling for scalable training (Zhang et al., 2022). Hybrid approaches with sparse retrieval (BM25) consistently outperform either method alone.
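One simple way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF), sketched below. This is an illustrative choice, not necessarily the fusion method used in the cited work, which may instead interpolate scores; the constant `k=60` is a commonly used default, and the document identifiers are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by id.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["d1", "d2", "d3"]    # hypothetical sparse (BM25) results
dense_ranking = ["d3", "d1", "d4"]   # hypothetical bi-encoder results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against cosine similarities, which live on different scales.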

For extension to multimodal and low-resource scenarios, domain-specific encoder fusion (as in Ideal-LLM or M²-Encoder), balanced data distribution, and grouping/aggregation methods facilitate large-scale and multi-language coverage. Explicit alignment regularization may be most beneficial for smaller models or low-resource settings, but scaling data and model size remains the empirically superior strategy (Guo et al., 2024, Hu et al., 2020, Vulić et al., 2022).

Future work involves expanding to more languages, incorporating phrase- and entity-level alignment objectives, enhancing robustness, and extending grouped bi-encoder architectures to audio, video, and cross-modal retrieval (Guo et al., 2024, Xue et al., 2024, Vulić et al., 2022).