
Two-Tower Retrieval Architecture

Updated 30 January 2026
  • Two-tower retrieval architectures are dual-encoder models that project queries and items into a common embedding space for efficient similarity search.
  • They leverage contrastive learning with in-batch and cross-batch negative sampling to achieve robust retrieval performance and sub-millisecond latency.
  • Extensions such as early-interaction modules and dynamic index alignment enhance expressiveness and mitigate issues like representation drift in large-scale deployments.

A two-tower retrieval architecture, also known as a dual-encoder or Siamese network in certain settings, is a foundational paradigm for large-scale retrieval, recommendations, and dense information retrieval. It consists of two parallel neural networks—one for queries (users, questions, etc.) and one for items (products, documents, passages, etc.)—that project their respective inputs into a shared embedding space. Retrieval is then performed efficiently via approximate nearest neighbor (ANN) search over pre-computed item embeddings. This design underpins modern web-scale recommenders, search engines, and cross-modal retrieval systems.

1. Architectural Foundations and Workflow

The canonical two-tower architecture is defined by two neural encoders: a query tower $f_q$ and an item tower $f_i$, each parameterized and trained to produce $d$-dimensional embeddings. For an input pair $(q, i)$, the model computes their similarity via an inner product or cosine similarity in the embedding space:

$$\text{score}(q, i) = f_q(q)^\top f_i(i)$$

Inputs on both sides may be multimodal (text, categorical features, images) and are typically processed via embedding layers, attention, and feed-forward MLPs (Osowska-Kurczab et al., 19 Jul 2025, Wang et al., 2021).
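As a concrete illustration, the scoring path above can be sketched with two toy feed-forward encoders. This is a minimal NumPy sketch: the layer sizes, random weights, and input dimensions are illustrative assumptions, not taken from any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden, out_dim):
    """Random-weight two-layer MLP encoder: in_dim -> hidden -> out_dim."""
    W1 = rng.normal(0, 0.1, (in_dim, hidden))
    W2 = rng.normal(0, 0.1, (hidden, out_dim))
    def forward(x):
        h = np.maximum(x @ W1, 0.0)                            # ReLU
        z = h @ W2
        return z / np.linalg.norm(z, axis=-1, keepdims=True)   # L2-normalize
    return forward

d = 32                                             # shared embedding dimension
f_q = make_mlp(in_dim=100, hidden=64, out_dim=d)   # query tower
f_i = make_mlp(in_dim=300, hidden=64, out_dim=d)   # item tower (different input space)

q = rng.normal(size=(1, 100))      # one query feature vector
items = rng.normal(size=(5, 300))  # five item feature vectors

# score(q, i) = f_q(q)^T f_i(i): inner product in the shared space;
# with L2-normalized outputs this equals cosine similarity.
scores = f_q(q) @ f_i(items).T     # shape (1, 5)
best = int(np.argmax(scores))      # highest-scoring item id
```

Note that the two towers never see each other's inputs; they meet only at the final inner product, which is what makes item embeddings precomputable.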

At serving time, item embeddings are pre-computed and stored in a high-performance ANN index (e.g., Faiss IVF-PQ, HNSW). At query time, only the query encoder is run online, after which candidates are retrieved by efficient similarity search (Wang et al., 15 Dec 2025, Osowska-Kurczab et al., 19 Jul 2025).
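A minimal sketch of this offline/online split, assuming L2-normalized embeddings and substituting exact inner-product search for the ANN index (in production, Faiss IVF-PQ or HNSW would replace the `top_k` helper below):

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Offline: embed the whole catalogue once and store the vectors ---
item_embs = rng.normal(size=(10_000, 32))
item_embs /= np.linalg.norm(item_embs, axis=1, keepdims=True)

def top_k(query_emb, index, k=10):
    """Exact inner-product search over precomputed item embeddings.
    A real deployment swaps this for an ANN index (Faiss, HNSW)."""
    scores = index @ query_emb
    top = np.argpartition(-scores, k)[:k]     # unordered top-k candidates
    return top[np.argsort(-scores[top])]      # sorted candidate ids

# --- Online: only the query encoder runs per request ---
q_emb = rng.normal(size=32)
q_emb /= np.linalg.norm(q_emb)
candidates = top_k(q_emb, item_embs, k=10)
```

The key property is that per-request work is one encoder pass plus a similarity lookup; the O(N) item-encoding cost is paid offline.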

2. Representation Learning and Training Objectives

Training two-tower architectures relies on contrastive or softmax-based objectives. The typical setup uses positive pairs (e.g., user–clicked item, query–relevant passage) and negative sampling:

$$\mathcal{L} = -\log \frac{ \exp\left( s(f_q(q), f_i(i^+)) / \tau \right) }{ \sum_{i^-} \exp\left( s(f_q(q), f_i(i^-)) / \tau \right) }$$

where $s(\cdot,\cdot)$ is the similarity function, $\tau$ is a temperature, positives $i^+$ are drawn from logs, and negatives $i^-$ come from in-batch candidates (Moiseev et al., 2023, Wang et al., 2021), cross-batch memory (Wang et al., 2021), or hard negative mining.
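The in-batch variant of this loss, where each query's positive is its paired item and every other item in the batch serves as a negative, can be sketched as follows (a NumPy sketch; batch size, dimension, and temperature are illustrative choices):

```python
import numpy as np

def in_batch_contrastive_loss(q_embs, i_embs, tau=0.05):
    """Sampled-softmax contrastive loss: for row b, item b is the positive
    and the other B-1 items in the batch act as negatives."""
    logits = (q_embs @ i_embs.T) / tau                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # positives on the diagonal

rng = np.random.default_rng(2)
B, d = 8, 16
q = rng.normal(size=(B, d)); q /= np.linalg.norm(q, axis=1, keepdims=True)
i = rng.normal(size=(B, d)); i /= np.linalg.norm(i, axis=1, keepdims=True)

loss_random = in_batch_contrastive_loss(q, i)   # unaligned towers: high loss
loss_aligned = in_batch_contrastive_loss(q, q)  # perfectly aligned: low loss
```

Training pushes the diagonal (positive) similarities up relative to the off-diagonal (negative) ones, which is why a perfectly aligned pair of towers drives the loss toward zero.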

Variants such as symmetric dual-encoders (shared weights, "Siamese") and asymmetric dual-encoders (independent parameters) are common in passage retrieval and QA (Moiseev et al., 2023, Liang et al., 2020). Loss regularization—e.g., SamToNe's incorporation of "same-tower" negatives—acts as an explicit embedding space regularizer for better alignment (Moiseev et al., 2023).

3. Scaling, Efficiency, and Real-World Deployment

The two-tower architecture is engineered for production-scale efficiency. Offline, the expensive operations are amortized by precomputing and storing all item vectors. Online, a single inference is required to embed the query, and candidates are retrieved by fast approximate search in the item embedding space (Wang et al., 15 Dec 2025, Osowska-Kurczab et al., 19 Jul 2025, Chen et al., 2024). This yields sub-millisecond latency for billion-scale corpora and is robust to updates in the item corpus (Wang et al., 2021, Huang et al., 2024).

Scaling dual-encoders to large model and dataset sizes is tractable—LLMs can be used as strong backbones, with subsequent tower-specific distillation to meet latency constraints. For instance, ScalingNote demonstrates that a 4-layer student query-tower distilled from a 7B LLM preserves >95% of retrieval accuracy at >100× higher QPS (Huang et al., 2024).
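A generic embedding-level distillation objective of this kind can be written as a plain MSE between teacher and student query embeddings. This is a sketch under that assumption; ScalingNote's actual recipe may differ, and the shapes below are illustrative.

```python
import numpy as np

def embedding_distillation_loss(student_emb, teacher_emb):
    """Mean-squared error pulling the small student tower's embeddings
    toward the large teacher's (one common tower-distillation objective)."""
    return float(np.mean((student_emb - teacher_emb) ** 2))

rng = np.random.default_rng(5)
teacher = rng.normal(size=(4, 32))                          # e.g., LLM-teacher query embeddings
student_close = teacher + rng.normal(0, 0.01, size=(4, 32)) # well-distilled student
student_far = rng.normal(size=(4, 32))                      # untrained student

loss_close = embedding_distillation_loss(student_close, teacher)
loss_far = embedding_distillation_loss(student_far, teacher)
```

Because only embeddings (not logits over a vocabulary) need to match, the student can have a completely different, much cheaper architecture than the teacher.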

Recent advances address scaling laws in both data and model size, enabling sustained accuracy gains up to hundreds of millions of pairs and multi-billion parameter encoders (Huang et al., 2024).

4. Extensions: Interaction Mechanisms and Index Consistency

Two-tower models are architecturally “late-interaction”—user and item features do not mix until the similarity stage—which offers efficiency but limits modeling of higher-order cross-features (Wang et al., 15 Dec 2025, Rangadurai et al., 2024, Li et al., 2022). Extensions address this:

  • Early and Middle Interaction Modules: Architectures such as IntTower (Li et al., 2022) or FIT (Xiong et al., 16 Sep 2025) inject early or mid-network attention/fusion (e.g., FE-Block, meta-query modules) to increase representation expressiveness without sacrificing deployment efficiency.
  • Hierarchical, Modular, or Sparse Decoupling: Methods like HSNN (Rangadurai et al., 2024) and SparCode (Su et al., 2023) integrate modular neural units, hierarchical clustering, or sparse code-based inverted indices to increase interaction complexity and retrieval specificity while maintaining sub-linear query complexity.
  • Index Alignment: Standard practice builds ANN clusters/coarse indices on item-tower outputs, but misalignment or anisotropy between tower embeddings causes retrieval inconsistency. SCI (Wang et al., 15 Dec 2025) introduces symmetric input-swapping loss for tower alignment and index construction in the shared (query-tower) space, theoretically guaranteeing retrieval consistency and reducing performance cliffs on tail queries.

5. Negative Sampling, Data Augmentation, and Learning Strategies

Sampling strategy for negatives is a pivotal factor. In-batch sampling is standard but limited by batch size. Cross-batch negative sampling (CBNS) maintains a FIFO queue of negatives across mini-batches, exploiting embedding stability to magnify negative diversity and accelerate convergence (Wang et al., 2021).
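A sketch of the CBNS memory structure, assuming a fixed-size FIFO of detached item embeddings; the queue length and batch shapes here are illustrative, not the paper's settings:

```python
from collections import deque
import numpy as np

class CrossBatchNegativeQueue:
    """FIFO memory of item embeddings from recent mini-batches, reused as
    extra negatives for the current batch."""
    def __init__(self, maxlen=256):
        self.queue = deque(maxlen=maxlen)   # oldest embeddings fall off the end

    def negatives(self):
        """Stack of stored embeddings, or None while the queue is empty."""
        return np.stack(self.queue) if self.queue else None

    def enqueue(self, item_embs):
        # Stored embeddings are treated as fixed (no gradient flows back):
        # CBNS relies on embeddings drifting slowly between updates.
        for e in item_embs:
            self.queue.append(e)

rng = np.random.default_rng(3)
memory = CrossBatchNegativeQueue(maxlen=256)
for step in range(4):                       # simulate four training steps
    batch_items = rng.normal(size=(64, 32))
    extra_negs = memory.negatives()         # None on the very first step
    # ...compute the contrastive loss with in-batch + extra_negs here...
    memory.enqueue(batch_items)
```

The effective negative pool thus grows to the queue length at near-zero compute cost, since the stored embeddings are never re-encoded.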

Synthetic data generation via query generation (e.g., BART-based for passage retrieval) enables fully unsupervised or zero-shot two-tower training. This can result in models that outperform strong lexical baselines (BM25) and rival training on real labels (Liang et al., 2020).

Conditional retrieval—by injecting condition embeddings directly into the query tower—allows for user–item–condition retrieval (e.g., topic-conditioned recommendation) using only generic user–item logs, which can be efficiently deployed at scale (Lin et al., 22 Aug 2025).
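One simple way to realize this injection is to concatenate a learned condition embedding onto the query-tower input; this is an illustrative fusion scheme rather than the paper's exact design, and the table sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative lookup table of condition embeddings (e.g., topic ids).
condition_table = rng.normal(size=(50, 8))    # 50 conditions, 8-dim each

def conditional_query_input(user_feats, condition_id):
    """Append the condition embedding to the user features before the
    query tower, so one model serves many (user, condition) contexts."""
    return np.concatenate([user_feats, condition_table[condition_id]])

user = rng.normal(size=24)                    # generic user features
x = conditional_query_input(user, condition_id=7)   # 24 + 8 = 32-dim tower input
```

At serving time, varying `condition_id` for the same user yields different query embeddings, and hence different retrieved candidates, from a single trained model.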

6. Empirical Outcomes, Metrics, and Theoretical Guarantees

Two-tower architectures are validated across real-world platforms (e.g., Allegro.com (Osowska-Kurczab et al., 19 Jul 2025) and Pinterest (Lin et al., 22 Aug 2025)). Empirical A/B tests demonstrate statistically significant uplifts (CTR, GMV, engagement) with minimal maintenance load and robust serving at scale.

Unified frameworks such as LT-TTD (Abraich, 7 May 2025) combine two-tower retrieval and cross-candidate re-ranking in a single architecture, exploiting knowledge distillation to reduce error propagation and provably improving global optimum and ranking quality (NDCG), while maintaining total complexity $O(d \log N + k^2 d)$.

Theoretical analyses provide guarantees:

  • Representation alignment and index consistency: SCI aligns and quantizes both towers for stable ANN routing and improved recall, especially under finite probe budgets (Wang et al., 15 Dec 2025).
  • Scaling laws: Retrieval performance tightly follows power-law scaling in model and data size, modulated by efficient distillation (Huang et al., 2024).
  • Negative sampling error: Cross-batch negatives introduce bounded estimation errors provided embedding drift is controlled (Wang et al., 2021).

7. Limitations, Open Challenges, and Future Directions

Despite their scalability, two-tower frameworks remain restricted by:

  • Expressiveness: Late interaction restricts modeling of complex cross-features; hybrid and modular architectures introduce richer signal mixing at the cost of marginal extra latency (Li et al., 2022, Xiong et al., 16 Sep 2025, Rangadurai et al., 2024).
  • Representation/Index Drift: Model updates and item churn can cause misalignment between representations and indices, necessitating periodic retraining and reindexing (Rangadurai et al., 2024, Wang et al., 15 Dec 2025).
  • Cold-Start and Heterogeneous Data: Item-side attention fusion (text, categorical, image) as in Amazon’s Prime Video (Wang et al., 2021) and dynamic injection of synthetic queries (Liang et al., 2020) mitigate cold-start issues, but full coverage in diverse modalities remains open.
  • Online Adaptivity and Learning: Efficient strategies for online model refreshes, fully joint user–item–context modeling (including multi-hop relational graphs (Tan et al., 13 Jan 2026)), and distillation from LLMs remain major research axes.

A plausible implication is that the two-tower paradigm will remain central to web-scale search, recommendation, and cross-modal retrieval, with ongoing innovation targeting tighter alignment, higher expressiveness, and more robust adaptation at industry scale. Key future directions include systematic integration of cross-interaction modules, full-graph and multi-hop reasoning frameworks, and deeper alignment between model training, index construction, and serving infrastructure.

