
NV-Retriever-v1: Text Embedding & Retrieval Model

Updated 21 February 2026
  • NV-Retriever-v1 is a state-of-the-art text embedding and retrieval model that uses a bi-encoder Mistral-7B backbone for semantic search and retrieval-augmented generation.
  • It employs a novel positive-aware hard-negative mining algorithm that filters out false negatives, significantly boosting performance metrics like NDCG@10.
  • Its scalable architecture, combined with ONNX and FAISS deployment, ensures efficient real-time retrieval for practical RAG applications.

NV-Retriever-v1 is a state-of-the-art text embedding and retrieval model designed to optimize information retrieval tasks such as semantic search and retrieval-augmented generation (RAG). It is based on a bi-encoder adaptation of the Mistral-7B Transformer decoder, fine-tuned using a novel family of positive-aware hard-negative mining algorithms, achieving first place on the MTEB Retrieval (BEIR) benchmark as of July 2024 (Moreira et al., 2024).

1. Model Architecture

NV-Retriever-v1 adopts the Mistral-7B Transformer as its core backbone, using 32 causal-language-model layers. It is converted to a bi-encoder architecture by enabling bidirectional self-attention during embedding and appending a mean-pooling read-out head. The bi-encoder construction is identical to NV-Embed-v1 (Lee et al., 2024) and e5-Mistral (Wang et al., 2023). Every attention matrix is augmented with LoRA adapters (rank $r = 16$, scaling factor $\alpha = 32$), with all other model parameters frozen during fine-tuning.
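A minimal sketch of how this adapter configuration might be expressed with HuggingFace PEFT; the target module names are an assumption (the source states only that every attention matrix is adapted), and the switch to bidirectional attention is handled separately:

```python
# Hypothetical LoRA setup mirroring the reported hyperparameters (r=16, alpha=32).
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention matrices
    bias="none",
)
model = get_peft_model(base, lora_cfg)  # all non-LoRA parameters remain frozen
```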

Given a sequence of $L$ tokens, the embedding $x \in \mathbb{R}^{4096}$ is computed as:

$$x = \frac{1}{|M|} \sum_{t \in M} h_L[t]$$

where $h_L[t]$ is the last-layer hidden state of token $t$ and $M$ is the set of non-instruction, non-padding tokens.
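A minimal PyTorch sketch of this masked mean pooling, assuming `last_hidden` is the encoder's final hidden states and `pool_mask` marks the tokens in $M$:

```python
import torch

def mean_pool(last_hidden: torch.Tensor, pool_mask: torch.Tensor) -> torch.Tensor:
    """last_hidden: (batch, seq, 4096); pool_mask: (batch, seq), 1 for
    non-instruction, non-padding tokens (the set M), 0 elsewhere."""
    mask = pool_mask.unsqueeze(-1).float()    # (batch, seq, 1)
    summed = (last_hidden * mask).sum(dim=1)  # sum hidden states over M
    count = mask.sum(dim=1).clamp(min=1.0)    # |M| for each example
    return summed / count                     # (batch, 4096) embeddings
```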

Key Architectural Parameters:

| Parameter | Value |
| --- | --- |
| Transformer layers | 32 |
| Hidden/embedding size | 4096 |
| Attention | Bidirectional (fine-tuning) |
| Max query length | 192 tokens |
| Max passage length | 512 tokens |
| LoRA rank / $\alpha$ | 16 / 32 |

2. Training Objective and Loss Functions

NV-Retriever-v1 employs contrastive learning via the InfoNCE loss, mixing in-batch and hard negatives for each training example. The objective promotes high similarity between the query embedding $q_i$ and its positive passage $p_i^+$ while suppressing similarity to hard negatives $p_{i,j}^-$ and in-batch negatives.

Loss Function:

Let $\mathrm{sim}(u,v) = u^\top v$ and temperature $\tau = 0.05$. For a batch of $N$ examples,

$$\ell_i = -\log \frac{\exp\bigl(\mathrm{sim}(q_i, p_i^+)/\tau\bigr)}{\exp\bigl(\mathrm{sim}(q_i, p_i^+)/\tau\bigr) + \sum_{j=1}^{K_h} \exp\bigl(\mathrm{sim}(q_i, p_{i,j}^-)/\tau\bigr) + \sum_{p^- \in \text{in-batch}} \exp\bigl(\mathrm{sim}(q_i, p^-)/\tau\bigr)}$$

and the full batch loss is

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^N \ell_i.$$

During Stage 1 (retrieval only), each query is paired with one hard negative and in-batch negatives; Stage 2 (multi-task) uses five hard negatives (no in-batch).
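A self-contained PyTorch sketch of this objective under the paper's dot-product similarity (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def infonce_loss(q, pos, hard_negs, temperature=0.05):
    """q: (N, d) query embeddings; pos: (N, d) positive embeddings;
    hard_negs: (N, K_h, d) mined hard negatives. Other examples'
    positives serve as in-batch negatives, as in Stage 1."""
    pos_logits = (q * pos).sum(-1, keepdim=True)             # (N, 1)
    hard_logits = torch.einsum("nd,nkd->nk", q, hard_negs)   # (N, K_h)
    inbatch_logits = q @ pos.T                                # (N, N); diagonal is the positive
    # mask the diagonal so each positive is not double-counted as a negative
    eye = torch.eye(q.size(0), dtype=torch.bool, device=q.device)
    inbatch_logits = inbatch_logits.masked_fill(eye, float("-inf"))
    logits = torch.cat([pos_logits, hard_logits, inbatch_logits], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at column 0
    return F.cross_entropy(logits, labels)
```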

3. Positive-Aware Hard-Negative Mining

The central innovation of NV-Retriever-v1 lies in its positive-aware hard-negative mining, which uses the positive relevance score to remove false negatives and anchor thresholds for hard negative selection.

For each (query, positive) pair, a pretrained teacher embedding model produces a positive similarity score:

$$pos\_score = \mathrm{sim}(q, p^+)$$

Negatives with similarity scores close to $pos\_score$ are likely unlabelled positives and are excluded. In the TopK-PercPos algorithm, negative candidates $c$ are retained only if $\mathrm{sim}(q, c) \leq \mathrm{margin\_pct} \cdot pos\_score$ (typically $\mathrm{margin\_pct} = 0.95$); the top $K$ candidates satisfying this condition become the hard negatives.

TopK-PercPos Algorithm:

```python
# TopK-PercPos mining (sketch): teacher_model, sim, topM_by_similarity,
# corpus C, margin_pct (e.g. 0.95), K, and the hard_negatives dict are assumed defined.
for q_i, p_i_pos in train_pairs:
    q_emb = teacher_model.encode(q_i)
    pos_emb = teacher_model.encode(p_i_pos)
    pos_score = sim(q_emb, pos_emb)              # positive anchors the threshold
    candidates = topM_by_similarity(q_emb, C)    # retrieve M >> K candidates
    # keep candidates safely below the positive's score (likely true negatives)
    kept = [c for c in candidates
            if sim(q_emb, teacher_model.encode(c)) <= margin_pct * pos_score
            and c != p_i_pos]
    hard_negatives[q_i] = kept[:K]               # top-K hardest survivors
```
Variants include TopK-MarginPos, TopK-Abs, Top-K shifted by $N$, and softmax sampling among the top $K$; their filter rules are sketched below. Across all ablations, positive-aware mining (MarginPos or PercPos) yielded higher NDCG@10 than fixed-threshold or shifted-rank baselines.
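A minimal sketch of the filter predicates that distinguish these variants (threshold values taken from the ablation configurations reported below):

```python
# Candidate filter predicates for the main mining variants (sketch).
# sim_qc = sim(q, c); pos_score = sim(q, p+). Thresholds match the ablation table.

def keep_topk_abs(sim_qc: float, pos_score: float, theta: float = 0.70) -> bool:
    return sim_qc <= theta                   # fixed absolute ceiling, ignores the positive

def keep_topk_marginpos(sim_qc: float, pos_score: float, delta: float = 0.05) -> bool:
    return sim_qc <= pos_score - delta       # absolute margin below the positive score

def keep_topk_percpos(sim_qc: float, pos_score: float, margin_pct: float = 0.95) -> bool:
    return sim_qc <= margin_pct * pos_score  # percentage of the positive score
```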

4. Teacher and Base Model Selection

Ablation studies contrasted several teacher models for negative mining: BM25 (sparse), e5-large-unsupervised, e5-large-v2, snowflake-arctic-embed-l, e5-mistral-7b-instruct, and NV-Embed-v1. Teacher selection impacts downstream retrieval quality substantially.

Teacher Model Impact (zero-shot BEIR QA, avg. NDCG@10):

| Teacher Model | Avg. NDCG@10 |
| --- | --- |
| BM25 | 0.5002 |
| random negatives | 0.5248 |
| e5-large-unsupervised | 0.5494 |
| e5-large-v2 | 0.5704 |
| snowflake-arctic-embed-l | 0.5728 |
| NV-Embed-v1 | 0.5744 |
| e5-mistral-7b-instruct | 0.5810 |

Using a 7B-parameter Mistral-based teacher outperformed sparse retrieval and BERT-backbone models by up to 8 points (NDCG@10), suggesting significant teacher effects on mining efficacy.

5. Ablation Studies on Hard-Negative Mining

Extensive ablations compared mining strategies across teacher and base model permutations. When fine-tuning e5-large-unsupervised (334M parameters, 4 negatives), TopK-PercPos and TopK-PercPos+sampling consistently achieved the highest NDCG@10 (0.5856–0.5857), outperforming naive Top-K (0.5407) and fixed-margin approaches.

Mining Method Comparison (e5-large-unsupervised, avg. NDCG@10):

| Mining Method | Config | Avg. NDCG@10 |
| --- | --- | --- |
| Naive Top-K | | 0.5407 |
| Top-K shifted by N | N=10 | 0.5695 |
| TopK-Abs | $\theta = 0.70$ | 0.5759 |
| TopK-MarginPos | $\delta = 0.05$ | 0.5835 |
| TopK-PercPos | 95% | 0.5856 |
| TopK-PercPos+sampling (top-10) | softmax, k=10 | 0.5856 |
| Top1+sampled (3 from top-10) | softmax, k=10 | 0.5857 |

On the Mistral-7B-v0.1 base with 1 hard negative, TopK-PercPos+sampling reached 0.6499 vs. 0.6214 for Naive Top-K. The consistent benefits of positive-aware mining over fixed or shifted-rank methods indicate its general utility.

6. Training Protocols and Data

NV-Retriever-v1 training comprises two sequential stages:

Stage 1 (retrieval-only):

  • Datasets: ArguAna, BioASQ, FEVER, FiQA, GOOAQ, HotpotQA, MS-MARCO, NFCorpus, NLI, Natural Questions, PAQ, SciFact, SQuAD, StackExchange, TriviaQA (~670K queries)
  • Optimizer: AdamW, lr $= 1 \times 10^{-5}$, warmup of 100 steps (a minimal setup is sketched after this list)
  • Batch size: 32 (gradient accumulation 4, effective 128)
  • Negatives: 1 hard negative + in-batch negatives
  • Epochs: 12 (~34M steps), A100 GPUs, mixed precision
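A minimal sketch of this optimizer setup; the linear warmup schedule and the `total_steps`, `train_loader`, and `encode_batch` names are assumptions, while `model` and `infonce_loss` follow the earlier sketches:

```python
import torch
from transformers import get_linear_schedule_with_warmup  # linear decay is an assumption

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=total_steps)  # total_steps: hypothetical

accum_steps = 4  # batch 32 x 4 accumulation steps = effective batch 128
for step, batch in enumerate(train_loader):
    q, pos, negs = encode_batch(model, batch)  # encode_batch: hypothetical helper
    loss = infonce_loss(q, pos, negs) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```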

Stage 2 (classification/multitask):

  • Same datasets plus classification (Banking77, Amazon Reviews, Emotion, IMDB, MTOP, Toxic, TweetSentiment) and STS/regression/clustering
  • Batch size: 8 (accumulation 16, effective 128), 5 hard negatives (no in-batch), 12 epochs (~10M steps)

Preprocessing: Query inputs are prefixed with instruction strings (masked out for pooling). Text is lowercased and tokenized via HuggingFace’s Mistral BPE. Passages receive no prefix.
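A sketch of this asymmetric preprocessing; the instruction template is hypothetical (the source states only that queries get a prefix that is masked from pooling, while passages get none):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
INSTRUCTION = "Instruct: Given a question, retrieve relevant passages.\nQuery: "  # hypothetical

def prepare_query(query: str):
    prefix_len = len(tokenizer(INSTRUCTION)["input_ids"])
    enc = tokenizer(INSTRUCTION + query.lower(), truncation=True, max_length=192)
    # pooling mask: 0 over instruction tokens, 1 over query tokens (the set M)
    pool_mask = [0] * prefix_len + [1] * (len(enc["input_ids"]) - prefix_len)
    return enc, pool_mask

def prepare_passage(passage: str):
    return tokenizer(passage.lower(), truncation=True, max_length=512)  # no prefix
```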

7. Benchmark Results and Deployment

On the July 2024 MTEB Retrieval leaderboard (15 BEIR datasets), NV-Retriever-v1 achieved first place with an average NDCG@10 of 60.90, surpassing gte-Qwen2-7B-instruct (60.25) and Linq-Embed-Mistral (60.19). Per-dataset scores and full results are reported in Table 7 of the source, with an observed advantage of +0.65 points over the prior best (Moreira et al., 2024).

Model leaderboard excerpt:

| Model | Avg. NDCG@10 |
| --- | --- |
| NV-Retriever-v1 | 60.90 |
| gte-Qwen2-7B-instruct | 60.25 |
| Linq-Embed-Mistral | 60.19 |
| SFR-Embedding-2_R | 60.18 |
| NV-Embed-v1 | 59.36 |

For practical deployment, embeddings are exported to ONNX for real-time retrieval and indexed with FAISS (IVF-PQ or HNSW). In a RAG setting, user queries are embedded, the top-$k$ passages are retrieved, and the context is fed into an LLM for downstream generation.
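A minimal FAISS sketch of this retrieval step (file names and HNSW parameters are illustrative; the inner-product metric matches the model's dot-product similarity):

```python
import faiss
import numpy as np

d = 4096  # NV-Retriever-v1 embedding dimension
passage_embs = np.load("passage_embs.npy").astype("float32")  # hypothetical export

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # HNSW, 32 links per node
index.hnsw.efConstruction = 200  # illustrative build-quality setting
index.add(passage_embs)

query_emb = np.load("query_emb.npy").astype("float32").reshape(1, d)  # hypothetical
scores, ids = index.search(query_emb, 10)  # top-10 passage ids for the RAG context
```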

8. Code Availability and Reproducibility

NV-Retriever-v1 is released under Apache 2.0 on HuggingFace: https://huggingface.co/nvidia/NV-Retriever-v1. The repository includes full PyTorch/Transformers+PEFT scripts for training, LoRA configuration, and data processing. Reproducibility steps:

  1. Download and reformat datasets.
  2. Precompute negatives using compute_negatives.py with TopK-PercPos (95%).
  3. Execute train_stage1.py then train_stage2.py for full training as specified in Table 12.
  4. Deploy using ONNX and FAISS as described.

All code, config files, and data recipes are publicly available, supporting end-to-end reproduction of the model and its methodology.

References

NV-Retriever-v1 and its underlying positive-aware hard-negative mining framework are documented in "NV-Retriever: Improving text embedding models with effective hard-negative mining" (Moreira et al., 2024).
