NV-Retriever-v1: Text Embedding & Retrieval Model
- NV-Retriever-v1 is a state-of-the-art text embedding and retrieval model that uses a bi-encoder Mistral-7B backbone for semantic search and retrieval-augmented generation.
- It employs a novel positive-aware hard-negative mining algorithm that filters out false negatives, significantly boosting performance metrics like NDCG@10.
- Its scalable architecture, combined with ONNX and FAISS deployment, ensures efficient real-time retrieval for practical RAG applications.
NV-Retriever-v1 is a state-of-the-art text embedding and retrieval model designed to optimize information retrieval tasks such as semantic search and retrieval-augmented generation (RAG). It is based on a bi-encoder adaptation of the Mistral-7B Transformer decoder, fine-tuned using a novel family of positive-aware hard-negative mining algorithms, achieving first place on the MTEB Retrieval (BEIR) benchmark as of July 2024 (Moreira et al., 2024).
1. Model Architecture
NV-Retriever-v1 adopts the Mistral-7B Transformer decoder as its core backbone, with 32 layers. It is converted to a bi-encoder architecture by enabling bidirectional self-attention during embedding and appending a mean-pooling read-out head; this bi-encoder construction is identical to NV-Embed-v1 (Lee et al., 2024) and e5-Mistral (Wang et al., 2023). Every attention matrix is augmented with LoRA adapters (rank r = 16, scaling factor α = 32), with all other model parameters frozen during fine-tuning.
Given a sequence of tokens, the embedding is computed by mean pooling:

$$ e = \frac{1}{|T|} \sum_{t \in T} h_t $$

where $h_t$ is the last-layer hidden state of token $t$ and $T$ is the set of non-instruction, non-padding tokens.
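As a concrete illustration, masked mean pooling can be sketched in a few lines of NumPy (the hidden states and mask below are toy values, not model outputs):

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average last-layer hidden states over tokens where mask == 1.

    hidden_states: (seq_len, hidden_size) last-layer states
    mask: (seq_len,) 1 for content tokens, 0 for instruction/padding tokens
    """
    mask = mask.astype(hidden_states.dtype)
    return (hidden_states * mask[:, None]).sum(axis=0) / mask.sum()

# Toy example: 4 tokens, hidden size 3; the last token is padding.
h = np.array([[1.0, 0.0, 2.0],
              [3.0, 4.0, 0.0],
              [2.0, 2.0, 1.0],
              [9.0, 9.0, 9.0]])
m = np.array([1, 1, 1, 0])
emb = mean_pool(h, m)  # averages only the first three rows
```

Masking before averaging is what lets the instruction prefix steer the encoder without contaminating the pooled embedding.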
Key Architectural Parameters:
| Parameter | Value |
|---|---|
| Transformer Layers | 32 |
| Hidden/Embedding Size | 4096 |
| Attention | Bidirectional (fine-tuning) |
| Max Query Length | 192 tokens |
| Max Passage Length | 512 tokens |
| LoRA Rank / α | 16 / 32 |
2. Training Objective and Loss Functions
NV-Retriever-v1 employs contrastive learning via the InfoNCE loss, mixing in-batch and hard negatives for each training example. The objective promotes high similarity between the query embedding and its positive passage while suppressing similarity to hard negatives and in-batch negatives.
Loss Function:
Let $s(q, p)$ be the cosine similarity between the query and passage embeddings, and $\tau$ the temperature. For a batch of $B$ examples,

$$ \mathcal{L}_i = -\log \frac{\exp\!\big(s(q_i, p_i^+)/\tau\big)}{\exp\!\big(s(q_i, p_i^+)/\tau\big) + \sum_{n \in N_i} \exp\!\big(s(q_i, n)/\tau\big)} $$

where $N_i$ is the set of hard and in-batch negatives for query $q_i$, and the full batch loss is

$$ \mathcal{L} = \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_i $$
During Stage 1 (retrieval only), each query is paired with one hard negative and in-batch negatives; Stage 2 (multi-task) uses five hard negatives (no in-batch).
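A minimal NumPy sketch of the per-query InfoNCE term (the embeddings are toy vectors and the temperature value is illustrative; this is not the training code):

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(q, pos, negatives, tau=0.05):
    """Per-query InfoNCE: positive logit vs. negative logits."""
    logits = np.array([cos(q, pos)] + [cos(q, n) for n in negatives]) / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # -log p(positive)

rng = np.random.default_rng(0)
q = rng.normal(size=8)
pos = q + 0.1 * rng.normal(size=8)              # near-duplicate positive
negs = [rng.normal(size=8) for _ in range(4)]   # toy hard/in-batch negatives
loss = info_nce(q, pos, negs)
```

The loss is minimized when the positive's similarity dominates all negative similarities; a lower temperature sharpens that contrast.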
3. Positive-Aware Hard-Negative Mining
The central innovation of NV-Retriever-v1 lies in its positive-aware hard-negative mining, which uses the positive relevance score to remove false negatives and anchor thresholds for hard negative selection.
For each (query, positive) pair $(q_i, p_i^+)$, a pretrained teacher embedding model produces a positive similarity score:

$$ s_{pos} = \mathrm{sim}(e_{q_i}, e_{p_i^+}) $$

Negatives with similarity scores close to $s_{pos}$ are likely unlabelled positives and are excluded. In the TopK-PercPos algorithm, a negative candidate $c$ is retained only if $\mathrm{sim}(e_{q_i}, e_c) \le \text{margin\_pct} \cdot s_{pos}$ (typically margin_pct = 95%); the top $K$ candidates satisfying this condition are kept as hard negatives.
TopK-PercPos Algorithm:
```
for each (q_i, p_i_pos) in train_pairs:
    q_emb, pos_emb = teacher_model.encode(q_i, p_i_pos)
    pos_score = sim(q_emb, pos_emb)
    S = topM_by_similarity(q_i, C)   # candidate pool C, M >> K
    S = [c for c in S
         if sim(q_emb, c) <= margin_pct * pos_score and c != p_i_pos]
    hard_negatives_i = S[:K]
```
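A self-contained NumPy sketch of the same filter over toy vectors (`margin_pct = 0.95` follows the paper's typical setting; the embeddings, dimensions, and candidate pool are illustrative):

```python
import numpy as np

def topk_percpos(q_emb, pos_emb, cand_embs, k=4, margin_pct=0.95):
    """Keep the top-k candidates whose teacher similarity stays below
    margin_pct * positive score, filtering likely false negatives."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos_score = cos(q_emb, pos_emb)
    scored = sorted(((cos(q_emb, c), i) for i, c in enumerate(cand_embs)),
                    reverse=True)                  # most similar first
    kept = [i for s, i in scored if s <= margin_pct * pos_score]
    return kept[:k]

rng = np.random.default_rng(1)
q = rng.normal(size=16)
pos = q + 0.05 * rng.normal(size=16)               # labelled positive
false_neg = q + 0.06 * rng.normal(size=16)         # unlabelled near-positive
cands = [false_neg] + [rng.normal(size=16) for _ in range(8)]
hard_negs = topk_percpos(q, pos, cands, k=4)       # index 0 is filtered out
```

The near-duplicate candidate at index 0 scores close to the positive, so the margin test removes it, while genuinely dissimilar candidates survive as hard negatives.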
4. Teacher and Base Model Selection
Ablation studies contrasted several teacher models for negative mining: BM25 (sparse), e5-large-unsupervised, e5-large-v2, snowflake-arctic-embed-l, e5-mistral-7b-instruct, and NV-Embed-v1. Teacher selection impacts downstream retrieval quality substantially.
Teacher Model Impact (zero-shot BEIR QA, avg. NDCG@10):
| Teacher Model | Avg. NDCG@10 |
|---|---|
| BM25 | 0.5002 |
| random negatives | 0.5248 |
| e5-large-unsupervised | 0.5494 |
| e5-large-v2 | 0.5704 |
| snowflake-arctic-embed-l | 0.5728 |
| NV-Embed-v1 | 0.5744 |
| e5-mistral-7b-instruct | 0.5810 |
Using a 7B-parameter Mistral-based teacher outperformed sparse retrieval and BERT-backbone models by up to 8 points (NDCG@10), suggesting significant teacher effects on mining efficacy.
5. Ablation Studies on Hard-Negative Mining
Extensive ablations compared mining strategies across teacher and base-model permutations. When fine-tuning e5-large-unsupervised (334M parameters, 4 negatives), the TopK-PercPos and TopK-PercPos+sampling methods consistently achieved the highest NDCG@10 (0.5856–0.5857), outperforming Naive Top-K (0.5407) and fixed-margin approaches.
Mining Method Comparison (e5-large-unsupervised, avg. NDCG@10):
| Mining Method | Config | Avg. NDCG@10 |
|---|---|---|
| Naive Top-K | – | 0.5407 |
| Top-K shifted by N | N=10 | 0.5695 |
| TopK-Abs | threshold = 0.70 | 0.5759 |
| TopK-MarginPos | margin = 0.05 | 0.5835 |
| TopK-PercPos | margin_pct = 95% | 0.5856 |
| TopK-PercPos+sampling (top-10) | softmax k=10 | 0.5856 |
| Top1+sampled(3 from top-10) | softmax k=10 | 0.5857 |
On the Mistral-7B-v0.1 base with 1 hard negative, TopK-PercPos+sampling reached 0.6499 vs. 0.6214 for Naive Top-K. The consistent benefits of positive-aware mining over fixed or shifted-rank methods indicate its general utility.
6. Training Protocols and Data
NV-Retriever-v1 training comprises two sequential stages:
Stage 1 (retrieval-only):
- Datasets: ArguAna, BioASQ, FEVER, FiQA, GOOAQ, HotpotQA, MS-MARCO, NFCorpus, NLI, Natural Questions, PAQ, SciFacts, SQuAD, StackExchange, TriviaQA (670K queries)
- Optimizer: AdamW, lr, warmup=100 steps
- Batch size: 32 (gradient accumulation 4, effective batch 128)
- Negatives: 1 hard negative + in-batch negatives
- Epochs: 12 (34M steps), A100 GPUs, mixed precision
Stage 2 (classification/multitask):
- Same datasets plus classification (Banking77, Amazon Reviews, Emotion, IMDB, MTOP, Toxic, TweetSentiment) and STS/regression/clustering
- Batch size: 8 (gradient accumulation 16, effective batch 128), 5 hard negatives (no in-batch), 12 epochs (10M steps)
Preprocessing: Query inputs are prefixed with instruction strings (masked out for pooling). Text is lowercased and tokenized via HuggingFace’s Mistral BPE. Passages receive no prefix.
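This preprocessing step can be sketched as follows (the instruction template string is a hypothetical placeholder, as the summary does not give the exact format; lowercasing and the no-prefix rule for passages follow the text above):

```python
def format_query(query: str, instruction: str) -> str:
    """Prepend a task instruction to the query; the instruction tokens
    are later masked out of mean pooling."""
    return f"Instruct: {instruction}\nQuery: {query.lower()}"

def format_passage(passage: str) -> str:
    """Passages receive no prefix."""
    return passage.lower()

q = format_query("What is InfoNCE?", "Retrieve passages that answer the question")
```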
7. Benchmark Results and Deployment
On the July 2024 MTEB Retrieval (15 BEIR datasets), NV-Retriever-v1 achieves 1st place with avg. NDCG@10 = 60.90, surpassing gte-Qwen2-7B-instruct (60.25) and Linq-Embed-Mistral (60.19). The per-dataset scores and full results are reported in Table 7 of the source, with an observed advantage of +0.65 points over the prior best (Moreira et al., 2024).
Model leaderboard excerpt:
| Model | Avg. NDCG@10 |
|---|---|
| NV-Retriever-v1 | 60.90 |
| gte-Qwen2-7B-instruct | 60.25 |
| Linq-Embed-Mistral | 60.19 |
| SFR-Embedding-2_R | 60.18 |
| NV-Embed-v1 | 59.36 |
For practical deployment, embeddings are exported to ONNX for real-time inference and indexed with FAISS (IVF-PQ or HNSW). In a RAG setting, user queries are embedded, the top-$k$ passages are retrieved, and the retrieved context is fed into an LLM for downstream generation.
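The retrieval step amounts to nearest-neighbour search over passage embeddings. A brute-force NumPy sketch of that step follows (all vectors are toy data; in production FAISS replaces this exact search with an approximate index):

```python
import numpy as np

def retrieve_top_k(query_emb, passage_embs, k=3):
    """Return indices of the k passages with highest cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity against every passage
    return np.argsort(-sims)[:k]     # indices of the k best matches

rng = np.random.default_rng(42)
passages = rng.normal(size=(100, 32))                # toy passage index
query = passages[7] + 0.01 * rng.normal(size=32)     # near-duplicate of passage 7
top = retrieve_top_k(query, passages, k=3)
```

An IVF-PQ or HNSW index trades a small recall loss for sub-linear search time, which is what makes this step viable at corpus scale.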
8. Code Availability and Reproducibility
NV-Retriever-v1 is released under Apache 2.0 on HuggingFace: https://huggingface.co/nvidia/NV-Retriever-v1. The repository includes full PyTorch/Transformers+PEFT scripts for training, LoRA configuration, and data processing. Reproducibility steps:
- Download and reformat the datasets.
- Precompute negatives using `compute_negatives.py` with TopK-PercPos (95%).
- Execute `train_stage1.py`, then `train_stage2.py`, for full training as specified in Table 12.
- Deploy using ONNX and FAISS as described.
All code, config files, and data recipes are publicly available, supporting end-to-end reproduction of the model and its methodology.
References
NV-Retriever-v1 and its underlying positive-aware hard-negative mining framework are documented in "NV-Retriever: Improving text embedding models with effective hard-negative mining" (Moreira et al., 2024).