NV-Retriever-v1: Text Embedding & Retrieval Model
- NV-Retriever-v1 is a state-of-the-art text embedding and retrieval model that uses a bi-encoder Mistral-7B backbone for semantic search and retrieval-augmented generation.
- It employs a novel positive-aware hard-negative mining algorithm that filters out false negatives, significantly boosting performance metrics like NDCG@10.
- Its scalable architecture, combined with ONNX and FAISS deployment, ensures efficient real-time retrieval for practical RAG applications.
NV-Retriever-v1 is a state-of-the-art text embedding and retrieval model designed to optimize information retrieval tasks such as semantic search and retrieval-augmented generation (RAG). It is based on a bi-encoder adaptation of the Mistral-7B Transformer decoder, fine-tuned using a novel family of positive-aware hard-negative mining algorithms, achieving first place on the MTEB Retrieval (BEIR) benchmark as of July 2024 (Moreira et al., 2024).
1. Model Architecture
NV-Retriever-v1 adopts the Mistral-7B Transformer decoder as its core backbone, with 32 layers. It is converted to a bi-encoder architecture by enabling bidirectional self-attention during embedding and appending a mean-pooling read-out head; this bi-encoder construction is identical to NV-Embed-v1 (Lee et al., 2024) and e5-Mistral (Wang et al., 2023). Every attention matrix is augmented with LoRA adapters (rank r = 16, scaling factor α = 32), with all other model parameters frozen during fine-tuning.
Given a sequence of tokens, the embedding is computed by mean pooling:

$$ e = \frac{1}{|T|} \sum_{t \in T} h_t $$

where $h_t$ is the last-layer hidden state of token $t$ and $T$ is the set of non-instruction, non-padding tokens.
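As a concrete illustration, masked mean pooling can be sketched in a few lines of NumPy (the hidden states and mask below are toy values, not model outputs):

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average last-layer hidden states over tokens where mask == 1.

    hidden_states: (seq_len, hidden_size) last-layer states
    mask: (seq_len,) 1 for content tokens, 0 for instruction/padding tokens
    """
    mask = mask.astype(hidden_states.dtype)
    return (hidden_states * mask[:, None]).sum(axis=0) / mask.sum()

# Toy example: 4 tokens, hidden size 3; the last token is padding.
h = np.array([[1.0, 0.0, 2.0],
              [3.0, 4.0, 0.0],
              [2.0, 2.0, 1.0],
              [9.0, 9.0, 9.0]])
m = np.array([1, 1, 1, 0])
emb = mean_pool(h, m)  # averages only the first three rows
```

Masking before averaging is what lets the instruction prefix steer the encoder without contaminating the pooled embedding.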
Key Architectural Parameters:
| Parameter | Value |
|---|---|
| Transformer Layers | 32 |
| Hidden/Embedding Size | 4096 |
| Attention | Bidirectional (fine-tuning) |
| Max Query Length | 192 tokens |
| Max Passage Length | 512 tokens |
| LoRA Rank / α | 16 / 32 |
2. Training Objective and Loss Functions
NV-Retriever-v1 employs contrastive learning via the InfoNCE loss, mixing in-batch and hard negatives for each training example. The objective promotes high similarity between the query embedding and its positive passage while suppressing similarity to hard negatives and in-batch negatives.
Loss Function:
Let $s(q, p)$ be the cosine similarity between the query and passage embeddings, and $\tau$ the temperature. For a batch of $B$ examples,

$$ \mathcal{L}_i = -\log \frac{\exp\!\big(s(q_i, p_i^+)/\tau\big)}{\exp\!\big(s(q_i, p_i^+)/\tau\big) + \sum_{n \in N_i} \exp\!\big(s(q_i, n)/\tau\big)} $$

where $N_i$ is the set of hard and in-batch negatives for query $q_i$, and the full batch loss is

$$ \mathcal{L} = \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_i $$
During Stage 1 (retrieval only), each query is paired with one hard negative and in-batch negatives; Stage 2 (multi-task) uses five hard negatives (no in-batch).
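A minimal NumPy sketch of the per-query InfoNCE term (the embeddings are toy vectors and the temperature value is illustrative; this is not the training code):

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(q, pos, negatives, tau=0.05):
    """Per-query InfoNCE: positive logit vs. negative logits."""
    logits = np.array([cos(q, pos)] + [cos(q, n) for n in negatives]) / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # -log p(positive)

rng = np.random.default_rng(0)
q = rng.normal(size=8)
pos = q + 0.1 * rng.normal(size=8)              # near-duplicate positive
negs = [rng.normal(size=8) for _ in range(4)]   # toy hard/in-batch negatives
loss = info_nce(q, pos, negs)
```

The loss is minimized when the positive's similarity dominates all negative similarities; a lower temperature sharpens that contrast.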
3. Positive-Aware Hard-Negative Mining
The central innovation of NV-Retriever-v1 lies in its positive-aware hard-negative mining, which uses the positive relevance score to remove false negatives and anchor thresholds for hard negative selection.
For each (query, positive) pair $(q_i, p_i^+)$, a pretrained teacher embedding model produces a positive similarity score:

$$ s_{pos} = \mathrm{sim}(e_{q_i}, e_{p_i^+}) $$

Negatives with similarity scores close to $s_{pos}$ are likely unlabelled positives and are excluded. In the TopK-PercPos algorithm, a negative candidate $c$ is retained only if $\mathrm{sim}(e_{q_i}, e_c) \le \text{margin\_pct} \cdot s_{pos}$ (typically margin_pct = 95%); the top $K$ candidates satisfying this condition are kept as hard negatives.
TopK-PercPos Algorithm:
```
for each (q_i, p_i_pos) in train_pairs:
    q_emb, pos_emb = teacher_model.encode(q_i, p_i_pos)
    pos_score = sim(q_emb, pos_emb)
    S = topM_by_similarity(q_i, C)   # candidate pool C, M >> K
    S = [c for c in S
         if sim(q_emb, c) <= margin_pct * pos_score and c != p_i_pos]
    hard_negatives_i = S[:K]
```
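A self-contained NumPy sketch of the same filter over toy vectors (`margin_pct = 0.95` follows the paper's typical setting; the embeddings, dimensions, and candidate pool are illustrative):

```python
import numpy as np

def topk_percpos(q_emb, pos_emb, cand_embs, k=4, margin_pct=0.95):
    """Keep the top-k candidates whose teacher similarity stays below
    margin_pct * positive score, filtering likely false negatives."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos_score = cos(q_emb, pos_emb)
    scored = sorted(((cos(q_emb, c), i) for i, c in enumerate(cand_embs)),
                    reverse=True)                  # most similar first
    kept = [i for s, i in scored if s <= margin_pct * pos_score]
    return kept[:k]

rng = np.random.default_rng(1)
q = rng.normal(size=16)
pos = q + 0.05 * rng.normal(size=16)               # labelled positive
false_neg = q + 0.06 * rng.normal(size=16)         # unlabelled near-positive
cands = [false_neg] + [rng.normal(size=16) for _ in range(8)]
hard_negs = topk_percpos(q, pos, cands, k=4)       # index 0 is filtered out
```

The near-duplicate candidate at index 0 scores close to the positive, so the margin test removes it, while genuinely dissimilar candidates survive as hard negatives.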
4. Teacher and Base Model Selection
Ablation studies contrasted several teacher models for negative mining: BM25 (sparse), e5-large-unsupervised, e5-large-v2, snowflake-arctic-embed-l, e5-mistral-7b-instruct, and NV-Embed-v1. Teacher selection impacts downstream retrieval quality substantially.
Teacher Model Impact (zero-shot BEIR QA, avg. NDCG@10):
| Teacher Model | Avg. NDCG@10 |
|---|---|
| BM25 | 0.5002 |
| random negatives | 0.5248 |
| e5-large-unsupervised | 0.5494 |
| e5-large-v2 | 0.5704 |
| snowflake-arctic-embed-l | 0.5728 |
| NV-Embed-v1 | 0.5744 |
| e5-mistral-7b-instruct | 0.5810 |
Using a 7B-parameter Mistral-based teacher outperformed sparse retrieval and BERT-backbone models by up to 8 points (NDCG@10), suggesting significant teacher effects on mining efficacy.
5. Ablation Studies on Hard-Negative Mining
Extensive ablations compared mining strategies across teacher and base-model permutations. When fine-tuning e5-large-unsupervised (334M parameters, 4 negatives), the TopK-PercPos and TopK-PercPos+sampling methods consistently achieved the highest NDCG@10 (0.5856–0.5857), outperforming Naive Top-K (0.5407) and fixed-margin approaches.
Mining Method Comparison (e5-large-unsupervised, avg. NDCG@10):
| Mining Method | Config | Avg. NDCG@10 |
|---|---|---|
| Naive Top-K | – | 0.5407 |
| Top-K shifted by N | N=10 | 0.5695 |
| TopK-Abs | threshold = 0.70 | 0.5759 |
| TopK-MarginPos | margin = 0.05 | 0.5835 |
| TopK-PercPos | margin_pct = 95% | 0.5856 |
| TopK-PercPos+sampling (top-10) | softmax k=10 | 0.5856 |
| Top1+sampled(3 from top-10) | softmax k=10 | 0.5857 |
On the Mistral-7B-v0.1 base with 1 hard negative, TopK-PercPos+sampling reached 0.6499 vs. 0.6214 for Naive Top-K. The consistent benefits of positive-aware mining over fixed or shifted-rank methods indicate its general utility.
6. Training Protocols and Data
NV-Retriever-v1 training comprises two sequential stages:
Stage 1 (retrieval-only):
- Datasets: ArguAna, BioASQ, FEVER, FiQA, GOOAQ, HotpotQA, MS-MARCO, NFCorpus, NLI, Natural Questions, PAQ, SciFacts, SQuAD, StackExchange, TriviaQA (670K queries)
- Optimizer: AdamW, lr, warmup=100 steps
- Batch size: 32 (gradient accumulation 4, effective batch 128)
- Negatives: 1 hard negative + in-batch negatives
- Epochs: 12 (34M steps), A100 GPUs, mixed precision
Stage 2 (classification/multitask):
- Same datasets plus classification (Banking77, Amazon Reviews, Emotion, IMDB, MTOP, Toxic, TweetSentiment) and STS/regression/clustering
- Batch size: 8 (gradient accumulation 16, effective batch 128), 5 hard negatives (no in-batch), 12 epochs (10M steps)
Preprocessing: Query inputs are prefixed with instruction strings (masked out for pooling). Text is lowercased and tokenized via HuggingFace’s Mistral BPE. Passages receive no prefix.
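This preprocessing step can be sketched as follows (the instruction template string is a hypothetical placeholder, as the summary does not give the exact format; lowercasing and the no-prefix rule for passages follow the text above):

```python
def format_query(query: str, instruction: str) -> str:
    """Prepend a task instruction to the query; the instruction tokens
    are later masked out of mean pooling."""
    return f"Instruct: {instruction}\nQuery: {query.lower()}"

def format_passage(passage: str) -> str:
    """Passages receive no prefix."""
    return passage.lower()

q = format_query("What is InfoNCE?", "Retrieve passages that answer the question")
```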
7. Benchmark Results and Deployment
On the July 2024 MTEB Retrieval (15 BEIR datasets), NV-Retriever-v1 achieves 1st place with avg. NDCG@10 = 60.90, surpassing gte-Qwen2-7B-instruct (60.25) and Linq-Embed-Mistral (60.19). The per-dataset scores and full results are reported in Table 7 of the source, with an observed advantage of +0.65 points over the prior best (Moreira et al., 2024).
Model leaderboard excerpt:
| Model | Avg. NDCG@10 |
|---|---|
| NV-Retriever-v1 | 60.90 |
| gte-Qwen2-7B-instruct | 60.25 |
| Linq-Embed-Mistral | 60.19 |
| SFR-Embedding-2_R | 60.18 |
| NV-Embed-v1 | 59.36 |
For practical deployment, embeddings are exported to ONNX for real-time inference and indexed with FAISS (IVF-PQ or HNSW). In a RAG setting, user queries are embedded, the top-$k$ passages are retrieved, and the retrieved context is fed into an LLM for downstream generation.
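The retrieval step amounts to nearest-neighbour search over passage embeddings. A brute-force NumPy sketch of that step follows (all vectors are toy data; in production FAISS replaces this exact search with an approximate index):

```python
import numpy as np

def retrieve_top_k(query_emb, passage_embs, k=3):
    """Return indices of the k passages with highest cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity against every passage
    return np.argsort(-sims)[:k]     # indices of the k best matches

rng = np.random.default_rng(42)
passages = rng.normal(size=(100, 32))                # toy passage index
query = passages[7] + 0.01 * rng.normal(size=32)     # near-duplicate of passage 7
top = retrieve_top_k(query, passages, k=3)
```

An IVF-PQ or HNSW index trades a small recall loss for sub-linear search time, which is what makes this step viable at corpus scale.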
8. Code Availability and Reproducibility
NV-Retriever-v1 is released under Apache 2.0 on HuggingFace: https://huggingface.co/nvidia/NV-Retriever-v1. The repository includes full PyTorch/Transformers+PEFT scripts for training, LoRA configuration, and data processing. Reproducibility steps:
- Download and reformat the datasets.
- Precompute negatives using `compute_negatives.py` with TopK-PercPos (95%).
- Execute `train_stage1.py`, then `train_stage2.py`, for full training as specified in Table 12.
- Deploy using ONNX and FAISS as described.
All code, config files, and data recipes are publicly available, supporting end-to-end reproduction of the model and its methodology.
References
NV-Retriever-v1 and its underlying positive-aware hard-negative mining framework are documented in "NV-Retriever: Improving text embedding models with effective hard-negative mining" (Moreira et al., 2024).