Making Large Language Models A Better Foundation For Dense Retrieval
Abstract: Dense retrieval requires discriminative text embeddings that represent the semantic relationship between query and document. It may benefit from large language models (LLMs), given their strong capability for semantic understanding. However, LLMs are pre-trained on text generation tasks, whose working pattern is entirely different from representing texts as embeddings. It is therefore important to study how to adapt LLMs properly so that they can be effectively initialized as backbone encoders for dense retrieval. In this paper, we propose a novel approach, called LLaRA (LLM adapted for dense RetrievAl), which works as a post-hoc adaptation of an LLM for dense retrieval. LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), in which the text embeddings from the LLM are used to reconstruct the tokens of the input sentence and to predict the tokens of the next sentence, respectively. LLaRA turns out to be simple, lightweight, and highly effective. Applied to adapt LLaMA-2-7B (base) on the Wikipedia corpus, it substantially improves the model's fine-tuned performance on a variety of dense retrieval benchmarks, such as MS MARCO and BEIR. Our model and code will be made publicly available in the BGE repository.
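To make the two pretext objectives concrete, here is a minimal sketch, assuming the paper's setup: a sentence embedding is read off from the LLM's output at a special prompt position and projected through the LM head onto the vocabulary, where EBAE asks it to reconstruct the input sentence's own tokens and EBAR asks it to predict the next sentence's tokens. The function names, shapes, and bag-of-tokens cross-entropy form are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary dimension.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def llara_pretext_loss(sent_emb, next_emb, lm_head, input_ids, next_ids):
    """Sketch of LLaRA's two embedding-based pretext tasks (assumed form).

    sent_emb, next_emb : (d,) embeddings taken from the LLM's outputs at
                         two special prompt positions (hypothetical).
    lm_head            : (V, d) vocabulary projection matrix.
    input_ids          : token ids of the input sentence (EBAE targets).
    next_ids           : token ids of the next sentence (EBAR targets).
    Each task scores the target tokens under a single softmax distribution
    produced directly from the embedding, so the embedding alone must carry
    the sentence's (or next sentence's) content.
    """
    def bag_of_tokens_nll(emb, ids):
        probs = softmax(lm_head @ emb)           # one distribution per embedding
        return -np.mean(np.log(probs[ids]))      # average NLL of target tokens

    ebae = bag_of_tokens_nll(sent_emb, input_ids)  # reconstruct own tokens
    ebar = bag_of_tokens_nll(next_emb, next_ids)   # predict next sentence's tokens
    return ebae + ebar

# Toy usage with random weights; in practice emb and lm_head come from the LLM.
rng = np.random.default_rng(0)
d, V = 8, 32
loss = llara_pretext_loss(
    rng.standard_normal(d), rng.standard_normal(d),
    rng.standard_normal((V, d)),
    input_ids=[1, 2, 3], next_ids=[4, 5],
)
```

Minimizing this sum trains the single-vector embedding, rather than the token-by-token generation pattern, to carry sentence-level semantics, which is the adaptation dense retrieval needs.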