DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers

Published 25 Feb 2025 in cs.CL and cs.IR (arXiv:2502.18460v2)

Abstract: LLMs have demonstrated strong effectiveness and robustness while fine-tuned as dense retrievers. However, their large parameter size brings significant inference time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages. These highlight the potential of connecting the training of smaller retrievers with the growing advancements in LLMs, bridging the gap between efficiency and generalization.

Summary

  • The paper introduces DRAMA, a framework using pruned LLMs and diverse LLM-based data augmentation to train smaller, efficient dense retrieval models.
  • Experimental results show DRAMA matches state-of-the-art performance on retrieval benchmarks while being more efficient, particularly in multilingual contexts.
  • DRAMA's approach enables deploying effective dense retrievers in resource-constrained and cross-lingual scenarios by improving generalization and reducing computational overhead.

An Analysis of "DRAMA: Diverse Augmentation from LLMs to Smaller Dense Retrievers"

The paper proposes DRAMA (smaller Dense Retrievers from diverse LLM AugMentAtion), a framework that seeks to enhance the training of dense retrieval models. Unlike approaches that deploy LLMs directly as retrievers, DRAMA uses LLMs to generate diverse training data for smaller, more efficient dense retrievers. The crux of the research is adapting LLM-derived backbones to handle both multilingual and long-context retrieval while reducing the computational burden typically associated with LLMs.

Motivation and Goals

The primary motivation behind this research is to reconcile the trade-off between the effectiveness and efficiency of dense retrievers. LLMs fine-tuned as retrievers have shown robust performance in text retrieval tasks, but at the expense of significant computational costs due to their large parameter counts. Smaller models, while computationally efficient, often struggle to generalize when only limited supervised data is available for fine-tuning. With DRAMA, the authors propose a training framework that leverages the capabilities of LLMs to train smaller models that retain strong retrieval performance across multiple tasks and languages.

Methodology

The paper introduces several critical strategies:

  1. Pruned LLMs as Backbone: The researchers prune an LLM such as Llama 3.1 into smaller configurations that serve as efficient backbones for dense retrievers, while retaining the original model's multilingual and long-context capabilities.
  2. LLM-based Data Augmentation: Several methods are employed to generate augmented training data with LLMs, including pseudo-queries from cropped sentences, synthetic queries from instruction-following LLMs, and listwise reranking based on LLM preferences. These strategies diversify the training data and thereby improve model generalizability.
  3. Single-Stage Contrastive Learning: The diverse sets of augmented data are combined in a single-stage contrastive learning setup, avoiding multi-stage training pipelines while advancing the generalization capabilities of the resulting retrievers.
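The contrastive setup described above can be sketched as an InfoNCE-style loss over query-passage pairs with in-batch negatives. The following is a minimal, dependency-free illustration; the function name, temperature value, and cosine scoring are illustrative assumptions, not the paper's exact implementation.

```python
import math

def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Contrastive (InfoNCE) loss with in-batch negatives.

    query_vecs[i] is assumed to match passage_vecs[i]; every other
    passage in the batch serves as a negative for query i.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cos(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    losses = []
    for i, q in enumerate(query_vecs):
        # Temperature-scaled similarity of query i to every passage.
        logits = [cos(q, p) / temperature for p in passage_vecs]
        # Cross-entropy against the matching passage (index i).
        log_denom = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

Pulling apart matched pairs (low loss) from mismatched ones (high loss) is what drives the embeddings of queries toward their relevant passages during training.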

Experimental Results

The paper presents a comprehensive comparison of the DRAMA framework against contemporary retrieval methods across various benchmarks, including BEIR and MIRACL. Noteworthy findings include:

  • DRAMA achieves an nDCG@10 of 56.9 on BEIR, demonstrating parity with existing state-of-the-art models.
  • The 0.3B-parameter variant of DRAMA matches larger models such as the 1B-parameter Gecko.
  • DRAMA versions exhibit superior performance in multilingual contexts, surpassing previous baselines across several languages and retrieval tasks.
  • The pruned Llama backbone not only supports multilinguality but also showcases effective long-context retrieval performance even without explicit long-text training.
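The benchmark figures above are reported in nDCG@10, which scores a ranked result list by discounting relevant documents that appear lower in the top 10 and normalizing by the ideal ordering. A minimal sketch of the metric (a standard formulation, not code from the paper):

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query.

    relevances: graded relevance of each retrieved document,
    listed in the order the system ranked them.
    """
    def dcg(rels):
        # Each relevance is discounted by log2(rank + 1), ranks from 1.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; placing relevant documents lower in the list pulls the score toward 0, which is why the metric rewards retrievers that rank well, not just recall well.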

Implications and Future Directions

The paper argues for the practicality of DRAMA in deploying dense retrievers beyond traditional usage in English text retrieval to efficient cross-lingual scenarios without compromising performance. The convergence of reduced computational overhead with high retrieval efficacy holds significant implications for retrieval tasks in resource-constrained environments or applications requiring quick response times.

Looking forward, refining pruning methodologies could enhance model size flexibility and efficiency further. Additionally, expanding the repertoire of synthetic tasks and the breadth of language support within the augmentation process may contribute to even better cross-lingual and domain-specific adaptations.

Conclusion

The DRAMA framework represents a strategic shift in training dense retrieval models: it leverages the potential of LLM-based data augmentation and uses pruned LLMs as backbones. This research underscores the integration of efficiency with generalization, pushing the boundaries of what smaller dense retrievers can achieve across a diverse retrieval landscape. As the field progresses, these insights could spearhead further innovations in text retrieval infrastructure, making AI applications more accessible and effective.
