- The paper introduces CInGS, a technique that trains LLMs to integrate external context during instruction tuning to reduce hallucinations.
- The methodology adapts standard instruction tuning by prepending context to responses and masking it during loss computation, yielding an average 5.5% improvement on text benchmarks.
- Applying CInGS to vision-language models results in reduced hallucinations and enhanced factual consistency even in challenging, distracting contexts.
Context-Informed Grounding Supervision
The paper "Context-Informed Grounding Supervision" explores a new approach for training LLMs to generate contextually grounded responses. The approach, termed Context-Informed Grounding Supervision (CInGS), addresses a well-documented limitation of LLMs: their propensity to hallucinate or fall back on incorrect internal knowledge even when relevant external context is provided.
This study builds on existing work that seeks to overcome LLM hallucinations and limited controllability by integrating external knowledge. The authors observe that merely appending relevant external context to a model's input at inference time does not guarantee that responses are grounded in it. Previous attempts to address this include modified decoding strategies, auxiliary correction modules, and knowledge integration pipelines. However, less attention has been paid to training the LLM itself to naturally incorporate and prioritize external context.
To tackle these challenges, the authors propose CInGS, a straightforward adaptation of standard instruction tuning. Usually, an LLM is trained to produce a response directly from the input instruction. CInGS instead prepends relevant external context to the expected response during training, but computes the loss only over the response tokens, masking out the context. This design aims to reinforce the model's reliance on external information without degrading general downstream performance.
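The masking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token ids are toy values, and the helper name `build_training_example` is assumed; real pipelines would use a tokenizer and a framework convention such as PyTorch's `ignore_index=-100` for cross-entropy loss.

```python
# Minimal sketch of CInGS-style loss masking (illustrative, not the paper's code).
IGNORE_INDEX = -100  # label value ignored by cross-entropy loss in common frameworks

def build_training_example(instruction_ids, context_ids, response_ids):
    """Prepend the external context to the target sequence, but set labels so
    that the loss is computed only over the response tokens."""
    input_ids = list(instruction_ids) + list(context_ids) + list(response_ids)
    labels = (
        [IGNORE_INDEX] * len(instruction_ids)  # no loss on the instruction
        + [IGNORE_INDEX] * len(context_ids)    # no loss on the prepended context
        + list(response_ids)                   # loss only on the response
    )
    return input_ids, labels

# Toy example: instruction [1, 2], context [10, 11, 12], response [20, 21].
ids, labels = build_training_example([1, 2], [10, 11, 12], [20, 21])
print(ids)     # [1, 2, 10, 11, 12, 20, 21]
print(labels)  # [-100, -100, -100, -100, -100, 20, 21]
```

The model still conditions on the context tokens when predicting the response, which is what encourages it to rely on external information; masking simply prevents the context itself from contributing to the training loss.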
In empirical testing, CInGS demonstrated superior grounding in both text and vision-language domains. In the text domain, CInGS outperformed standard instruction-tuned models across 11 information-seeking datasets, yielding an average absolute improvement of 5.5%, with especially large gains in settings that critically depend on context usage. It also surpassed other grounding-focused training methods such as Self-RAG and FactTune. Moreover, when combined with inference-time grounding techniques like AdaCAD and CORG, CInGS yielded further gains, demonstrating its complementary nature.
The study also highlights CInGS's value in vision-language models (VLMs). Replacing a VLM's language backbone with a CInGS-trained one reduced hallucinations and improved factual consistency, particularly on hallucination-focused benchmarks. Notably, CInGS-trained VLMs maintained robust performance even in scenarios with potentially distracting information, and proved more adept at sustaining factual accuracy throughout an entire response, a common weakness of standard models.
The paper's analysis attributes CInGS's effectiveness to a dual mechanism: reduced reliance on outdated internal knowledge, prompted by training with relevant context, and an implicit behavioral shift toward prioritizing the input context during generation. The model tends to gradually forget prior incorrect knowledge, striking a balance between parametric recall and context grounding. Additionally, attention analysis reveals that CInGS-trained models attend more to the external context than to their own previously generated tokens, reinforcing this grounding behavior.
In conclusion, CInGS marks a significant advance in the grounding capabilities of LLMs. It offers a practical and scalable solution that aligns with ongoing efforts to minimize hallucinations by leveraging external knowledge effectively. While its integration into vision-language settings opens new pathways for multimodal understanding, the core benefit of CInGS lies in how seamlessly it combines with existing and future methods for improving contextual responsiveness, without sacrificing general language understanding. Future work may refine and scale CInGS across a broader range of tasks and applications.