
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

Published 3 Mar 2025 in cs.CL and cs.HC | arXiv:2503.02003v3

Abstract: An Achilles heel of LLMs is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain-of-thought prompting (CoT) on 17 tasks spanning arithmetic, reading comprehension, and logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoT highlights tend to make users believe that an answer is correct.

Summary

Highlighted Chain-of-Thought Prompting (HoT) for Enhancing LLMs

The paper introduces Highlighted Chain-of-Thought Prompting (HoT), an approach designed to address the prevalent issue of hallucination in LLMs by making generated responses easier to verify. The method modifies the input prompt with XML-style tags around key factual elements, thereby guiding the LLM to produce responses that reference these highlighted facts. This gives users a systematic way to check which parts of an answer are grounded in the input, a task that is otherwise difficult when factual and non-factual statements are mixed.
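The tagging scheme described above can be sketched in a few lines. Note that the tag names (`fact1`, `fact2`, ...) and the instruction wording below are illustrative assumptions for this sketch, not the paper's exact prompt:

```python
def build_hot_prompt(question: str, facts: list[str]) -> str:
    """Re-format a question by wrapping each key fact in numbered
    XML-style tags, in the spirit of HoT prompting. The model is then
    asked to reuse the same tags when referencing those facts in its
    answer. Tag names and instruction text are illustrative only."""
    tagged = question
    for i, fact in enumerate(facts, start=1):
        # Wrap the first occurrence of each key fact in a numbered tag.
        tagged = tagged.replace(fact, f"<fact{i}>{fact}</fact{i}>", 1)
    return (
        "Answer the question below. When your answer uses a highlighted "
        "fact, wrap it in the matching <factN> tag.\n\n"
        f"Question: {tagged}\nAnswer:"
    )

prompt = build_hot_prompt(
    "Alice has 3 apples and buys 2 more. How many apples does she have?",
    ["3 apples", "buys 2 more"],
)
print(prompt)
```

In the paper's setup the re-formatting itself is done by the LLM via few-shot demonstrations rather than by string matching; this sketch only shows the target tagged format.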

The authors demonstrate that HoT surpasses traditional chain-of-thought prompting across a variety of tasks, including arithmetic, reading comprehension, and logical reasoning. Specifically, experiments show that HoT consistently improves accuracy by 1.60 percentage points in arithmetic tasks, 2.58 in question-answering tasks, and 2.53 in logical reasoning tasks. This enhancement in performance is attributed to the model's improved ability to ground its responses in the facts highlighted from the input, thus reducing the likelihood of fabrications or inaccuracies in its outputs.

Moreover, HoT aids human verification. Highlights let participants check LLM responses faster, cutting verification time by roughly 25% on average. The effect on accuracy, however, is double-edged: highlights help users recognize correct answers, but they can also lead users to overestimate the accuracy of incorrect responses, a nuanced outcome that reflects the complexities of AI-human interaction.
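On the verification side, the tags in a HoT response can be parsed to render the highlights shown to human verifiers. A minimal sketch, assuming the `<factN>` tag convention from above:

```python
import re

def extract_highlights(response: str) -> dict[str, str]:
    """Collect the spans wrapped in <factN>...</factN> tags from a
    HoT-style response, keyed by tag name. The <factN> naming is an
    assumed convention for illustration; a UI would use these spans
    to color-match response highlights to the input question."""
    # The backreference \1 ensures each closing tag matches its opener.
    return {tag: text
            for tag, text in re.findall(r"<(fact\d+)>(.*?)</\1>", response)}

resp = ("She starts with <fact1>3 apples</fact1> and "
        "<fact2>buys 2 more</fact2>, so she has 5 apples.")
print(extract_highlights(resp))
```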

The study also conducted an ablation to discern the contribution of each component of HoT. It finds that repeating the question and tagging both the question and the answer each provide incremental accuracy gains. Tagging only the question or only the response did not yield the full advantage of the complete HoT method, indicating that the holistic tagging approach is what maximizes the benefit.

Despite these promising results, the paper identifies limitations. Smaller models, such as Qwen or certain members of the Llama family, sometimes struggled to adopt the HoT format, likely because limited capacity hinders consistent adherence to the complex input-output transformations HoT demands.

The paper opens a fresh avenue for improving the factual reliability of LLMs and proposes directions for further research, such as finetuning models to internalize HoT-like tagging within their reasoning process and studying how the visual presentation of highlighted text affects LLM performance. As LLMs are deployed in increasingly critical applications, such advances hold promise for enhancing their utility while mitigating the risks of inaccurate or unverifiable outputs.
