Highlighted Chain-of-Thought Prompting (HoT) for Enhancing LLMs
The paper investigates Highlighted Chain-of-Thought Prompting (HoT), an approach designed to address the prevalent issue of hallucination in LLMs by making generated responses more verifiable and accurate. The method augments input prompts with XML-style tags around key factual elements, guiding the LLM to produce responses that explicitly reference these highlighted facts. This gives users a systematic way to evaluate outputs that mix factual and non-factual statements, which are otherwise difficult to verify.
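As a rough illustration of the idea, the tagging step can be sketched as below. This is a minimal sketch, not the paper's implementation: the `<fact1>...</fact1>` tag format, the helper names, and the instruction wording are assumptions chosen for readability.

```python
def tag_facts(question: str, facts: list[str]) -> str:
    """Wrap each key factual span in numbered XML-style tags.

    Tag format (<factN>...</factN>) is illustrative, not the paper's exact spec.
    """
    tagged = question
    for i, fact in enumerate(facts, start=1):
        # Tag only the first occurrence of each span to keep tags unambiguous.
        tagged = tagged.replace(fact, f"<fact{i}>{fact}</fact{i}>", 1)
    return tagged


def build_hot_prompt(question: str, facts: list[str]) -> str:
    """Compose a HoT-style prompt asking the model to ground its answer in the tags."""
    return (
        "Re-read the question below, paying attention to the highlighted facts, "
        "and reference them (e.g. <fact1>...</fact1>) in your answer.\n\n"
        f"Question: {tag_facts(question, facts)}\nAnswer:"
    )


question = "Alice has 3 apples and buys 4 more. How many apples does she have?"
print(build_hot_prompt(question, ["3 apples", "buys 4 more"]))
```

The resulting prompt contains the question with `<fact1>3 apples</fact1>` and `<fact2>buys 4 more</fact2>` highlighted, and the model is expected to echo those tags when it uses those facts in its reasoning.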
The authors demonstrate that HoT outperforms traditional chain-of-thought prompting across a variety of tasks, including arithmetic, reading comprehension, and logical reasoning. Specifically, experiments show that HoT consistently improves accuracy, by 1.60 percentage points on arithmetic tasks, 2.58 on question answering, and 2.53 on logical reasoning. The authors attribute this gain to the model's improved ability to ground its responses in the facts highlighted in the input, which reduces the likelihood of fabrications or inaccuracies in its outputs.
Moreover, HoT aids human verification. It makes checking LLM responses more efficient, reducing verification time by an average of 25%, but its effect on user accuracy is mixed: while highlights help users recognize correct answers, they can also lead users to overestimate the accuracy of incorrect responses, a nuanced outcome that reflects the complexities inherent in AI-human interaction.
The study also includes an ablation to discern the contribution of each component of HoT. It finds that repeating the question, and tagging both the question and the answer, each provide incremental accuracy gains. Variants that tag only the question or only the response did not yield the full advantage of the complete method, indicating that a holistic tagging approach is needed to maximize the benefit.
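One practical consequence of tagging both sides is that a checker can verify whether each fact tag in the answer actually echoes a tagged span from the question. The sketch below shows one way to do that; the regex, tag format, and function names are illustrative assumptions, not part of the paper.

```python
import re

# Matches <factN>span</factN>, pairing each opening tag with its own closing
# tag via the \1 backreference. Tag format is an assumption for this example.
TAG_RE = re.compile(r"<fact(\d+)>(.*?)</fact\1>")


def extract_facts(text: str) -> dict[str, str]:
    """Map each tag index to the span it highlights."""
    return {idx: span for idx, span in TAG_RE.findall(text)}


def grounded_references(question: str, answer: str):
    """Split answer tags into those matching a question fact and dangling ones."""
    q_facts = extract_facts(question)
    matched, dangling = [], []
    for idx, span in extract_facts(answer).items():
        (matched if q_facts.get(idx) == span else dangling).append(idx)
    return matched, dangling


q = "Alice has <fact1>3 apples</fact1> and <fact2>buys 4 more</fact2>."
a = "She starts with <fact1>3 apples</fact1>, adds 4, giving 7."
print(grounded_references(q, a))  # → (['1'], [])
```

A dangling tag (an answer span with no matching question fact) would be a cheap signal that the response is not grounded in the input, which is the failure mode question-only or answer-only tagging cannot catch.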
Despite these promising results, the paper identifies limitations. Smaller models, such as Qwen or certain members of the Llama family, sometimes struggled to adapt to the HoT format, likely because capacity limitations hinder consistent adherence to the complex input-output transformation that HoT demands.
The paper presents a fresh avenue for improving the factual reliability of LLMs and proposes directions for further research, such as finetuning models to adopt HoT-like tagging natively within their reasoning processes, and exploring how the visual presentation of highlighted text affects LLM performance. As LLMs are deployed in increasingly critical applications, such advances hold promise for enhancing their utility while mitigating the risks of inaccurate or unverifiable information.