
Improving Consistency in Large Language Models through Chain of Guidance

Published 21 Feb 2025 in cs.CL | (2502.15924v1)

Abstract: Consistency is a fundamental dimension of trustworthiness in LLMs. For humans to be able to trust LLM-based applications, their outputs should be consistent when prompted with inputs that carry the same meaning or intent. Despite this need, there is no known mechanism to control and guide LLMs to be more consistent at inference time. In this paper, we introduce a novel alignment strategy to maximize semantic consistency in LLM outputs. Our proposal is based on Chain of Guidance (CoG), a multistep prompting technique that generates highly consistent outputs from LLMs. For closed-book question-answering (Q&A) tasks, when compared to direct prompting, the outputs generated using CoG show improved consistency. While other approaches like template-based responses and majority voting may offer alternative paths to consistency, our work focuses on exploring the potential of guided prompting. We use synthetic data sets comprised of consistent input-output pairs to fine-tune LLMs to produce consistent and correct outputs. Our fine-tuned models are more than twice as consistent compared to base models and show strong generalization capabilities by producing consistent outputs over datasets not used in the fine-tuning process.

Summary

  • The paper introduces Chain of Guidance (CoG), a novel multi-step prompting technique designed to significantly improve semantic consistency in Large Language Models.
  • The core CoG methodology employs a three-step pipeline: generating paraphrases, obtaining guided answers, and ranking responses through a multiple-choice evaluation.
  • Empirical results indicate CoG enhances LLM consistency metrics by up to 49%, and synthetic data from CoG can effectively fine-tune smaller models while preserving generalization.

Improving Consistency in LLMs through Chain of Guidance

The paper presents a comprehensive approach to enhancing semantic consistency in LLMs through a novel multi-step prompting technique known as Chain of Guidance (CoG). The central issue addressed is the inconsistency of LLM outputs when presented with paraphrased versions of the same input, a property crucial for building trust in LLM-based applications. The paper proposes a methodical pathway to close this gap and demonstrates the efficacy of CoG through empirical evaluations.

Core Methodology

The proposed Chain of Guidance strategy is a three-step, prompt-driven pipeline that uses in-context learning to guide LLM outputs at inference time. The process involves:

  1. Paraphrase Generation: Initially, multiple realistic paraphrases of an input question are generated by prompting an auxiliary LLM. These paraphrases form the basis for synthetically enriched datasets aimed at training LLMs to recognize semantic equivalence.
  2. Guided Answer Generation: After generating paraphrased versions of a question, preliminary answers are obtained for each. These answers are then condensed into one- or two-word responses to simplify the subsequent evaluation step.
  3. Answer Ranking: Finally, the concise candidate answers are subjected to a multiple-choice evaluation in which the LLM selects the semantically correct option, aligning its output across paraphrases.
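The three steps above can be sketched as a single function. This is a minimal illustration of the control flow only, not the paper's actual prompts: `llm` is a hypothetical callable (prompt in, completion out), stubbed here with canned responses so the pipeline runs end to end.

```python
# Sketch of the three-step Chain of Guidance pipeline (illustrative only).
def chain_of_guidance(question, llm, n_paraphrases=3):
    # Step 1: paraphrase generation via an auxiliary LLM call.
    paraphrases = [
        llm(f"Paraphrase the question: {question} (variant {i})")
        for i in range(n_paraphrases)
    ]
    # Step 2: guided answer generation, condensed to short responses.
    short_answers = [
        llm(f"Answer in one or two words: {p}") for p in paraphrases
    ]
    # Step 3: answer ranking via a multiple-choice prompt over the candidates.
    options = "\n".join(f"{i}. {a}" for i, a in enumerate(short_answers))
    choice = llm(
        f"Question: {question}\nWhich option is correct?\n{options}\n"
        "Reply with the option number only."
    )
    return short_answers[int(choice)]

def stub_llm(prompt):
    # Deterministic stand-in for a real model, for illustration only.
    if prompt.startswith("Paraphrase"):
        return "What is the capital of France?"
    if prompt.startswith("Answer"):
        return "Paris"
    return "0"  # ranking step: pick the first option

print(chain_of_guidance("Which city is France's capital?", stub_llm))  # Paris
```

In a real deployment, `llm` would wrap an API or local model, and the ranking step is what forces the model to commit to one answer across all paraphrases.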

Empirical Evaluations

The research validates the CoG method by applying it across various LLMs, including Flan-T5 XL and models from the Llama and GPT families. Consistency is measured with several semantic similarity metrics, notably entailment-based agreement, paraphrase detection, and Rouge-L. Results consistently show improved consistency metrics after applying CoG, illustrating the methodology's effectiveness in aligning generated outputs with human evaluation standards. For instance, the study reports gains of up to 49% in semantic consistency when CoG is utilized.
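One simple way to turn a lexical metric like Rouge-L into a consistency score, assuming (as a simplification of the paper's evaluation) that consistency is averaged pairwise over the answers a model gives to paraphrases of the same question:

```python
# Average pairwise Rouge-L F1 over answers to paraphrased questions
# (a simplified sketch of one consistency metric; not the paper's exact code).
from itertools import combinations

def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(ref, hyp):
    r, h = ref.split(), hyp.split()
    lcs = lcs_len(r, h)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(h), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def pairwise_consistency(answers):
    pairs = list(combinations(answers, 2))
    return sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)

print(pairwise_consistency(["paris", "paris", "paris"]))  # 1.0
print(pairwise_consistency(["paris", "lyon", "paris"]))   # lower score
```

Identical answers across paraphrases score 1.0; any divergence pulls the average down, which is why this style of metric rewards the CoG-style commitment to a single answer.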

Finetuning and Distillation Experiments

In addition to assessing CoG itself, the paper evaluates its potential to fine-tune less capable LLMs. Using synthetic datasets generated with CoG, two fine-tuning strategies are explored—Low-Rank Adaptation (LoRA) and Supervised Fine-Tuning (SFT)—applied to models such as Llama 2 7B Chat and Llama 3 8B Instruct. Empirical results show that both LoRA and SFT improve consistency metrics, while overall model performance across various benchmark tasks remains largely unaffected, preserving adaptability for diverse LLM applications. Importantly, the fine-tuned models exhibit notable generalization, remaining consistent on datasets not used during fine-tuning.
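The core idea behind the LoRA strategy mentioned above can be shown numerically: rather than updating a full weight matrix W, one trains a low-rank update so that the effective weight is W' = W + (alpha/r)·BA. The pure-Python sketch below uses toy 2x2 matrices with illustrative values; it is not tied to any particular library.

```python
# Minimal numerical sketch of the Low-Rank Adaptation (LoRA) idea.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(W, A, B, x, alpha=1.0, r=1):
    # y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train.
    col = [[v] for v in x]
    base = matmul(W, col)
    delta = matmul(B, matmul(A, col))
    return [base[i][0] + (alpha / r) * delta[i][0] for i in range(len(base))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2 identity)
A = [[1.0, 1.0]]               # rank-1 factors: A is r x d_in
B = [[0.0], [0.0]]             # B is d_out x r, zero-initialized as in LoRA

x = [2.0, 3.0]
print(lora_forward(W, A, B, x))   # B = 0, so output equals W x: [2.0, 3.0]
B = [[0.5], [0.5]]                # after (hypothetical) training
print(lora_forward(W, A, B, x))   # base output plus the low-rank update
```

Because B starts at zero, the adapted model initially behaves exactly like the base model; training only the small A and B matrices is what makes LoRA far cheaper than full SFT on a 7B–8B model.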

Discussion and Implications

The authors argue that CoG represents an advantageous strategy for LLM alignment towards consistent outputs across semantically identical inputs—an essential development for improving the application of LLMs in practical, real-world contexts. By adapting CoG’s modular pipeline, consistency in other areas such as fairness and safety could also be enhanced, broadening its potential use cases. While alternative methods such as fixed answers or majority voting exist, CoG’s design optimizes the trade-off between flexibility, consistency, and computational efficiency.
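For contrast with CoG, the majority-voting alternative mentioned above can be sketched in a few lines: sample several answers and keep the most common one. This is a generic baseline, not the paper's implementation.

```python
# Majority-voting baseline: pick the most frequent of several sampled answers.
from collections import Counter

def majority_vote(answers):
    # Counter.most_common is stable, so ties resolve to the answer
    # that appeared first.
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["paris", "paris", "lyon"]))  # paris
```

The trade-off the authors note is visible here: voting requires multiple samples per query and only smooths over inconsistency after the fact, whereas CoG steers the model toward a single consistent answer during generation.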

Future Directions

Future work could explore the extension of CoG’s architecture to other domains of LLM improvement—such as creative writing, fairness, and robustness against adversarial attacks—by appropriately modifying individual components of the CoG pipeline to match domain-specific requirements. Moreover, larger datasets and more robust evaluation metrics could be employed to sustain and validate these adaptations further, potentially incorporating human-in-the-loop systems for higher accuracy in dataset curation and model evaluation.

In conclusion, Chain of Guidance emerges as a promising framework for refining LLM outputs toward more consistent, human-like judgments, marking a step toward increased reliability and trustworthiness in AI applications. The paper lays a foundation for future work on aligning LLM behavior with nuanced human understanding and interaction.
