Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation
Abstract: In this work, we propose sequence-level certainty as a common theme underlying hallucination in Knowledge-Grounded Dialogue Generation (KGDG). We explore the correlation between the level of hallucination in model responses and two types of sequence-level certainty: probabilistic certainty and semantic certainty. Empirical results reveal that higher levels of both types of certainty in model responses correlate with lower levels of hallucination. We further propose Certainty-based Response Ranking (CRR), a decoding-time hallucination-mitigation method that samples several response candidates, ranks them by sequence-level certainty, and outputs the candidate with the highest certainty. Following our two definitions of sequence-level certainty, we design two CRR variants: Probabilistic CRR (P-CRR) and Semantic CRR (S-CRR). P-CRR ranks individually sampled model responses by the arithmetic mean log-probability of the entire sequence. S-CRR approaches certainty estimation from the meaning space, ranking response candidates by their semantic certainty as measured by an entailment-based Agreement Score (AS). Through extensive experiments across three KGDG datasets, three decoding methods, and four KGDG models, we validate the effectiveness of CRR in reducing hallucination on the KGDG task.
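The abstract's description of the two CRR variants can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: P-CRR is shown as selecting the candidate with the highest arithmetic mean token log-probability, and S-CRR as selecting the candidate with the highest average pairwise entailment agreement. The `entail_prob` callable is a hypothetical stand-in for an NLI model, and the exact aggregation used in the paper's Agreement Score may differ (here, bidirectional entailment is combined with `min`).

```python
def p_crr_select(candidates):
    """Probabilistic CRR sketch.

    candidates: list of (text, token_logprobs) pairs, where token_logprobs
    is the per-token log-probability of the sampled sequence.
    Returns the text whose mean log-probability is highest.
    """
    def mean_logprob(pair):
        _, logprobs = pair
        return sum(logprobs) / len(logprobs)

    return max(candidates, key=mean_logprob)[0]


def s_crr_select(texts, entail_prob):
    """Semantic CRR sketch.

    texts: list of candidate response strings.
    entail_prob(premise, hypothesis) -> float in [0, 1]; in practice this
    would be an NLI model's entailment probability (hypothetical here).
    Each candidate's agreement score is its average bidirectional
    entailment with every other candidate; the highest-agreement
    candidate is returned.
    """
    def agreement(i):
        others = [j for j in range(len(texts)) if j != i]
        return sum(
            min(entail_prob(texts[i], texts[j]),
                entail_prob(texts[j], texts[i]))
            for j in others
        ) / len(others)

    return texts[max(range(len(texts)), key=agreement)]
```

Under this sketch, P-CRR needs only the sampler's token log-probabilities, while S-CRR needs an external entailment scorer; both operate purely at decoding time over a fixed pool of sampled candidates, matching the decoding-time framing in the abstract.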