FaaF: Facts as a Function for the evaluation of generated text

Published 6 Mar 2024 in cs.CL (arXiv:2403.03888v3)

Abstract: The demand for accurate and efficient verification of information in texts generated by large LMs is at an all-time high, but the problem remains unresolved. Recent efforts have focused on extracting atomic facts from these texts and verifying them by prompting LM evaluators. However, we demonstrate that this method of prompting is unreliable when faced with incomplete or inaccurate reference information. We introduce Facts as a Function (FaaF), a new approach to the fact verification task that leverages the function-calling capabilities of LMs. FaaF significantly enhances the ability of LMs to identify unsupported facts in texts, while also improving efficiency and substantially lowering costs compared to prompt-based methods. Additionally, we propose a framework for evaluating factual recall in Retrieval Augmented Generation (RAG) systems, which we employ to compare prompt-based and FaaF methods using various LMs under challenging conditions.
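The core idea of formulating facts as a function can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes an OpenAI-style function definition in which each extracted fact becomes one boolean field, so a single constrained function call verifies all facts at once instead of one prompt per fact. The helper name `build_faaf_schema` and the field naming are hypothetical.

```python
def build_faaf_schema(facts):
    """Build an OpenAI-style function definition whose parameters are
    one boolean per fact, so the LM verifies all facts in one call."""
    properties = {
        f"fact_{i}": {
            "type": "boolean",
            "description": f"Is this supported by the reference text? {fact}",
        }
        for i, fact in enumerate(facts)
    }
    return {
        "name": "verify_facts",
        "description": "Mark each fact true only if the reference supports it.",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }

facts = [
    "Paris is the capital of France.",
    "The Seine flows through Berlin.",
]
schema = build_faaf_schema(facts)
```

The resulting `schema` would be passed as a tool/function definition alongside the reference text; the LM's structured function-call response then yields a true/false verdict per fact, which is what makes the approach cheaper than issuing a separate verification prompt for each atomic fact.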
