
A Course Shared Task on Evaluating LLM Output for Clinical Questions

Published 31 Jul 2024 in cs.CL | (2408.00122v1)

Abstract: This paper presents a shared task that we organized in the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt. The task focuses on evaluating LLM output for harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students. We expect the task and the findings reported in this paper to be relevant for instructors teaching NLP and designing course assignments.

