Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Abstract: NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in LLMs offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate their effects during training to improve model performance.
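The ensemble-of-judges idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `judges` stands in for calls to actual LLMs (each returning a binary label), and the `min_agreement` threshold and `flag_label_errors` helper are hypothetical names chosen for the example.

```python
from typing import Callable, List, Sequence


def flag_label_errors(
    examples: Sequence[str],
    labels: Sequence[int],
    judges: Sequence[Callable[[str], int]],
    min_agreement: int = 2,
) -> List[int]:
    """Flag indices of potentially mislabeled examples.

    Each judge is a callable mapping an example to a predicted binary
    label (a stand-in for an LLM-as-a-judge call). An example is flagged
    when at least `min_agreement` judges contradict its dataset label.
    """
    flagged = []
    for i, (example, label) in enumerate(zip(examples, labels)):
        votes = [judge(example) for judge in judges]
        disagreements = sum(1 for vote in votes if vote != label)
        if disagreements >= min_agreement:
            flagged.append(i)
    return flagged
```

Flagged examples would then be sent for expert re-annotation rather than relabeled automatically, keeping the LLM ensemble in a triage role.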
- Quality control in crowdsourcing systems: Issues and directions. IEEE Internet Computing, 17(2):76–81, 2013. doi: 10.1109/MIC.2013.20.
- Palm 2 technical report. CoRR, abs/2305.10403, 2023. doi: 10.48550/ARXIV.2305.10403. URL https://doi.org/10.48550/arXiv.2305.10403.
- Validity, agreement, consensuality and annotated data quality. In International Conference on Language Resources and Evaluation, 2022. URL https://api.semanticscholar.org/CorpusID:251465628.
- Large language models as annotators: A preliminary evaluation for annotating low-resource language content. In Daniel Deutsch, Rotem Dror, Steffen Eger, Yang Gao, Christoph Leiter, Juri Opitz, and Andreas Rücklé (eds.), Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, pp. 100–107, Bali, Indonesia, November 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eval4nlp-1.8. URL https://aclanthology.org/2023.eval4nlp-1.8.
- Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
- On behalf of the stakeholders: Trends in NLP model interpretability in the era of llms. CoRR, abs/2407.19200, 2024. doi: 10.48550/ARXIV.2407.19200. URL https://doi.org/10.48550/arXiv.2407.19200.
- Measuring the robustness of nlp models to domain shifts. arXiv preprint arXiv:2306.00168, 2024. URL https://doi.org/10.48550/arXiv.2306.00168.
- Understanding the tradeoff between cost and quality of expert annotations for keyphrase extraction. In Law, 2020. URL https://api.semanticscholar.org/CorpusID:227231506.
- Probing the “creativity” of large language models: Can models produce divergent semantic association? In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 12881–12888, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.858. URL https://aclanthology.org/2023.findings-emnlp.858.
- Is a large language model a good annotator for event extraction? In AAAI Conference on Artificial Intelligence, 2024. URL https://api.semanticscholar.org/CorpusID:268710109.
- Can large language models be an alternative to human evaluations? In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15607–15631, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.870. URL https://aclanthology.org/2023.acl-long.870.
- Detecting label errors by using pre-trained language models. In Conference on Empirical Methods in Natural Language Processing, 2022a. URL https://api.semanticscholar.org/CorpusID:249063028.
- Detecting label errors by using pre-trained language models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 9074–9091. Association for Computational Linguistics, 2022b. doi: 10.18653/V1/2022.EMNLP-MAIN.618. URL https://doi.org/10.18653/v1/2022.emnlp-main.618.
- The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/2331986.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1423. URL https://doi.org/10.18653/v1/n19-1423.
- Thomas G. Dietterich. Ensemble methods in machine learning. 2007. URL https://api.semanticscholar.org/CorpusID:10765854.
- Wizard of wikipedia: Knowledge-powered conversational agents. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=r1l73iRqKm.
- The llama 3 herd of models. CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407.21783. URL https://doi.org/10.48550/arXiv.2407.21783.
- Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computational Linguistics, 10:1066–1083, 2022. doi: 10.1162/tacl_a_00506. URL https://aclanthology.org/2022.tacl-1.62.
- Qafacteval: Improved qa-based factual consistency evaluation for summarization. In North American Chapter of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:245218667.
- Gpt is not an annotator: The necessity of human annotation in fairness benchmark construction. ArXiv, abs/2405.15760, 2024. URL https://api.semanticscholar.org/CorpusID:270045683.
- Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378–382, 1971. URL https://api.semanticscholar.org/CorpusID:143544759.
- Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25:845–869, 2014. URL https://api.semanticscholar.org/CorpusID:6054025.
- Faithful explanations of black-box NLP models using llm-generated counterfactuals. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=UMfcdRIotC.
- TrueTeacher: Learning factual consistency evaluation with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2053–2070, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.127. URL https://aclanthology.org/2023.emnlp-main.127.
- Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences of the United States of America, 120, 2023. URL https://api.semanticscholar.org/CorpusID:257766307.
- Inaccurate labels in weakly-supervised deep learning: Automatic identification and correction and their impact on classification performance. IEEE Journal of Biomedical and Health Informatics, 24:2701–2710, 2020. URL https://api.semanticscholar.org/CorpusID:211232156.
- Annollm: Making large language models to be better crowdsourced annotators. In North American Chapter of the Association for Computational Linguistics, 2023. URL https://api.semanticscholar.org/CorpusID:257805087.
- Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
- q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. ArXiv, abs/2104.08202, 2021. URL https://api.semanticscholar.org/CorpusID:233289483.
- TRUE: re-evaluating factual consistency evaluation. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp. 3905–3920. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.NAACL-MAIN.287. URL https://doi.org/10.18653/v1/2022.naacl-main.287.
- Mistral 7b. CoRR, abs/2310.06825, 2023. doi: 10.48550/ARXIV.2310.06825. URL https://doi.org/10.48550/arXiv.2310.06825.
- Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.
- The shape of and solutions to the mturk quality crisis. Political Science Research and Methods, 8(4):614–629, 2020. URL https://www.cambridge.org/core/journals/political-science-research-and-methods/article/shape-of-and-solutions-to-the-mturk-quality-crisis/521AEEB9A9753D5C6038440BD123826C.
- Llms in the loop: Leveraging large language model annotations for active learning in low-resource languages. ArXiv, abs/2404.02261, 2024. URL https://api.semanticscholar.org/CorpusID:268876095.
- Meganno+: A human-llm collaborative annotation system. In Conference of the European Chapter of the Association for Computational Linguistics, 2024. URL https://api.semanticscholar.org/CorpusID:268041346.
- Evaluating the factual consistency of abstractive text summarization. In Conference on Empirical Methods in Natural Language Processing, 2019. URL https://api.semanticscholar.org/CorpusID:204976362.
- SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177, 2022. doi: 10.1162/tacl_a_00453. URL https://aclanthology.org/2022.tacl-1.10.
- Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation. ArXiv, abs/2310.15638, 2023. URL https://api.semanticscholar.org/CorpusID:264439555.
- The colorful future of llms: Evaluating and improving llms as emotional supporters for queer youth. In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pp. 2040–2079. Association for Computational Linguistics, 2024. doi: 10.18653/V1/2024.NAACL-LONG.113. URL https://doi.org/10.18653/v1/2024.naacl-long.113.
- Research on data quality control of crowdsourcing annotation: A survey. In 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), pp. 201–208, 2020. doi: 10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00044.
- An extended model of natural logic. In Harry Bunt (ed.), Proceedings of the Eight International Conference on Computational Semantics, pp. 140–156, Tilburg, The Netherlands, January 2009. Association for Computational Linguistics. URL https://aclanthology.org/W09-3714.
- On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 1906–1919. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.ACL-MAIN.173. URL https://doi.org/10.18653/v1/2020.acl-main.173.
- Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1797–1807, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1206. URL https://aclanthology.org/D18-1206.
- Combining crowd and expert labels using decision theoretic active learning. In AAAI Conference on Human Computation & Crowdsourcing, 2015. URL https://api.semanticscholar.org/CorpusID:12521058.
- Self: Learning to filter noisy labels with self-ensembling. ArXiv, abs/1910.01842, 2019. URL https://api.semanticscholar.org/CorpusID:203737303.
- Confident learning: Estimating uncertainty in dataset labels. J. Artif. Intell. Res., 70:1373–1411, 2019. URL https://api.semanticscholar.org/CorpusID:207870256.
- Pervasive label errors in test sets destabilize machine learning benchmarks. ArXiv, abs/2103.14749, 2021. URL https://api.semanticscholar.org/CorpusID:232404905.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
- Training language models to follow instructions with human feedback. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
- Identifying mislabeled data using the area under the margin ranking. ArXiv, abs/2001.10528, 2020. URL https://api.semanticscholar.org/CorpusID:210932316.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL https://jmlr.org/papers/v21/20-074.html.
- SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Kevin Duh, and Xavier Carreras (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.
- Identifying incorrect labels in the conll-2003 corpus. In Raquel Fernández and Tal Linzen (eds.), Proceedings of the 24th Conference on Computational Natural Language Learning, CoNLL 2020, Online, November 19-20, 2020, pp. 215–226. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.CONLL-1.16. URL https://doi.org/10.18653/v1/2020.conll-1.16.
- Investigating the disagreement between clinicians’ ratings of patients in icus. IEEE J. Biomed. Health Informatics, 17(4):843–852, 2013. doi: 10.1109/JBHI.2013.2252182. URL https://doi.org/10.1109/JBHI.2013.2252182.
- Get your vitamin C! robust fact verification with contrastive evidence. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 624–643, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.52. URL https://aclanthology.org/2021.naacl-main.52.
- Cheap and fast – but is it good? evaluating non-expert annotations for natural language tasks. In Conference on Empirical Methods in Natural Language Processing, 2008. URL https://api.semanticscholar.org/CorpusID:7008675.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Trans. Mach. Learn. Res., 2023, 2023. URL https://openreview.net/forum?id=uyTL5Bvosj.
- With a little push, NLI models can robustly and efficiently predict faithfulness. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 914–924, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-short.79. URL https://aclanthology.org/2023.acl-short.79.
- The impact of inconsistent human annotations on AI driven clinical decision making. npj Digit. Medicine, 6, 2023. doi: 10.1038/S41746-023-00773-3. URL https://doi.org/10.1038/s41746-023-00773-3.
- Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2818–2826. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.308. URL https://doi.org/10.1109/CVPR.2016.308.
- Evaluating the factual consistency of large language models through news summarization. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 5220–5255, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.322. URL https://aclanthology.org/2023.findings-acl.322.
- FEVER: a large-scale dataset for fact extraction and VERification. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 809–819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https://aclanthology.org/N18-1074.
- Petter Törnberg. Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. ArXiv, abs/2304.06588, 2023. URL https://api.semanticscholar.org/CorpusID:258108255.
- Learning from disagreement: A survey. J. Artif. Intell. Res., 72:1385–1470, 2021. doi: 10.1613/JAIR.1.12752. URL https://doi.org/10.1613/jair.1.12752.
- Navigating cultural chasms: Exploring and unlocking the cultural POV of text-to-image models. CoRR, abs/2310.01929, 2023. doi: 10.48550/ARXIV.2310.01929. URL https://doi.org/10.48550/arXiv.2310.01929.
- Prevalence and prevention of large language model use in crowd work. CoRR, abs/2310.15683, 2023a. doi: 10.48550/ARXIV.2310.15683. URL https://doi.org/10.48550/arXiv.2310.15683.
- Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. CoRR, abs/2306.07899, 2023b. doi: 10.48550/ARXIV.2306.07899. URL https://doi.org/10.48550/arXiv.2306.07899.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7.
- Less is more for improving automatic evaluation of factual consistency. In Yi Yang, Aida Davani, Avi Sil, and Anoop Kumar (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pp. 324–334, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-industry.27. URL https://aclanthology.org/2024.naacl-industry.27.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5085–5109, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.340. URL https://aclanthology.org/2022.emnlp-main.340.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
- WeCheck: Strong factual consistency checker via weakly supervised learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 307–321, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.18. URL https://aclanthology.org/2023.acl-long.18.
- Factual consistency evaluation for text summarization via counterfactual estimation. In Conference on Empirical Methods in Natural Language Processing, 2021. URL https://api.semanticscholar.org/CorpusID:237353254.
- Improving factual consistency for knowledge-grounded dialogue systems via knowledge enhancement and alignment. In Conference on Empirical Methods in Natural Language Processing, 2023. URL https://api.semanticscholar.org/CorpusID:263909130.
- Alignscore: Evaluating factual consistency with a unified alignment function. In Annual Meeting of the Association for Computational Linguistics, 2023. URL https://api.semanticscholar.org/CorpusID:258947273.
- mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb.
- Llmaaa: Making large language models as active annotators. ArXiv, abs/2310.19596, 2023. URL https://api.semanticscholar.org/CorpusID:264814421.
- PAWS: Paraphrase adversaries from word scrambling. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1298–1308, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1131. URL https://aclanthology.org/N19-1131.
- Judging llm-as-a-judge with mt-bench and chatbot arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.