
Eliciting Informative Text Evaluations with Large Language Models

Published 23 May 2024 in cs.CL, cs.AI, and cs.GT | arXiv:2405.15077v4

Abstract: Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods apply only to rather simple reports, such as multiple-choice answers or scalar numbers. We aim to broaden these techniques to the much larger domain of text-based reports, drawing on recent developments in LLMs. This vastly increases the applicability of peer prediction mechanisms, as textual feedback is the norm across a wide variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media. We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms utilize LLMs as predictors, mapping from one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. Notably, on the ICLR dataset, our mechanisms differentiate three quality levels in terms of expected scores: human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM does.
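The core idea of a generative peer prediction mechanism can be sketched in a few lines: score an agent by how much her report improves a predictor's (log-)likelihood of her peer's report. The sketch below is an illustration of that scoring rule only, not the paper's implementation; `toy_logprob` is a hypothetical word-overlap stand-in for an LLM predictor, and the exact payment formula in the paper may differ.

```python
def gppm_score(report_i, report_j, predict_logprob):
    """Toy generative peer prediction score (sketch, not the paper's exact mechanism).

    Agent i is paid by how much conditioning on her report improves the
    predicted log-probability of peer j's report, relative to a prior
    prediction made without her report.
    """
    prior = predict_logprob(report_j, context=None)
    posterior = predict_logprob(report_j, context=report_i)
    return posterior - prior  # a pointwise-mutual-information-style score


def toy_logprob(target, context=None):
    """Hypothetical stand-in for an LLM: penalize each target word not
    explained by the context (None means an uninformed prior)."""
    words = set(target.lower().split())
    if context is None:
        return -float(len(words))
    overlap = len(words & set(context.lower().split()))
    return -float(len(words) - overlap)


# An informative report about the peer's review scores higher than filler.
informative = gppm_score("the proofs are sound", "proofs are correct and sound", toy_logprob)
uninformative = gppm_score("nice paper", "proofs are correct and sound", toy_logprob)
assert informative > uninformative
```

Under this scoring rule, a report correlated with the peer's report raises the predictor's likelihood and earns a positive score, while an uncorrelated low-effort report earns nothing, which is the intuition behind the truth-telling incentive.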
