Evaluating Superhuman Models with Consistency Checks
Abstract: If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this paper, we propose a framework for evaluating superhuman models via consistency checks. Our premise is that while the correctness of superhuman decisions may be impossible to evaluate, we can still surface mistakes if the model's decisions fail to satisfy certain logical, human-interpretable rules. We instantiate our framework on three tasks where the correctness of decisions is hard to evaluate, due either to superhuman model abilities or to otherwise missing ground truth: evaluating chess positions, forecasting future events, and making legal judgments. We show that regardless of a model's (possibly superhuman) performance on these tasks, we can discover logical inconsistencies in its decision making. Examples include a chess engine assigning opposing valuations to semantically identical boards, GPT-4 forecasting that sports records will evolve non-monotonically over time, and an AI judge assigning bail to a defendant only after we add a felony to their criminal record.
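To make the idea of a consistency check concrete, here is a minimal sketch in Python of one rule mentioned in the abstract: forecasts of a sports record should evolve monotonically over time. The function name `monotonicity_violations`, its parameters, and the forecast values are all hypothetical and chosen for illustration; this is not the paper's implementation. The check needs no ground truth: it only compares the model's answers against each other.

```python
from typing import List, Sequence, Tuple


def monotonicity_violations(
    years: Sequence[int],
    forecasts: Sequence[float],
    non_increasing: bool = True,
) -> List[Tuple[int, int]]:
    """Return (earlier_year, later_year) pairs whose forecasts break monotonicity.

    With non_increasing=True the forecast value may never go up over time
    (e.g. a record time in a race); with False it may never go down
    (e.g. a record distance in a jump).
    """
    if len(years) != len(forecasts):
        raise ValueError("years and forecasts must have the same length")

    # Sort by year so that adjacent entries are consecutive forecasts.
    ordered = sorted(zip(years, forecasts))

    violations = []
    for (y0, f0), (y1, f1) in zip(ordered, ordered[1:]):
        went_up = f1 > f0
        went_down = f1 < f0
        if (non_increasing and went_up) or (not non_increasing and went_down):
            violations.append((y0, y1))
    return violations


# Hypothetical model forecasts for a record time in seconds (illustrative only).
years = [2025, 2026, 2027, 2028]
forecasts = [9.58, 9.56, 9.60, 9.55]

# The forecast rises from 2026 to 2027, which a record time cannot do.
print(monotonicity_violations(years, forecasts))
```

Any pair the function returns is a logical inconsistency regardless of what the true future record turns out to be, which is the property that makes such checks usable when ground truth is missing.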