Knowledge-based Consistency Testing of Large Language Models
Abstract: In this work, we systematically expose and measure the inconsistency and knowledge gaps of LLMs. Specifically, we propose an automated testing framework, called KonTest, which leverages a knowledge graph to construct test cases. KonTest probes and measures inconsistencies in an LLM's knowledge of the world via a combination of semantically equivalent queries and test oracles (metamorphic or ontological oracles). KonTest further mitigates knowledge gaps via a weighted LLM model ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that KonTest generates 19.2% error-inducing inputs (1917 errors from 9979 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A mitigation method informed by KonTest's test suite reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
- Aldeida Aleti. 2023. Software testing of generative AI systems: Challenges and opportunities. arXiv preprint arXiv:2309.03554.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- The reversal curse: LLMs trained on "A is B" fail to learn "B is A".
- Chart2000. Top 200 artists of 2000 to 2023. https://chart2000.com/artists.htm.
- MapGPT: Map-guided prompting for unified vision-and-language navigation. arXiv preprint arXiv:2401.07314.
- Google Cloud. 2023. Vertex AI API.
- Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 423–435.
- Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533.
- Turbulence: Systematically and automatically testing instruction-tuned large language models for code. arXiv preprint arXiv:2312.14856.
- Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620.
- Bias assessment and mitigation in llm-based code generation. arXiv preprint arXiv:2309.14345.
- Composite backdoor attacks against large language models. arXiv preprint arXiv:2310.07676.
- Adaptive intellect unleashed: The feasibility of knowledge transfer in large language models. arXiv preprint arXiv:2308.04788.
- IMF. GDP per capita, current prices (U.S. dollars per capita). http://imf.org/external/datamapper/NGDPDPC@WEO/OEMDC/ADVEC/WEOWORLD?year=2023/.
- Zhenlong Li and Huan Ning. 2023. Autonomous GIS: the next-generation AI-powered GIS. arXiv preprint arXiv:2305.06453.
- Mapbox. MapGPT: Deliver natural conversations with a location-intelligent AI assistant. https://www.mapbox.com/mapgpt.
- Beyond accuracy: Evaluating self-consistency of code large language models with IdentityChain. arXiv preprint arXiv:2310.14053.
- Nomic AI. 2023. Falcon 7B model.
- OpenAI. 2023. GPT3.5 models.
- LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation. arXiv preprint arXiv:2308.02828.
- Evaluating and explaining large language models for code using syntactic structures. arXiv preprint arXiv:2308.03873.
- Unifying large language models and knowledge graphs: A roadmap. arXiv preprint arXiv:2306.08302.
- Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164.
- Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement. arXiv preprint arXiv:2310.08559.
- GPT4GEO: how a language model sees the world’s geography.
- Benchmarking causal study to interpret large language models for source code. arXiv preprint arXiv:2308.12415.
- Breaking the silence: the threats of using LLMs in software engineering. arXiv preprint arXiv:2312.08055.
- Beyond memorization: Violating privacy via inference with large language models. arXiv preprint arXiv:2310.07298.
- LLaMA: Open and efficient foundation language models.
- Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.
- Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477.
- Beyond testers’ biases: Guiding model testing with knowledge bases using llms. arXiv preprint arXiv:2310.09668.
- What do code models memorize? an empirical study on large language models of code. arXiv preprint arXiv:2308.09932.
- L3MVN: Leveraging large language models for visual target navigation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
- Towards an understanding of large language models in software engineering tasks. arXiv preprint arXiv:2308.11396.
- PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528.