Knowledge-based Consistency Testing of Large Language Models

Published 3 Jul 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2407.12830v2

Abstract: In this work, we systematically expose and measure the inconsistency and knowledge gaps of LLMs. Specifically, we propose an automated testing framework (called KonTest) that leverages a knowledge graph to construct test cases. KonTest probes and measures the inconsistencies in the LLM's knowledge of the world via a combination of semantically equivalent queries and test oracles (metamorphic or ontological oracles). KonTest further mitigates knowledge gaps via a weighted LLM ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that 19.2% of KonTest's test inputs are error-inducing (1917 errors from 9979 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A mitigation method informed by KonTest's test suite reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
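To make the workflow described in the abstract concrete, the sketch below shows how knowledge-graph triples could drive the two oracles the paper names: semantically equivalent phrasings of the same fact are sent to an LLM, then the answers are checked for mutual consistency (metamorphic oracle) and for agreement with the graph fact (ontological oracle). This is a minimal illustrative sketch, not the paper's actual KonTest implementation; `query_llm`, `generate_variants`, the question templates, and the triple format are hypothetical stand-ins.

```python
# Minimal sketch of knowledge-graph-driven consistency testing.
# Hypothetical stand-in code, not the paper's KonTest implementation.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM under test (e.g. Falcon, Gemini, GPT3.5, Llama2)."""
    raise NotImplementedError

def generate_variants(subject: str, relation: str) -> list[str]:
    """Semantically equivalent phrasings of the same factual question."""
    return [
        f"What is the {relation} of {subject}?",
        f"{subject}'s {relation} is what?",
        f"Name the {relation} of {subject}.",
    ]

def test_triple(subject: str, relation: str, expected: str) -> dict:
    """Run metamorphic and ontological oracles on one knowledge-graph triple."""
    answers = [query_llm(p) for p in generate_variants(subject, relation)]
    normalized = [a.strip().lower() for a in answers]
    return {
        # Metamorphic oracle: equivalent queries should yield the same answer.
        "consistent": len(set(normalized)) == 1,
        # Ontological oracle: answers should contain the knowledge-graph fact.
        "correct": all(expected.lower() in a for a in normalized),
        "answers": answers,
    }

# Example with a Wikidata-style triple (France, capital, Paris):
# report = test_triple("France", "capital", "Paris")
```

The paper's mitigation step, a weighted LLM ensemble, would sit on top of such per-model reports by combining answers from multiple models; that part is not sketched here.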
