Do Users Write More Insecure Code with AI Assistants?
Abstract: We conduct the first large-scale user study examining how users interact with an AI code assistant to solve a variety of security-related tasks across different programming languages. Overall, we find that participants who had access to an AI assistant based on OpenAI's codex-davinci-002 model wrote significantly less secure code than those without access. Additionally, participants with access to an AI assistant were more likely to believe they wrote secure code than those without access. Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g., re-phrasing, adjusting temperature) produced code with fewer security vulnerabilities. Finally, to better inform the design of future AI-based code assistants, we provide an in-depth analysis of participants' language and interaction behavior, and we release our user interface as an instrument for conducting similar studies in the future.
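For context, the interaction the abstract describes (writing a natural-language prompt and adjusting the sampling temperature) corresponds to a standard completion request against a Codex-family model. The sketch below is illustrative only, assuming the study's assistant wrapped OpenAI's legacy completions endpoint; the model identifier (the public API name for codex-davinci-002 was "code-davinci-002"), the prompt text, and the parameter values are assumptions, not details taken from the paper.

```python
# Minimal sketch (not from the paper): one prompt/temperature query of the
# kind a participant could issue to a Codex-based assistant.
# Uses the legacy openai-python (<1.0) Completion API that served Codex models.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="code-davinci-002",  # public API id for codex-davinci-002 (assumed)
    prompt="# Python 3\n# Write a function that encrypts a string with AES-GCM\n",
    max_tokens=256,
    temperature=0.2,  # lower values yield more deterministic completions;
                      # the abstract notes participants adjusted this knob
    stop=["\n\n"],    # stop sequence is an illustrative choice
)
print(response.choices[0].text)
```

Lowering the temperature and re-phrasing the prompt are exactly the kinds of engagement the abstract associates with fewer security vulnerabilities in the resulting code.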