- The paper demonstrates that combining RAG with a MoA framework significantly reduces false positives in vulnerability detection by grounding LLM analyses in external knowledge.
- The methodology integrates prompt engineering with iterative agent collaboration to refine code assessments and mitigate common LLM pitfalls like hallucinations.
- Experimental results on the Vuldroid application confirm the approach’s effectiveness by validating true vulnerabilities while filtering out erroneous alerts.
This paper, "LLMpatronous: Harnessing the Power of LLMs For Vulnerability Detection" (2504.18423), explores a novel approach to automating software vulnerability detection by leveraging LLMs while mitigating their inherent weaknesses like hallucinations and knowledge limitations. The core idea is to combine Retrieval-Augmented Generation (RAG) with a Mixture-of-Agents (MoA) architecture to create a more reliable and accurate analysis system compared to traditional tools or basic LLM prompting.
Traditional vulnerability detection methods like Static Application Security Testing (SAST) and Dynamic Application Security Testing (DAST) often suffer from high false positive rates and limited understanding of complex code semantics or evolving vulnerability patterns (2504.18423). While LLMs show promise due to their code comprehension abilities, their application to security tasks is hampered by potential hallucinations (generating incorrect findings) and knowledge cutoffs (lacking information on recent vulnerabilities) (2504.18423). Existing LLM-based approaches, especially those relying on simple prompts, have proven insufficient for reliable vulnerability detection (2504.18423).
The proposed LLMpatronous system integrates three key components: RAG, Prompt Engineering, and MoA.
- Retrieval-Augmented Generation (RAG): To overcome LLMs' knowledge limitations and ensure analyses are based on accurate, up-to-date information, RAG is used. Before analyzing a code snippet for a specific vulnerability, the system queries an external knowledge base (like a vector database) containing detailed information on known vulnerabilities (CWEs, CVEs), code examples, and mitigation techniques (2504.18423). The retrieved relevant context is then provided to the LLM, enabling it to perform an "open-book" analysis grounded in factual knowledge (2504.18423).
- Implementation Detail: This involves indexing vulnerability information into a vector database (e.g., Pinecone is mentioned in the paper) and performing similarity searches based on the vulnerability type being checked and potentially code characteristics. The retrieved text segments are then included in the prompt to the LLM.
- Prompt Engineering: Carefully crafted prompts are used to guide the LLM agents on their specific task: analyzing the provided code snippet, vulnerability information, and RAG context to determine the presence and nature of a potential vulnerability (2504.18423). Prompts define the expected output format and reasoning steps.
- Mixture-of-Agents (MoA): This architecture employs multiple LLMs collaboratively to analyze the inputs and refine the findings (2504.18423). In the described setup, agents process information iteratively: a first agent generates an initial assessment, and subsequent agents take the original inputs plus the preceding agent's output to refine the analysis, correct errors, or provide further validation (2504.18423). This iterative refinement acts as a collaborative verification process, significantly reducing the impact of hallucinations and improving the reliability of the final output (2504.18423).
- Implementation Detail: This can be implemented as a sequence of API calls to different LLMs (open-source like Llama 3.1, Qwen2, or closed-source like GPT-4o, Gemini 1.5 Pro) where the output of one call is fed into the prompt for the next call.
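The RAG step described above can be sketched as follows. This is a minimal illustration only: a toy bag-of-words similarity stands in for real embeddings and the Pinecone vector database mentioned in the paper, and the knowledge-base entries, `retrieve_context`, and `build_prompt` names are hypothetical, not the paper's code.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. A real system would use a learned
    embedding model and a managed vector database (the paper mentions Pinecone)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical knowledge-base entries (CWE descriptions, mitigations).
KNOWLEDGE_BASE = [
    "CWE-79 cross-site scripting in WebView: untrusted input rendered as HTML or JavaScript",
    "CWE-798 hardcoded credentials: secrets embedded directly in source code",
    "CWE-927 intent sniffing: implicit broadcasts leak data to other Android apps",
]

def retrieve_context(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k knowledge-base entries most similar to the query (the RAG step)."""
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:top_k]

def build_prompt(code: str, vuln_type: str) -> str:
    """Assemble the 'open-book' prompt: retrieved context + vulnerability type + code."""
    context = "\n".join(retrieve_context(vuln_type))
    return (
        f"Known vulnerability context:\n{context}\n\n"
        f"Check the following code for {vuln_type}:\n{code}\n"
        "Answer 'vulnerable' or 'not vulnerable' and justify your answer."
    )
```

In a production setup the similarity search would run against precomputed embeddings of the full CWE/CVE corpus, but the shape of the flow is the same: retrieve, then prepend the retrieved text to the analysis prompt.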
The overall workflow for LLMpatronous is:
Code Input → Select Code Snippet & Vulnerability Type → RAG Query → Retrieve & Synthesize Context → Prepare Prompt (Code, Vulnerability Info, RAG Context) → MoA Pipeline (Agent 1 → Agent 2 → ... → Final Agent) → Aggregated/Final Assessment (Vulnerability Present/Absent, Details).
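The iterative hand-off at the heart of the MoA pipeline can be sketched as below. The agent stubs are hypothetical placeholders for real API calls to the models named above (e.g. Llama 3.1, Qwen2, GPT-4o, Gemini 1.5 Pro), and `moa_pipeline` is an assumed name for illustration, not the paper's implementation.

```python
from typing import Callable

# An "agent" is any callable mapping (task prompt, previous assessment) to a
# refined assessment. In practice each would be an API call to a different LLM.
Agent = Callable[[str, str], str]

def moa_pipeline(task_prompt: str, agents: list[Agent]) -> str:
    """Iterative MoA refinement as described in the paper: each agent sees the
    original inputs plus the preceding agent's output."""
    assessment = ""  # the first agent has no prior assessment to refine
    for agent in agents:
        assessment = agent(task_prompt, assessment)
    return assessment

# --- Hypothetical stub agents for illustration ---
def proposer(prompt: str, prev: str) -> str:
    # Stands in for a model producing an initial finding from code + RAG context.
    return "vulnerable: possible XSS in WebView (initial assessment)"

def verifier(prompt: str, prev: str) -> str:
    # Stands in for a second model re-checking the prior finding against the
    # retrieved vulnerability context to filter hallucinated positives.
    return prev.replace("(initial assessment)", "(verified against CWE-79 context)")

final = moa_pipeline("Check BlogsViewer.java for WebView XSS ...", [proposer, verifier])
```

Because each agent receives both the original prompt and the previous output, a later agent can overturn an earlier false positive rather than merely echo it, which is the mechanism the paper credits for the reduced false-positive rate.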
To evaluate this approach, the researchers used Vuldroid, a deliberately vulnerable Android application (2504.18423). Experiments were conducted using various LLMs, including GPT-4o, Claude-3-Haiku, Gemini 1.5 Pro/Flash, Qwen2-72B, Llama-3.1-70B/405B, and DBRX-Instruct (2504.18423).
The experiments demonstrated the benefits of the RAG+MoA approach:
- Experiment 1 (Basic Prompting, Focused List): Using a single LLM (Gemini 1.5 Pro) with a predefined list of Vuldroid vulnerabilities showed limited success, missing several known issues and providing somewhat inconsistent output (2504.18423).
- Experiment 2 (Basic Prompting, Expanded List): Expanding the list of potential vulnerabilities for the single LLM helped identify some additional issues but also introduced likely false positives, highlighting the risk of hallucination when scanning a broader range (2504.18423).
- Experiment 3 (RAG + MoA): Applying the RAG+MoA workflow to verify the findings from Experiment 2 successfully filtered out the likely false positive ("Insecure Design") while confirming the true positives (2504.18423). This validates the hypothesis that collaborative analysis grounded in external knowledge can effectively reduce false positives.
| Vulnerability Candidate | File(s) | Verified by RAG+MoA |
| --- | --- | --- |
| Webview XSS via DeepLink | BlogsViewer.java | True |
| Webview XSS via Exported Activity | YoutubeViewer.java | True |
| Steal Files via WebView (XHR) | YoutubeViewer.java, NotesViewer.java | True |
| Steal Password Reset Tokens | ForgetPassword.java | True |
| Reading User Email via Broadcasts | EmailViewer.java, MyReceiver.java | True |
| Intent Sniffing | SendMsgtoApp.java | True |
| Insecure Activity Handling | RoutingActivity.java | True |
| Hardcoded Credentials | Login.java | True |
| Insecure Design | NotesViewer.java | False |
| Insecure Input Validation | NotesViewer.java | True |
Table adapted from the paper's results showing verification using RAG+MoA.
The paper concludes that LLMpatronous offers a more robust and reliable method for vulnerability detection than basic LLM prompting by effectively addressing knowledge gaps and hallucination-induced false positives through the combination of RAG and MoA (2504.18423). Using open-source models within the MoA framework was also shown to be viable (2504.18423).
However, practical implementation faces challenges. The MoA architecture increases computational cost and latency compared to single-model approaches (2504.18423). Effectiveness also depends on maintaining a high-quality, up-to-date RAG knowledge base (2504.18423). The system still exhibited false negatives, missing some known vulnerabilities (2504.18423). Scaling to large, complex codebases and generalizing to other languages will require further work (2504.18423).
Future work suggestions include optimizing MoA efficiency (e.g., parallelization, specialized agents), enhancing the RAG knowledge base, refining prompting and analysis techniques to reduce false negatives, potentially fine-tuning specialized models, improving MoA aggregation methods, and evaluating the system on broader datasets and integrating it into DevSecOps workflows (2504.18423).