Contextualizing Sink Knowledge for Java Vulnerability Discovery

Published 2 Apr 2026 in cs.CR | (2604.01645v2)

Abstract: Java applications are prone to vulnerabilities stemming from the insecure use of security-sensitive APIs, such as file operations enabling path traversal or deserialization routines allowing remote code execution. These sink APIs encode critical information for vulnerability discovery: the program-specific constraints required to reach them and the exploitation conditions necessary to trigger security flaws. Despite this, existing fuzzers largely overlook such vulnerability-specific knowledge, limiting their effectiveness. We present GONDAR, a sink-centric fuzzing framework that systematically leverages sink API semantics for targeted vulnerability discovery. GONDAR first identifies reachable and exploitable sink call sites through CWE-specific scanning combined with LLM-assisted static filtering. It then deploys two specialized agents that work collaboratively with a coverage-guided fuzzer: an exploration agent generates inputs to reach target call sites by iteratively solving path constraints, while an exploitation agent synthesizes proof-of-concept exploits by reasoning about and satisfying vulnerability-triggering conditions. The agents and fuzzer continuously exchange seeds and runtime feedback, complementing each other. We evaluated GONDAR on real-world Java benchmarks, where it discovers four times more vulnerabilities than Jazzer, the state-of-the-art Java fuzzer. Notably, GONDAR also demonstrated strong performance in the DARPA AI Cyber Challenge, and is integrated into OSS-CRS, a sandbox project in The Linux Foundation's OpenSSF, to improve the security of open-source software.


Authors (6)

Summary

The paper introduces GONDAR, a sink-centric fuzzing framework that integrates LLM-assisted static/dynamic analysis to advance Java vulnerability discovery.
It decomposes the process into sink reachability and exploitability phases using collaborative exploration and exploitation agents.
Empirical results highlight up to a 4–5× improvement in vulnerability detection, with broader CWE coverage compared to baseline tools.

Contextualizing Sink Knowledge for Java Vulnerability Discovery: An Expert Analysis

Problem Space and Motivation

The security posture of Java applications is critical, given their prevalence in enterprise infrastructures and the demonstrated impact of recent high-profile vulnerabilities. Many such vulnerabilities result from the insecure invocation of security-sensitive APIs ("sinks"), where untrusted input may trigger unsafe program behavior. However, dynamic vulnerability discovery tools for Java, especially coverage-guided fuzzers such as Jazzer, do not systematically use semantic and contextual knowledge about these sinks, which leaves large classes of vulnerabilities underexplored and unexploited.

The core challenge lies in decomposing the vulnerability discovery process into two orthogonal problems: sink reachability (finding input paths that exercise the sink) and sink exploitability (synthesizing inputs that satisfy nontrivial sanitization or precondition checks imposed at or before the sink). Previous general fuzzing and even sink-aware approaches predominantly focus on structural path exploration, with little automated reasoning about semantic exploitation constraints. As a result, the "last mile" of exploitation remains a major bottleneck in automated discovery: coverage-guided tools frequently reach complex vulnerability sites but fail to exploit them.
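The distinction between the two problems can be illustrated with a small, hypothetical path-traversal example. The class, the `DOC1` header check, and the `/srv/data` base directory below are assumptions made for illustration and do not come from the paper's benchmark. The header check stands in for a reachability constraint, and the incomplete `..` check stands in for an exploitability constraint.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical handler used only to illustrate reachability vs. exploitability.
class SinkReachabilitySketch {
    static final Path BASE = Paths.get("/srv/data");

    // Returns the path that would be handed to a file-read sink
    // (e.g. Files.readAllBytes), or null if the input is rejected earlier.
    static Path resolve(String magic, String name) {
        // Reachability constraint: inputs without the expected header
        // never get near the sink call site.
        if (!"DOC1".equals(magic)) {
            return null;
        }
        // Incomplete sanitization: only a leading ".." is rejected.
        if (name.startsWith("..")) {
            return null;
        }
        // The sink call site would consume this path.
        return BASE.resolve(name).normalize();
    }

    // True when the resolved path leaves the intended base directory.
    static boolean escapesBase(Path p) {
        return p != null && !p.startsWith(BASE);
    }
}
```

In this sketch, an input such as `("DOC1", "a.txt")` reaches the sink without exploiting it, while `("DOC1", "sub/../../etc/passwd")` both reaches the sink and escapes the base directory. A fuzzer that only tracks coverage sees no difference between the two, since both take the same branches.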

GONDAR Architecture and Approach

This work introduces GONDAR, a generative, sink-centric, semi-automated fuzzing framework that systematically integrates sink knowledge via LLM-assisted static/dynamic analysis, collaborative agents, and coverage-guided fuzzing. It is designed for cross-CWE applicability and compatibility with industrial-grade fuzzing pipelines (e.g., OSS-Fuzz).

Sink Detection

GONDAR's sink detection uses an extensible mapping from CWE identifiers to sink APIs, underpinned by CodeQL's database. To maximize recall, conservative extraction logic collects all CWE-relevant call sites rather than relying on restrictive taint analysis. Multi-stage filtering (constant-argument elimination, test code exclusion, call graph reachability, and LLM-powered exploitability assessment) then reduces noise and confines the search to actionable sink sites while preserving high recall.
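The filtering stages described above can be sketched as a chain of predicates over candidate call sites. This is a minimal illustration, not GONDAR's implementation: the `CallSite` fields, the two mapping entries, the `/src/test/` path heuristic, and the `llmSaysExploitable` predicate (a stand-in for the LLM assessment) are all assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class SinkFilterSketch {
    // Hypothetical description of one candidate call site found by the scanner.
    static final class CallSite {
        final String cwe;
        final String api;
        final String file;
        final boolean constantArgs; // all arguments are compile-time constants
        final boolean reachable;    // reachable from a harness in the call graph

        CallSite(String cwe, String api, String file,
                 boolean constantArgs, boolean reachable) {
            this.cwe = cwe;
            this.api = api;
            this.file = file;
            this.constantArgs = constantArgs;
            this.reachable = reachable;
        }
    }

    // Illustrative CWE-to-sink mapping; the entries are examples only.
    static final Map<String, List<String>> CWE_TO_SINKS = Map.of(
        "CWE-22", List.of("java.nio.file.Files.readAllBytes"),
        "CWE-502", List.of("java.io.ObjectInputStream.readObject"));

    static List<CallSite> filter(List<CallSite> candidates,
                                 Predicate<CallSite> llmSaysExploitable) {
        return candidates.stream()
            // Keep only call sites whose API is a known sink for their CWE.
            .filter(c -> CWE_TO_SINKS.getOrDefault(c.cwe, List.of()).contains(c.api))
            // Constant-argument elimination: attacker input cannot influence these.
            .filter(c -> !c.constantArgs)
            // Test code exclusion (assumed path convention).
            .filter(c -> !c.file.contains("/src/test/"))
            // Call graph reachability.
            .filter(c -> c.reachable)
            // Final stage: stand-in for the LLM exploitability assessment.
            .filter(llmSaysExploitable)
            .collect(Collectors.toList());
    }
}
```

Ordering the cheap syntactic checks before the LLM stage keeps the number of model queries small, which is one plausible reason to structure the filter this way.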

Exploration and Exploitation Agents

The engine decomposes vulnerability discovery into two LLM-empowered agent phases:

Exploration Agent: Given static code context (notably, call paths to each sink), the agent synthesizes semantically valid inputs that exercise hard-to-reach sites, addressing path constraints and complex validation through code reasoning. Unlike coverage-guided mutation, this enables overcoming cryptographic checks, stringent input format requirements, and sophisticated branching predicates.
Exploitation Agent: For each "beep seed" (input reaching a sink), dynamic execution traces, sanitizer feedback, and surrounding code context are analyzed. The agent attempts iterative proof-of-concept exploit synthesis, with feedback-driven refinement. This is further supported by a value-profile-only fuzzing phase, using agent-generated near-miss exploits as seeds.

Agents communicate bi-directionally with fuzzing: generated seeds rapidly enrich fuzzer corpora, while runtime-derived beep seeds form the starting point for exploitation.
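The feedback-driven refinement performed by the exploitation agent can be sketched as a bounded loop. The `ExploitAgent` and `Harness` interfaces, the `"CRASH"` marker, and the round limit are hypothetical names chosen for this illustration; in the real system the agent is an LLM and the feedback comes from execution traces and sanitizers.

```java
import java.util.List;
import java.util.Optional;

class ExploitLoopSketch {
    // Stand-in for the LLM-backed agent: proposes a new input from feedback.
    interface ExploitAgent {
        String refine(String input, String feedback);
    }

    // Stand-in for the instrumented target: returns "CRASH" when a sanitizer
    // fires, otherwise a textual description of what happened at the sink.
    interface Harness {
        String run(String input);
    }

    // Starts from a beep seed (an input that already reaches the sink) and
    // refines it until a sanitizer fires or the round budget is exhausted.
    // Failed candidates are recorded as near misses so a fuzzer can reuse them.
    static Optional<String> exploit(String beepSeed, ExploitAgent agent,
                                    Harness harness, int maxRounds,
                                    List<String> nearMisses) {
        String candidate = beepSeed;
        for (int round = 0; round < maxRounds; round++) {
            String feedback = harness.run(candidate);
            if ("CRASH".equals(feedback)) {
                return Optional.of(candidate); // proof-of-concept found
            }
            nearMisses.add(candidate);
            candidate = agent.refine(candidate, feedback);
        }
        return Optional.empty();
    }
}
```

The `nearMisses` list corresponds to the seeds handed back to the fuzzer: even when the agent does not produce a working exploit, its attempts may sit close enough to one for value-profile-guided mutation to finish the job.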

Empirical Results

GONDAR is systematically benchmarked against the state-of-the-art on a new dataset comprising 54 vulnerabilities across 22 Java projects (12 CWE classes). Key findings:

Vulnerability Discovery: In the primary configuration (GONDAR with the flagship LLM, Gemini 2.5-Pro), 41 vulnerabilities are exploited and 46 reached, compared with 8 and 26, respectively, for the baseline Jazzer. That is roughly 5× as many exploited vulnerabilities (41/8 ≈ 5.1) and nearly 2× as many reached (46/26 ≈ 1.8).
Ablation Analysis: Disabling the exploration or exploitation agent reduces coverage and effectiveness substantially (29 reached/18 exploited or 42 reached/18 exploited, respectively). The synergy between fuzzer and agents is significant, with seven findings only possible via their collaboration.
Categorical Coverage: GONDAR achieves higher coverage across all evaluated CWE groups. Notably, it excels at vulnerabilities that require complex, structured, or stateful inputs, where mutation-based fuzzers plateau.
Cost Analysis: Despite the substantial cost of LLM queries in flagship configurations ($3K range), GONDAR remains more cost-effective and scales to higher effectiveness than large-scale resource-intensive fuzzing, highlighting the value of semantic guidance over brute-force resource scaling.
Generalizability and Open-Source Integration: The framework’s open architecture supports extension to further CWEs and languages, and integration with OpenSSF's OSS-CRS system confirms utility for continuous, automated open-source security testing. Real-world utility is underscored by its performance in the DARPA AI Cyber Challenge.

Theoretical and Practical Implications

This work substantiates multiple critical points for the design of modern security testing frameworks:

Semantic Reasoning Is Essential: Mere structural path coverage is inadequate for complex vulnerability discovery; semantic contextualization of sink knowledge, both statically and via dynamic feedback, is mandatory to bridge the reachability-exploitability gap.
LLM Integration: LLMs, when given program context and properly grounded, enable automated agents to traverse complex program logic and craft tailored exploits, making dynamic fuzzing far more effective and generalizable.
Scalability With Generality: The system's design allows straightforward extension to new CWEs by engineering lightweight query scripts and descriptive artifacts, shifting the bottleneck from tool development to explicit sink knowledge curation.

Limitations and Prospects

While GONDAR demonstrates significant advances, its current limitations stem from (1) the intrinsic challenges of generating highly complex or deeply stateful input formats (e.g., nested serialization, non-trivial binary archives), (2) incomplete static analysis for reflective/dynamic calls, and (3) agent weaknesses in some edge CWE contexts (e.g., advanced deserialization chains). These are not fundamental to the architecture, and advances in LLMs, specialized CWE agents, or hybrid analysis techniques can further narrow the gap between reached and exploited vulnerabilities. Additionally, extension to languages beyond Java will require nontrivial engineering, especially around static analysis and debugger integration.

Future Directions

Future work should explore:

Model Ensemble/Composition: Combining multiple LLMs with complementary strengths, or integrating with domain-specific symbolic or concolic engines, to expand coverage and robustness.
CWE- or Project-Specific Tuning: Agent augmentation with custom exploit generators (e.g., deserialization gadget mining, ReDoS exploiters) for hard vulnerability classes.
Automated Harness Generation: Integrating with LLM-powered driver/harness synthesis pipelines to further reduce human effort and increase codebase coverage.
Continuous Integration and Real-Time Protection: Deploying in CI/CD pipelines for ongoing, incremental analysis and integrating feedback with maintainers for direct remediation.

Conclusion

GONDAR establishes the necessity and efficacy of systematically contextualized sink knowledge for automated Java vulnerability discovery. Its collaborative, agent-fuzzer architecture leveraging LLMs marks a decisive shift toward semantically aware dynamic analysis, yielding substantial practical and theoretical improvements. The demonstrated scalability, strong numerical results (up to 4–5× improvement), and industrial validation position GONDAR’s paradigm as a blueprint for both research and production-grade secure software development.
