Guiding Symbolic Execution with Static Analysis and LLMs for Vulnerability Discovery

Published 7 Apr 2026 in cs.CR and cs.SE | (2604.06506v1)

Abstract: Symbolic execution detects vulnerabilities with precision, but applying it to large codebases requires harnesses that set up symbolic state, model dependencies, and specify assertions. Writing these harnesses has traditionally been a manual process requiring expert knowledge, which significantly limits the scalability of the technique. We present Static Analysis Informed and LLM-Orchestrated Symbolic Execution (SAILOR), which automates symbolic execution harness construction by combining static analysis with LLM-based synthesis. SAILOR operates in three phases: (1) static analysis identifies candidate vulnerable locations and generates vulnerability specifications; (2) an LLM uses vulnerability specifications and orchestrates harness synthesis by iteratively refining drivers, stubs, and assertions against compiler and symbolic execution feedback; symbolic execution then detects vulnerabilities using the generated harness, and (3) concrete replay validates the symbolic execution results against the unmodified project source. This design combines the scalability of static analysis, the code reasoning of LLMs, the path precision of symbolic execution, and the ground truth produced by concrete execution. We evaluate SAILOR on 10 open-source C/C++ projects totaling 6.8 M lines of code. SAILOR discovers 379 distinct, previously unknown memory-safety vulnerabilities (421 confirmed crashes). The strongest of five baselines we compare SAILOR to (agentic vulnerability detection using Claude Code with full codebase access and unlimited interaction), finds only 12 vulnerabilities. Each phase of SAILOR is critical: Without static analysis targeting confirmed vulnerabilities drop 12.2X; without iterative LLM synthesis zero vulnerabilities are confirmed; and without symbolic execution no approach can detect more than 12 vulnerabilities.

Summary

  • The paper introduces SAILOR, an automated pipeline that leverages static analysis, LLM-driven harness synthesis, and symbolic execution to identify vulnerabilities.
  • It employs CodeQL and SARIF outputs to accurately filter potential targets and iteratively refines execution harnesses using compiler diagnostics and feedback from symbolic execution.
  • SAILOR validates vulnerabilities with AddressSanitizer, uncovering 379 unique vulnerabilities across 10 projects, thereby enhancing detection scalability and precision.

Introduction

The paper presents SAILOR, an approach that automates the construction of symbolic execution harnesses so that symbolic execution can scale to large codebases. It integrates static analysis (SA), LLM-based synthesis, and symbolic execution (SE) into a single pipeline in which vulnerability specifications guide execution. This addresses the traditional bottleneck of symbolic execution: harnesses are written by hand, which requires expert knowledge and limits applicability to large projects.

Methodology

SAILOR operates in three distinct phases:

  1. Static Analysis Informed Target Generation: This phase employs static analysis to identify potential vulnerabilities, generating specifications that include detailed vulnerability descriptions and entry point selections. It scans the code with CodeQL, enriches the SARIF-formatted findings, and filters out irrelevant paths to leave a concise set of actionable targets.
  2. LLM-Orchestrated Symbolic Execution: Leveraging LLMs, this phase synthesizes execution harnesses iteratively, refining drivers, stubs, and assertions against compiler diagnostics and symbolic execution outcomes. Using KLEE, the SE component seeks to validate or refute the vulnerability based on the synthesized harnesses. The orchestration is guided by a mix of exploratory and authoring interactions with the source code, enabling dynamic harness adjustments.
  3. Concrete Validation: Witness inputs from SE are replayed against the original source code using AddressSanitizer to confirm vulnerabilities, ensuring precision by removing false positives from unrealistic harnesses generated in preceding phases.
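
As a rough illustration of the first phase, the sketch below reads SARIF results and drops findings in paths that are unlikely to be worth targeting. The field names (`runs`, `results`, `ruleId`, `physicalLocation`, `artifactLocation.uri`, `region.startLine`) come from the SARIF 2.1.0 format. The `Finding` type, the skip patterns, the rule identifiers, and the sample data are assumptions made for the example and are not taken from the paper.

```python
import fnmatch
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule_id: str
    message: str
    file: str
    line: int


# Hypothetical skip patterns; the filters SAILOR actually uses are not listed here.
SKIP_PATTERNS = ("tests/*", "*/tests/*", "examples/*", "fuzz/*")


def load_findings(sarif_text: str) -> list[Finding]:
    """Return one Finding per SARIF result location that names a file and line."""
    findings = []
    for run in json.loads(sarif_text).get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                phys = loc.get("physicalLocation", {})
                uri = phys.get("artifactLocation", {}).get("uri")
                line = phys.get("region", {}).get("startLine")
                if uri is None or line is None:
                    continue  # no usable source position
                findings.append(Finding(
                    rule_id=result.get("ruleId", "unknown"),
                    message=result.get("message", {}).get("text", ""),
                    file=uri,
                    line=line,
                ))
    return findings


def keep(finding: Finding) -> bool:
    """Keep a finding unless its path matches one of the skip patterns."""
    return not any(fnmatch.fnmatchcase(finding.file, p) for p in SKIP_PATTERNS)


def _result(rule_id: str, text: str, uri: str, line: int) -> dict:
    """Build one SARIF result object for the sample input."""
    return {
        "ruleId": rule_id,
        "message": {"text": text},
        "locations": [{"physicalLocation": {
            "artifactLocation": {"uri": uri},
            "region": {"startLine": line},
        }}],
    }


# Illustrative input: one finding in library code, one in a test file.
SAMPLE_SARIF = json.dumps({"version": "2.1.0", "runs": [{"results": [
    _result("example/unbounded-copy", "copy without size check", "src/parse.c", 42),
    _result("example/unbounded-copy", "copy without size check", "tests/test_parse.c", 7),
]}]})

targets = [f for f in load_findings(SAMPLE_SARIF) if keep(f)]
```

With the sample input, only the finding in `src/parse.c` survives the filter. A real pipeline would also attach the data-flow trace and a candidate entry point to each target, which this sketch leaves out.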

Results

Evaluated on ten open-source C/C++ projects totaling 6.8M lines of code, SAILOR identified 379 unique vulnerabilities (421 confirmed crashes), far more than any of the compared baselines. The vulnerabilities SAILOR detected span several types, including heap-buffer-overflow, use-after-free, and stack-buffer-overflow.

SAILOR was compared against five baselines, the strongest being agentic vulnerability detection using Claude Code, which found only 12 vulnerabilities. Each component proved critical: without static analysis targeting, confirmed vulnerabilities drop 12.2x; without iterative LLM synthesis, none are confirmed; and without symbolic execution, no approach detects more than 12. The paper underscores that LLMs alone, or symbolic execution without orchestrated harnesses, fall short because of high false-positive rates and compilation failures in generated harnesses.

Implications and Future Work

SAILOR's comprehensive and automated approach to symbolic execution opens avenues for scaling vulnerability discovery across increasingly complex software projects. The separation of analytical phases allows future adaptation to alternative static analysis tools, LLMs, or symbolic execution frameworks, potentially enhancing detection capabilities and efficiency.

Given the success noted in SAILOR's application to large projects, future research could explore extending the model to different programming languages and integrating additional vulnerability detection heuristics. Continued refinement of LLM capabilities for harness generation and adjustments in SE to handle API complexities could further bolster this approach's applicability and effectiveness.

Conclusion

SAILOR demonstrates an effective integration of static analysis, LLMs, and symbolic execution for automated vulnerability discovery. By removing manual harness creation, the pipeline reaches a combination of scalability and precision that none of the evaluated baselines matched.

Explain it Like I'm 14

Guiding Symbolic Execution with Static Analysis and LLMs for Vulnerability Discovery — explained simply

What is this paper about?

This paper is about a new way to automatically find serious security bugs in huge C/C++ programs. The authors built a system called SAILOR that combines three strengths:

  • Static analysis (fast scanners that spot suspicious code patterns)
  • LLMs that can write and fix code
  • Symbolic execution (a precise way to “try all possible paths” in code and prove a bug is real)

The big idea: let static analysis point to likely danger spots, let an LLM build a special test setup to reach those spots, use symbolic execution to prove a bug exists, and finally run the real program to double-check the crash is real.

What questions did the researchers want to answer?

They focused on three simple questions:

  • How can we automatically build the tricky “test harnesses” needed to precisely reach possible bugs inside big codebases?
  • If we combine static analysis, LLMs, and symbolic execution, can we find more real bugs than other methods?
  • Can we confirm each bug with a real crash so we avoid false alarms?

How did they do it? (Approach in everyday language)

Think of SAILOR as a well-coordinated team doing a treasure hunt in a giant maze of code:

  1. Static analysis: the map reader. It scans the code without running it (like a spell-checker for code) and marks “suspicious spots.” For example, it might flag a line that copies data without checking the size, which can cause a buffer overflow. Each flagged spot becomes a “vulnerability specification” describing:
    • Where the suspicious line is
    • What type of bug it might be (like “buffer overflow”)
    • Helpful hints (nearby function calls, size variables, and which function might be a good entry point)
  2. LLM-orchestrated symbolic execution: the builder and explorer. To actually reach the suspicious line, you need a custom test setup called a harness. This is like building a small room around the danger spot, so you can safely test it. A harness has three parts:
    • Driver: a tiny program that sets up inputs and calls the right function so symbolic execution can explore the path to the suspected bug.
    • Stubs: fake stand-ins for parts of the program that aren’t important for the bug (like using a dummy door instead of building a whole house around it). This keeps the test focused and fast.
    • Assertions: simple safety rules, like “the number of bytes to copy must be smaller than the size of the destination buffer.” Symbolic execution then searches for inputs that break these rules.

The LLM writes this harness. If the code doesn’t compile or the symbolic execution can’t reach the spot, the system reports what went wrong, and the LLM fixes the harness in small steps. Symbolic execution then tries paths through the code (like a choose-your-own-adventure that uses algebra) to see if the bug can really happen and produces exact input values that trigger it.

  3. Concrete validation: the referee. Finally, the system runs the real, unmodified program with the exact inputs found by symbolic execution, using a tool called AddressSanitizer that catches memory errors at runtime. If the real program crashes at the right place, the bug is confirmed.
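
The fix-and-retry loop in the LLM-orchestrated step can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's orchestrator: `refine_harness`, `Feedback`, and the three callbacks are hypothetical names, and the `fake_*` functions stand in for the LLM, the compiler, and the symbolic execution engine so the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    stage: str                       # "compile" or "symex"
    ok: bool
    detail: str                      # diagnostic text handed back to the LLM
    witness: Optional[bytes] = None  # concrete input that triggers the bug


def refine_harness(
    spec: str,
    synthesize: Callable[[str, list], str],
    compile_harness: Callable[[str], Feedback],
    run_symex: Callable[[str], Feedback],
    max_turns: int = 5,
) -> Optional[bytes]:
    """Return a witness input, or None if the turn budget runs out."""
    history: list[Feedback] = []
    for _ in range(max_turns):
        harness = synthesize(spec, history)   # LLM sees all earlier failures
        feedback = compile_harness(harness)
        if feedback.ok:
            feedback = run_symex(harness)
            if feedback.ok and feedback.witness is not None:
                return feedback.witness
        history.append(feedback)              # failure becomes the next hint
    return None


# Stand-ins (assumptions) that fail twice before succeeding.
def fake_synthesize(spec: str, history: list) -> str:
    return f"harness-v{len(history)}"


def fake_compile(harness: str) -> Feedback:
    if harness == "harness-v0":
        return Feedback("compile", False, "error: unknown type name 'ctx_t'")
    return Feedback("compile", True, "")


def fake_symex(harness: str) -> Feedback:
    if harness == "harness-v1":
        return Feedback("symex", False, "target line not reached")
    return Feedback("symex", True, "assertion violated", witness=b"A" * 17)


witness = refine_harness("copy without size check", fake_synthesize,
                         fake_compile, fake_symex)
```

In this toy run the first harness fails to compile, the second compiles but never reaches the target, and the third yields a 17-byte witness. Lowering `max_turns` to 2 makes the loop give up and return `None`, which is how a turn budget would bound cost.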

Key terms in simple words:

  • Static analysis: scanning code for risky patterns without running it.
  • LLM: an AI assistant that can read, write, and fix code.
  • Symbolic execution: a method that treats inputs like variables and explores many paths at once to see if a bug is possible, then finds real inputs to trigger it.
  • Harness: a small, custom test program that sets up the exact conditions for a suspected bug.
  • Stubs: fake versions of functions to simplify testing.
  • Assertions: safety rules that define what should never happen (used to catch problems).
  • AddressSanitizer: a runtime tool that detects memory errors when the program runs.
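
To make the "assertions" and "symbolic execution" entries concrete, here is a toy model. It is not symbolic execution: exhaustive enumeration over a tiny input space stands in for the constraint solver, and the parser, its `flag` and `length` inputs, and the buffer size are all invented for the example.

```python
from itertools import product
from typing import Optional

DEST_SIZE = 16  # size of the destination buffer in this toy model


def bytes_copied(flag: int, length: int) -> int:
    """Toy parser: the bounds check only runs when flag == 0."""
    if flag == 0 and length > DEST_SIZE:
        return 0       # request rejected by the bounds check
    return length      # number of bytes that would be copied


def assertion_holds(flag: int, length: int) -> bool:
    """Safety rule: never copy more bytes than the destination can hold."""
    return bytes_copied(flag, length) <= DEST_SIZE


def find_witness() -> Optional[tuple]:
    """Return the first (flag, length) pair that violates the safety rule."""
    for flag, length in product(range(2), range(32)):
        if not assertion_holds(flag, length):
            return flag, length
    return None


witness = find_witness()  # (1, 17): the unchecked path with one byte too many
```

A symbolic engine reaches the same kind of answer without enumerating inputs: it collects the branch conditions along each path and asks a solver whether they can hold together with the negated assertion.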

What did they find, and why does it matter?

The team tested SAILOR on 10 large, real-world C/C++ projects (about 6.8 million lines of code total), including famous ones like FFmpeg, OpenSSL, libpng, SQLite, and GNU Binutils.

Here are the headline results:

  • SAILOR discovered 379 distinct, previously unknown memory-safety vulnerabilities (421 confirmed crashes).
  • The strongest competing approach they tested—an advanced AI agent with full codebase access—found only 12 vulnerabilities.
  • Each part of SAILOR mattered:
    • Without static analysis to aim the search, confirmed bugs dropped by about 12 times.
    • Without the LLM’s step-by-step harness building, zero vulnerabilities were confirmed.
    • Without symbolic execution, none of the approaches could find more than 12 real vulnerabilities.

Why this is important:

  • Memory-safety bugs (like buffer overflows and use-after-free) often lead to serious security problems. Finding them precisely and at scale is hard.
  • SAILOR confirms bugs by actually crashing the real program, which cuts down on false alarms and gives developers concrete proof and inputs to reproduce the issue.
  • It works on huge codebases where manual testing would be too slow or too complicated.

A quick example:

  • In GNU Binutils (a very large project), SAILOR found a place where the program copied data without checking that the destination had enough space. The LLM created a harness that set up the right structures; symbolic execution found exact input sizes that overflowed the buffer; and AddressSanitizer confirmed the real crash. That’s the full loop: suspect → targeted test → proof → real crash.

What’s the bigger impact?

  • Faster, more reliable security testing: SAILOR automates the hard, expert-only parts of symbolic execution, making it practical for very large projects.
  • Fewer false positives: Because every reported bug is confirmed by running the real program, developers get trustworthy results they can act on.
  • Extensible: Although this paper focused on memory-safety bugs, the same idea—static analysis for targets, LLMs for harness building, symbolic execution for proof, and real-world validation—could be adapted to other kinds of software bugs.
  • Practical limits: Some projects with very complex setup steps (for example, complicated library initializations) still challenged the system, so there’s room for future improvements.

In short, SAILOR is like a well-organized team: a map reader to find suspects, a builder to set up the right tests, an explorer to prove the path to a bug, and a referee to confirm the crash. Together, they make finding real, serious bugs in massive codebases much more achievable.

Knowledge Gaps, Limitations, and Open Questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, aimed to guide future research.

  • Coverage beyond memory safety: The approach and query suite target only memory-safety issues in C/C++; applicability to other bug classes (e.g., logic errors, race conditions, concurrency/atomicity violations, resource leaks) and other languages (e.g., Rust, Go, Java, C#) is not evaluated.
  • Ground-truth recall and false negatives: The paper reports the number of discovered bugs but does not quantify recall against a known ground-truth set or cross-check how many true SA findings were missed by SE+LLM (e.g., how many of 87,385 SA findings were true positives that SAILOR failed to confirm).
  • Precision/recall of the CodeQL rule suite: The 34-query suite’s detection quality (per-rule precision/recall, overlap with standard rules, impact of custom rules) is not measured; how rule tuning affects downstream SE success is unknown.
  • Entrypoint selection heuristics: The LLM_INFER strategy for selecting entrypoints via call-graph ascent is not compared to alternatives; its impact on reachability and missed vulnerabilities remains unquantified.
  • Stub soundness and over-approximation risks: Loop-to-if and branch neutralization (if(0)/if(1)), type-level reductions, and symbolic-return stubs may admit infeasible paths or suppress safety-critical invariants; there is no formal guarantee or empirical study of how often stubbing leads to false positives/negatives.
  • Harness realism for stateful systems: Failures on curl, OpenSSL, and SQLite highlight gaps in modeling multi-step API initialization, protocol/state machines, cryptographic context setup, and persistent DB state; techniques to automatically infer and instantiate realistic state (e.g., protocol traces, state summaries, learned object lifecycles) are needed.
  • Concrete replay fidelity: ASan replay failed for cases requiring complex internal state; the paper does not explore methods to reconstruct or validate state beyond byte-level inputs (e.g., using path constraints to synthesize higher-level states, transactional setup for DBs, or environment scaffolding).
  • Validation scope and severity: Confirmations rely on ASan crash presence without severity/exploitability assessment (e.g., controllability of write, impact on confidentiality/integrity/availability); integrating exploitability ranking or CVE triage is left open.
  • Deduplication criteria: Deduplication by (file, function, line) may under/over-merge distinct vulnerabilities at the same line or the same bug manifesting across sites; alternative canonicalization (e.g., slice-level or constraint-level signatures) is not explored.
  • LLM dependence and reproducibility: Results hinge on a proprietary model (gpt-5-0806); sensitivity to model choice, temperature/seeding, prompt variants, and non-determinism is not studied; reproducibility across LLMs and releases is unclear.
  • Cost and scalability profiling: Token usage is reported but total compute time, CPU/GPU energy cost, per-bug wall-clock cost, and scaling characteristics with project size/complexity (e.g., KLEE timeouts, path explosion) are not quantitatively profiled.
  • Sensitivity analysis of orchestrator parameters: The impact of Tmax, Texplore, Tklee, depth limits, and search strategies (random-path vs. depth-first) on detection rates and costs is not evaluated; adaptive budgets or auto-tuning are unexplored.
  • Alternative SE engines and configurations: The pipeline is bound to KLEE with fixed settings; comparative evaluation with other symbolic engines (e.g., Symbiotic, angr, S2E), solver back-ends, or search heuristics is absent.
  • Static analysis–SE coupling: How different SA sources (e.g., Clang Static Analyzer, Infer) or richer traces (e.g., interprocedural summaries) alter harness synthesis success is not investigated.
  • Artifact generalization and portability: The system targets Linux/Clang; portability across compilers, linkers, build systems (e.g., MSVC, Bazel, CMake variants), platforms (Windows/macOS), and dynamic libraries is not assessed.
  • Handling of macros/templates and complex C++ features: Robustness of code slicing and type-level stubs with heavy macro usage, templates, inline functions, and constexpr logic is not characterized.
  • Environment and external I/O modeling: System calls, file/network I/O, and complex external dependencies are stubbed; methods for realistic environment emulation (e.g., filesystem snapshots, network protocol states, syscall models) are left open.
  • Template generality for assertions: Per-CWE templates cover a subset of memory-safety patterns; coverage gaps (e.g., nuanced integer overflows, pointer aliasing constraints) and systematic extension/verification of templates are not discussed.
  • Triage of SA finding filters: File/function skip patterns may remove actionable targets; their effect on recall and how to tune filters per project is not evaluated.
  • Post-discovery processes: Vulnerability disclosure to maintainers, patch validation, regression testing, and longitudinal tracking (e.g., whether issues are fixed upstream) are not integrated or studied.
  • Multi-bug interactions and path dependencies: The approach treats each spec independently; orchestration for interacting vulnerabilities or chained conditions (e.g., precondition bug enabling a deeper bug) is not explored.
  • Metrics for harness quality: There is no metric or automated check for harness sufficiency/realism (beyond compilation and reachability); defining and measuring harness quality remains open.
  • Learning from failures: Refinement logs categorize compile/SE diagnostics, but systematic analysis of failure modes (e.g., incomplete type resolution, conflicting prototypes, unreachable sinks) and targeted remediation strategies is missing.
  • Security policy/attacker model: The paper assumes crashing inputs suffice for validation; a formal attacker model (privilege, input channels, assumptions) and policy on what constitutes a security-relevant finding is not defined.
  • Data and tooling release: Availability of code, queries, specs, harnesses, and reproducibility artifacts (including prompts and seeds) is not specified; without this, independent replication is limited.
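
The deduplication concern raised in the list above can be illustrated with a short sketch that groups crashes by (file, function, line). The `Crash` type and the sample data are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Crash(NamedTuple):
    file: str
    function: str
    line: int
    kind: str        # sanitizer error type
    witness_id: str  # which replayed input produced the crash


def deduplicate(crashes: list) -> dict:
    """Group crashes that share the same (file, function, line) key."""
    groups = defaultdict(list)
    for crash in crashes:
        groups[(crash.file, crash.function, crash.line)].append(crash)
    return dict(groups)


# Illustrative data: four crashes at three distinct locations.
crashes = [
    Crash("src/parse.c", "read_chunk", 42, "heap-buffer-overflow", "w1"),
    Crash("src/parse.c", "read_chunk", 42, "heap-use-after-free", "w2"),
    Crash("src/parse.c", "read_chunk", 57, "heap-buffer-overflow", "w3"),
    Crash("src/image.c", "decode_row", 10, "stack-buffer-overflow", "w4"),
]

unique = deduplicate(crashes)
merged_kinds = {c.kind for c in unique[("src/parse.c", "read_chunk", 42)]}
```

Here four crashes collapse into three entries, which is how a crash count can exceed a vulnerability count. The first group also shows the over-merging risk: two different error types at the same line are counted as one vulnerability under this key.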

Practical Applications

Overview

The paper introduces SAILOR, a three-phase pipeline that combines static analysis (CodeQL), LLM-orchestrated harness synthesis, symbolic execution (KLEE), and concrete replay (AddressSanitizer) to automatically discover and validate memory-safety vulnerabilities in large C/C++ codebases. It found 379 previously unknown vulnerabilities across 6.8M LOC, vastly outperforming LLM-only and other baselines. Below are practical applications that draw directly from SAILOR’s findings, methods, and innovations.

Immediate Applications

The following use cases can be implemented today with existing tools and reasonable engineering effort, especially for C/C++ projects that can be built to LLVM bitcode and run under ASan.

  • Automated vulnerability triage and discovery for C/C++ codebases
    • Sectors: software, cybersecurity services, open-source
    • Workflow/product: “SAILOR-as-a-Service” scanner that ingests repositories + build scripts, runs CodeQL → LLM harness synthesis → KLEE → ASan replay, and emits evidence packs per bug (specification, witness inputs, path constraints, crash traces)
    • Assumptions/dependencies: Buildable project with compile_commands.json; LLVM/Clang toolchain; CodeQL DB build; access to a strong LLM; compute budget for LLM+KLEE runs; memory-safety focus
  • CI/CD guardrails for memory safety
    • Sectors: software, finance, healthcare, energy, embedded/robotics
    • Workflow/product: GitHub Actions/GitLab CI job that runs nightly/PR-scoped scans; fails the pipeline on ASan-confirmed crashes; opens issues with reproducible drivers and inputs
    • Assumptions/dependencies: Deterministic builds; PR or nightly budget (time/tokens); sanitizer-enabled builds; per-project turn/token limits to control cost
  • Prioritization of static analysis alerts with evidence
    • Sectors: software, enterprise AppSec
    • Workflow/product: Integrate SAILOR outputs with SAST dashboards to rank CodeQL findings by “ASan-confirmed” vs. “site-reached/no-crash” vs. “unreachable,” reducing alert fatigue and focusing remediation
    • Assumptions/dependencies: CodeQL query suite configured for the org; connector to existing SAST/SIEM tools
  • Evidence-backed bug bounty triage
    • Sectors: cybersecurity, open-source foundations
    • Workflow/product: Use SAILOR to reproduce/validate third-party reports; attach crash traces and concrete inputs to disclosure; reject unconfirmed claims confidently
    • Assumptions/dependencies: Legal/process alignment for handling PoCs; ability to build and instrument reported versions
  • Fuzzing bootstrapping and seed generation
    • Sectors: software, security research
    • Workflow/product: Feed KLEE-produced witness .ktest inputs into libFuzzer/AFL as seeds; add replay drivers as harnesses to expand coverage; recycle failing seeds into regression suites
    • Assumptions/dependencies: Fuzz harness availability or minor refactoring to accept SAILOR’s witness inputs; sanitizer-enabled fuzz builds
  • Regression test generation from verified crashes
    • Sectors: software across domains using C/C++
    • Workflow/product: Transform concrete replay drivers + inputs into unit/integration tests to prevent reintroduction; store in regression test suites
    • Assumptions/dependencies: Stable APIs to host tests; handling of environment dependencies in CI
  • Supply-chain dependency screening
    • Sectors: finance, healthcare, government procurement, ISVs
    • Workflow/product: Scan third-party C/C++ libraries in SBOMs prior to adoption; attach evidence packs to risk assessments; gate inclusion on zero ASan-confirmed issues
    • Assumptions/dependencies: Access to source and build scripts of dependencies; standardized artifact ingestion; legal permissions to test
  • Code-review copilots for memory-safety hotspots
    • Sectors: software
    • Workflow/product: IDE extensions that run Phase 1 (CodeQL) and surface vulnerability specifications + assertion templates inline; offer one-click local run of SAILOR on modified files/functions
    • Assumptions/dependencies: Developer machine support for CodeQL and local KLEE/ASan; LLM/API access with cached prompts
  • Academic teaching and reproducible research packs
    • Sectors: academia, education
    • Workflow/product: Distribute SAILOR’s evidence packs (specs, path constraints, witness inputs, crash traces) as lab materials for courses on program analysis, SE, and secure coding
    • Assumptions/dependencies: Licensing/redistribution rights for code and artifacts
  • Targeted scanning for embedded/IoT firmware source trees
    • Sectors: robotics, IoT, automotive (userland libraries), medical devices (non-kernel)
    • Workflow/product: Build host-sanitized variants of libraries and run SAILOR to catch memory issues early in development; export verified tests to embedded regressions
    • Assumptions/dependencies: Host-buildable components or simulator builds; ASan-compatible builds (userland), LLVM toolchain; modeling for hardware-dependent stubs may be limited
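
Two of the workflows above, ranking alerts by evidence and failing a CI job on confirmed crashes, reduce to a small amount of glue code. The sketch below is one possible shape for it. The three evidence tiers follow the labels used in the list, while the type names, the sample findings, and the exit-code convention are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Evidence(IntEnum):
    ASAN_CONFIRMED = 0         # crash reproduced on the real build
    SITE_REACHED_NO_CRASH = 1  # target line reached, no violation found
    UNREACHABLE = 2            # target line never reached


@dataclass(frozen=True)
class TriagedFinding:
    rule_id: str
    file: str
    line: int
    evidence: Evidence


def rank(findings: list) -> list:
    """Strongest evidence first, then by file and line for stable output."""
    return sorted(findings, key=lambda f: (f.evidence, f.file, f.line))


def ci_exit_code(findings: list) -> int:
    """Return a non-zero code when any finding is ASan-confirmed."""
    confirmed = any(f.evidence == Evidence.ASAN_CONFIRMED for f in findings)
    return 1 if confirmed else 0


# Illustrative findings (rule identifiers are made up).
findings = [
    TriagedFinding("example/unbounded-copy", "src/net.c", 88, Evidence.UNREACHABLE),
    TriagedFinding("example/unbounded-copy", "src/parse.c", 42, Evidence.ASAN_CONFIRMED),
    TriagedFinding("example/use-after-free", "src/cache.c", 15, Evidence.SITE_REACHED_NO_CRASH),
]

ranked = rank(findings)
exit_code = ci_exit_code(findings)
```

A CI job would pass `exit_code` to `sys.exit` so that a confirmed crash blocks the merge, while the lower tiers only affect ordering in the dashboard.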

Long-Term Applications

These applications require further R&D, scaling, or ecosystem changes (e.g., broader language support, more robust environment modeling, or standardization).

  • Generalizing beyond memory-safety vulnerabilities
    • Sectors: software, safety-critical systems
    • Tools/products: Extend per-CWE assertion templates (e.g., integer overflows, logic violations) and integrate additional oracles (ThreadSanitizer for races, UBSan for undefined behavior)
    • Assumptions/dependencies: Precise templates for new CWEs; SE support for concurrency and scheduling; richer environment models
  • Automatic remediation and patch synthesis
    • Sectors: software, open-source
    • Tools/products: Couple SAILOR’s witness constraints with LLM-based patch generation; auto-suggest PRs with test cases; verify fixes by re-running SE/ASan
    • Assumptions/dependencies: High-fidelity fix suggestion models; maintainers’ acceptance; guards against functional and performance regressions
  • Language and platform expansion
    • Sectors: mixed-language systems, mobile, kernels
    • Tools/products: Support for Rust unsafe blocks, Go/Cgo, and binary-only targets via lifters (e.g., McSema/Ghidra to LLVM IR); kernel-mode sanitizers or hypervisor-based replay
    • Assumptions/dependencies: High-quality lifting and debug info; sanitizer support in new environments; refined stubbing for syscalls/ABI boundaries
  • Ecosystem-scale scanning (distros and package registries)
    • Sectors: public sector, cloud platforms, OS vendors
    • Tools/products: Clustered SAILOR deployments scanning Debian/Yocto ecosystems; continuous monitoring for new commits; dashboards for maintainers and downstream consumers
    • Assumptions/dependencies: Compute orchestration at scale; incremental builds; cost controls for LLM tokens and SE timeouts
  • Evidence standards for procurement and certification
    • Sectors: government, defense, healthcare, automotive, aerospace
    • Policies/workflows: Require “verifiable evidence packs” (specs + replay + ASan trace) in software deliverables; map to ISO 26262, DO-178C, FDA premarket guidance
    • Assumptions/dependencies: Regulator and standards-body acceptance; standardized formats and audit workflows
  • Continuous, incremental SE in CI (change-based targeting)
    • Sectors: software at scale
    • Tools/products: Integrate incremental static analysis to target only changed functions and call chains; run micro-harnesses per PR with minute-level budgets
    • Assumptions/dependencies: Precise diff-to-callgraph mapping; caching of exploration context; faster SE engines or parallelism
  • Stronger environment modeling and domain-specific stubs
    • Sectors: networking (TLS, HTTP), databases, multimedia
    • Tools/products: Libraries of vetted stubs/models for common subsystems (OpenSSL EVP pipelines, SQLite pager/schema setup), enabling deeper reach into complex APIs where SAILOR struggled (e.g., curl/OpenSSL/SQLite)
    • Assumptions/dependencies: Community-maintained model repositories; validation to avoid unrealistic behaviors
  • Integration with SBOM/SCA and risk scoring
    • Sectors: insurance, compliance, supply-chain security
    • Tools/products: Combine confirmed bug density and severity with SBOMs to compute supplier risk; incentivize vendors to submit SAILOR evidence in security scorecards
    • Assumptions/dependencies: Industry-accepted risk models; data-sharing agreements; privacy-preserving aggregation
  • Training data for next-gen code models
    • Sectors: AI/ML, developer tools
    • Tools/products: Use SAILOR’s verified pairs (buggy context, witness constraints, crash traces) to train LLMs that better reason about concrete execution and memory safety
    • Assumptions/dependencies: Curated, license-compliant datasets; model architectures that leverage path constraints and oracles
  • Developer-facing “live” SE assistants
    • Sectors: software
    • Tools/products: IDE-integrated, on-demand harness synthesis and targeted SE on edited functions with sub-10s turnaround; immediate feedback on safety properties before commit
    • Assumptions/dependencies: Lightweight SE engines or cloud offload; aggressive caching; ergonomic UX for developers
  • Binary and closed-source auditing via hybrid lifting and emulation
    • Sectors: third-party audits, ICS/critical infrastructure
    • Tools/products: Apply SAILOR to binaries using lifting + emulated environment models to detect memory-safety issues in closed-source dependencies
    • Assumptions/dependencies: Robust binary lifters; accurate emulation for syscalls/IO; legal/contractual permissions
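
The change-based targeting idea in the list above depends on mapping a diff to the functions whose behavior may be affected. One simple way to approximate that, shown here as a sketch rather than anything the paper describes, is a bounded breadth-first walk up an inverted call graph. The graph, the function names, and the depth limit are invented for the example.

```python
from collections import deque


def affected_functions(call_graph: dict, changed: set, max_depth: int) -> set:
    """Return the changed functions plus callers within max_depth call edges."""
    # Invert caller -> callees into callee -> callers.
    callers: dict = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    seen = set(changed)
    queue = deque((func, 0) for func in changed)
    while queue:
        func, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb past the depth budget
        for caller in callers.get(func, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append((caller, depth + 1))
    return seen


# Illustrative call graph: caller -> set of callees.
CALL_GRAPH = {
    "main": {"parse", "render"},
    "parse": {"read_chunk"},
    "render": {"draw"},
    "read_chunk": {"copy_bytes"},
}

to_scan = affected_functions(CALL_GRAPH, {"copy_bytes"}, max_depth=2)
```

With a depth limit of 2, a change to `copy_bytes` selects `copy_bytes`, `read_chunk`, and `parse`, but not `main` or the unrelated `render` branch. Only static-analysis findings inside that set would then be handed to harness synthesis for the pull request.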

Notes on Feasibility and Limitations

  • Scope: SAILOR currently targets C/C++ memory-safety issues; extension to other languages and bug classes needs additional templates, models, and oracles.
  • Buildability: Requires projects to compile to LLVM bitcode and to support ASan builds; kernel/driver code and some embedded targets may need alternate strategies.
  • Environment complexity: Complex multi-step APIs (curl/OpenSSL/SQLite) remain challenging without domain-specific models; expect gaps until robust stub libraries mature.
  • Cost and performance: LLM token consumption and SE timeouts must be managed (budgets, parallelization, prioritization); incremental, change-based workflows can improve ROI.
  • Correctness and realism: Automated stubs must avoid unrealistic behaviors; Phase 3 replay mitigates false positives but may fail when SE produces states not reconstructible in real builds.
  • Security and compliance: Running untrusted code under sanitizers demands sandboxing; sharing evidence packs must respect licensing and sensitive data policies.

By combining scalable static targeting, LLM-guided harness synthesis, precise path exploration, and independent concrete validation, the SAILOR paradigm enables practical, evidence-based vulnerability discovery that can be deployed today in CI and security workflows, and evolved into broader assurance and remediation platforms over time.

Glossary

  • AddressSanitizer: A compiler-based runtime tool that detects memory errors like buffer overflows and use-after-free by instrumenting code. "Then the unmodified project source is compiled with AddressSanitizer (-fsanitize=address), producing an instrumented static archive (.a)."
  • Agentic vulnerability detection: An autonomous LLM-driven approach that plans and executes code analysis actions to find vulnerabilities. "(agentic vulnerability detection using Claude Code with full codebase access and unlimited interaction)"
  • ASan: The runtime component of AddressSanitizer that reports memory safety violations during execution. "A crash is classified as confirmed only if the ASan stack trace reports a memory safety violation inside the project's source code."
  • Assertion template: A parameterized safety condition (often per CWE) that guides how to encode and check a property during analysis. "The vulnerability specification includes a per-CWE assertion template (Table 2) that guides the LLM in encoding the safety property during Phase 2 (§4.3)."
  • Call chain: The sequence of function calls from an entry function to a target site. "The LLM identifies conditional statements along the call chain from e to fo that cause early exit before reaching ℓ (e.g., if (p == NULL) return)."
  • Call graph: A directed graph representing calling relationships between functions in a program. "the orchestrator searches the call graph upward from the vulnerable function fo to find the nearest non-static caller as an initial e"
  • Code slice: A minimal, self-contained subset of code that preserves the path to a target statement while stubbing unrelated parts. "The LLM constructs a self-contained code slice: A C file containing only the code along the call chain from e to ℓ, with all external dependencies replaced by stubs."
  • CodeQL: A query language and platform for semantic code analysis used to find vulnerability patterns. "SAILOR builds a CodeQL database from the project's source and build configuration"
  • Concrete replay: Re-executing a program with concrete inputs (derived from symbolic execution) on the unmodified code to validate findings. "concrete replay validates the symbolic execution results against the unmodified project source."
  • Concrete validation: The final phase that confirms symbolic findings by running real binaries with sanitizers. "Concrete validation (§5): Witness inputs are replayed against the unmodified project source under AddressSanitizer, producing a verdict independent of both the LLM and the harness"
  • Constraint solver: A tool that determines if logical constraints are satisfiable and can produce concrete assignments for symbolic values. "and it uses a constraint solver to produce concrete witness inputs that formally confirm or refute a property violation."
  • Data-flow trace: An ordered list of program points tracking how data moves from a source to a sink. "Data-flow trace t = (s, ..., ℓ): an ordered sequence of program points from a data-flow source s to the sink at ℓ."
  • Deduplicated: Merged so that multiple crash reports for the same underlying bug location count as one finding. "deduplicated by file, function, and line"
  • Dual-strategy search: A symbolic execution search policy combining two strategies to balance coverage and depth. "KLEE explores the bitcode with dual-strategy search (random-path + depth-first), a 300 s per-run timeout, and a depth limit of 1,000."
  • Entrypoint: The function selected as the starting point for analysis or symbolic execution. "Determines which project function serves as the symbolic execution entrypoint e."
  • Fact pack: A bundle of extracted code-level hints (e.g., suspect calls, pointers, lengths) augmenting a static finding. "SAILOR enriches it into a fact pack of code-level hints"
  • Fuzzing: Automated testing that mutates inputs to trigger unexpected behavior or crashes. "Fuzzing tools [11, 26] complement SA by exercising the program with concrete inputs, yet they struggle to reach deep library internals that require precisely structured program state."
  • Guard conditions: Predicates that must be satisfied to pass early checks that would otherwise prevent reaching a target site. "Guard conditions: The LLM identifies conditional statements along the call chain from e to fo that cause early exit before reaching ℓ (e.g., if (p == NULL) return)."
  • Harness: A driver and associated scaffolding that sets up program state, models the environment, and encodes assertions for analysis. "which automates symbolic execution harness construction by combining static analysis with LLM-based synthesis."
  • Heap-buffer-overflow: A memory safety error where writes or reads go beyond the bounds of a heap allocation. "ASan confirms a heap-buffer-overflow in _bfd_x86_elf_late_size_sections, reading 8 bytes past a 4-byte allocation."
  • Inter-procedural: Spanning across function boundaries, often referring to analysis that tracks behavior across calls. "Inter-procedural queries (e.g., use-after-free: s is the free call, ℓ is the subsequent dereference)"
  • KLEE: A symbolic execution engine for LLVM bitcode that explores program paths and finds bugs with concrete test cases. "KLEE explores the bitcode with dual-strategy search (random-path + depth-first), a 300 s per-run timeout, and a depth limit of 1,000."
  • klee_assert: A KLEE intrinsic that triggers an assertion failure when its condition is false; with a constant-false condition it marks a path as having reached a sink. "A klee_assert(0) is placed immediately after the vulnerable statement in the code slice."
  • klee_assume: A KLEE intrinsic that constrains symbolic variables by assuming a condition holds, so that paths violating the condition are discarded. "the driver encodes klee_assume(¬c) to bypass the early-exit path."
  • klee_make_symbolic: A KLEE intrinsic that marks memory as symbolic so the executor explores multiple concrete valuations. "Fields that appear in the vulnerability condition are declared as symbolic via klee_make_symbolic"
  • klee_warning_once: A KLEE helper that logs a message once per location, often used for coverage probes. "the orchestrator reports which functions in the call chain were entered (via klee_warning_once coverage probes) and which were not"
  • LLVM bitcode: The serialized form of LLVM's intermediate representation, consumed by the LLVM toolchain and by KLEE for analysis and execution. "H is compiled to LLVM bitcode (clang -O0 -g) and linked into a single module (llvm-link)."
  • llvm-link: An LLVM tool that links multiple bitcode files into a single module. "H is compiled to LLVM bitcode (clang -O0 -g) and linked into a single module (llvm-link)."
  • LLM-orchestrated SE: Symbolic execution guided and iteratively refined by an LLM using feedback loops. "LLM-orchestrated SE (§4): For each vulnerability specification generated in the first phase, an LLM iteratively synthesizes a harness"
  • LLMs: Large language models, used for code reasoning, synthesis, and orchestration in the analysis pipeline. "Recently, LLMs have been applied to vulnerability detection [10, 17, 30], but their outputs carry no formal correctness guarantees."
  • Null-pointer dereference: An error from dereferencing a pointer that is NULL, leading to a crash or undefined behavior. "Two null checks (lines 2696-2697) guard against null-pointer dereferences but do not enforce that size is within the bounds of contents."
  • OSS-Fuzz: A continuous fuzzing service for open-source software, providing infrastructure and harnesses. "We collect existing OSS-Fuzz harnesses and convert them to KLEE-compatible drivers by replacing the fuzz inputs with symbolic variables."
  • Path explosion: The rapid growth of paths explored in symbolic execution due to branching, limiting scalability. "Path explosion limits the depth of exploration"
  • Random-path: A search heuristic in symbolic execution that randomly selects paths to explore. "dual-strategy search (random-path + depth-first)"
  • SARIF: The Static Analysis Results Interchange Format for representing and sharing static analysis findings. "SARIF (Static Analysis Results Interchange Format) [25]"
  • Stack-buffer-overflow: A memory safety error where stack-allocated buffers are accessed out of bounds. "stack-buffer-overflow (15, 4%)"
  • Symbolic execution: Program analysis that treats inputs as symbolic values and explores paths using constraints to find concrete counterexamples. "Symbolic execution (SE) [7, 18] offers certain advantages"
  • Type confusion: Using a memory region as a different type than it was allocated for, often leading to invalid accesses. "(iii) lifetime and type-confusion patterns, including dangling pointers after free and use of memory reclaimed by a different type."
  • Use-after-free: Accessing memory after it has been freed, a dangerous temporal memory safety bug. "use-after-free (56, 13%)"
  • Witness inputs: Concrete test inputs produced by the solver that demonstrate a property violation. "KLEE detects a memory safety violation at ℓ and produces concrete witness inputs (.ktest files)."
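
Several of the KLEE terms above (klee_make_symbolic, klee_assume, klee_assert, guard conditions, witness inputs) fit together in a harness. The following is a minimal sketch in C of how they might combine. It is not SAILOR's generated output. The names USE_KLEE, slice_read, drive, and the outcome enum are hypothetical, as are the native stand-ins that let the file run without KLEE. The sketch also places klee_assert(0) on the branch where the bounds property fails, a common KLEE idiom, rather than after the vulnerable statement as the paper describes.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef USE_KLEE   /* hypothetical flag: define it when building bitcode for KLEE */
#include <klee/klee.h>
#else
/* Native stand-ins (assumptions of this sketch, not KLEE code): values stay
   concrete, and failed assumptions/assertions are counted instead of
   terminating the path. */
static int assume_failures, assert_failures;
static void klee_make_symbolic(void *addr, size_t nbytes, const char *name) {
    (void)addr; (void)nbytes; (void)name;   /* keep the bytes the caller stored */
}
static void klee_assume(uintptr_t cond) { if (!cond) assume_failures++; }
#define klee_assert(expr) ((expr) ? (void)0 : (void)assert_failures++)
#endif

enum outcome { EARLY_EXIT = -1, SAFE = 0, VIOLATION = 1, PRUNED = 2 };

/* Hypothetical code slice: reads `size` bytes from a buffer of `alloc` bytes. */
static enum outcome slice_read(const unsigned char *contents, size_t alloc,
                               size_t size) {
    /* Guard conditions: early exits that check for null/empty, not for bounds. */
    if (contents == NULL || size == 0) return EARLY_EXIT;
    if (size > alloc) {      /* safety property violated: size exceeds allocation */
        klee_assert(0);      /* under KLEE this path ends with an assertion error */
        return VIOLATION;    /* natively, stop before the out-of-bounds read */
    }
    unsigned sum = 0;
    for (size_t i = 0; i < size; i++) sum += contents[i];
    (void)sum;
    return SAFE;
}

/* Driver: `seed` is the concrete value used natively; under KLEE the call to
   klee_make_symbolic replaces it with an unconstrained symbolic value. */
enum outcome drive(size_t seed) {
    static const unsigned char contents[4] = {1, 2, 3, 4};
    size_t size = seed;
    klee_make_symbolic(&size, sizeof size, "size");
    klee_assume(!(size == 0));   /* negated guard: bypass the early-exit path */
#ifndef USE_KLEE
    if (assume_failures) { assume_failures = 0; return PRUNED; }
#endif
    return slice_read(contents, sizeof contents, size);
}

#ifdef USE_KLEE
int main(void) { drive(0); return 0; }
#endif
```

Built natively, drive(4) returns SAFE and drive(12) returns VIOLATION, which loosely mirrors replaying a concrete witness. Built with USE_KLEE defined, the seed no longer matters because size is symbolic, and KLEE would be expected to report the failing assertion together with a witness whose size exceeds 4.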
