
SmartOracle: Modular LLM Oracle

Updated 28 January 2026
  • SmartOracle is a modular, agentic LLM-driven oracle designed to decompose differential fuzzing triage tasks, enhancing true bug detection in underspecified environments.
  • It employs specialized LLM sub-agents, such as discrepancy finders and specification checkers, to collaboratively gather evidence and reduce manual filtering.
  • Empirical evaluations show SmartOracle achieving 84% recall and 4× faster triage than traditional rule-based approaches, significantly cutting costs.

SmartOracle is a modular, agentic LLM-driven oracle designed to reduce noise and manual effort in the triage of divergent results discovered through differential fuzzing, particularly targeting complex, underspecified ecosystems such as JavaScript engines. Instead of relying on brittle, manual rule-based filters, SmartOracle decomposes triage into collaborative evidence-gathering and reasoning tasks performed by specialized LLM sub-agents. This architecture improves true-bug detection, reduces false positives, and substantially accelerates and economizes the process of differential bug discovery and reporting (Srinivasan et al., 21 Jan 2026).

1. Differential Fuzzing and the Oracle Problem

Differential fuzzing executes identical inputs on two or more systems under test to detect behavioral divergences suggestive of specification violations. Automated oracles are essential: after each flagged divergence, they decide if the mismatch reflects a true specification violation (bug) or a benign difference arising from implementation-defined behavior, formatting, or optimizations. Naive oracles that conservatively treat every difference as a bug generate overwhelming noise, while manually constructed filters are expensive to author, highly fragile, and have to be re-engineered whenever the specification or implementations evolve. The JavaScript ecosystem exemplifies these problems: ECMAScript leaves many behaviors intentionally underspecified, introducing substantial ambiguity into automated oracle construction (Srinivasan et al., 21 Jan 2026).
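The divergence-detection step that feeds the oracle can be sketched as follows. This is a minimal illustration rather than SmartOracle's harness: the engines are hypothetical Python callables standing in for real JavaScript engine processes, and `group_by_output` and `is_divergent` are names introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-in for a real engine: maps a program to its observable output.
Engine = Callable[[str], str]


def group_by_output(engines: Dict[str, Engine], program: str) -> Dict[str, List[str]]:
    """Run one program on every engine and group engine names by the output produced."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, run in engines.items():
        groups[run(program)].append(name)
    return dict(groups)


def is_divergent(groups: Dict[str, List[str]]) -> bool:
    """A divergence exists when the engines produced more than one distinct output."""
    return len(groups) > 1


# Toy engines (assumptions for illustration only).
toy_engines: Dict[str, Engine] = {
    "engine_a": lambda program: "1",
    "engine_b": lambda program: "1",
    "engine_c": lambda program: "TypeError",
}

groups = group_by_output(toy_engines, "print(1)")
print(groups, is_divergent(groups))
```

Detecting the divergence is the easy part; deciding whether `engine_c` is wrong or merely exercising permitted freedom is the oracle problem described above.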

2. Manual Oracle Maintenance Challenges

Manual oracle construction for JavaScript engines is hampered by three principal issues:

  • Specification Complexity and Underspecification: Many ECMAScript features allow or require implementation-defined behavior (e.g., Function.prototype.toString()), so outputs cannot be normalized with simple syntactic checks. Divergences arise not only from bugs but also from valid inter-engine variability.
  • False Positives from Benign Divergence: Differences in formatting, property order, or engine-specific optimizations are common among prominent engines such as V8, SpiderMonkey, JavaScriptCore, and GraalJS. These benign differences can overwhelm naive filtering approaches.
  • High Cost of Rule-Based Filters: Manually coding, maintaining, and adapting rule-based filters across evolving engines and emerging language features is labor-intensive, error-prone, and typically lags behind the specification and implementation changes (Srinivasan et al., 21 Jan 2026).
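To make the fragility concrete, here is a hedged sketch of the kind of hand-written filter the list above describes. The rule (collapse whitespace before comparing) and the sample output strings are assumptions chosen for illustration; they are not taken from any real engine or from SmartOracle.

```python
import re


def normalize(output: str) -> str:
    """Collapse every whitespace run to one space: a typical hand-written rule."""
    return re.sub(r"\s+", " ", output).strip()


def rule_based_is_bug(out_a: str, out_b: str) -> bool:
    """Flag a divergence as a bug whenever the normalized outputs still differ."""
    return normalize(out_a) != normalize(out_b)


# Formatting-only difference: the rule suppresses it as intended.
print(rule_based_is_bug("function f() { }", "function f() {\n}"))

# Ordering-only difference: the rule does not cover it, so it is flagged,
# and a new rule has to be written and maintained by hand.
print(rule_based_is_bug("a,b", "b,a"))
```

Each newly observed class of benign divergence needs another rule of this kind, which is the maintenance burden the third bullet refers to.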

3. SmartOracle Architecture and Agentic Workflow

SmartOracle replaces rule-centric approaches with a multi-agent LLM orchestrator that decomposes decision-making into well-scoped, interacting sub-agents, each leveraging targeted tools and knowledge sources. The key components are:

  • Discrepancy Finder: Normalizes and structures divergences, proposes likely root causes, and extracts minimized reproducible test cases. It has access to two tools: terminal(engine_name, code) for re-execution and spec(query) for retrieving specification fragments.
  • Specification Checker: Grounds analysis in authoritative ECMA-262 passages by extracting relevant text directly from a cached specification, eschewing heuristic matching.
  • Duplicate Analyzer: Uses top-k similarity searching to suppress redundant reporting of already known or previously suppressed issues.
  • False Positive Critic: Examines agent reasoning for over-reporting and hallucination, acting as a guardrail against false positives, especially where behavior is underspecified or implementation-defined.
  • Reasoning Confidence Checker: Quantifies confidence (in [0.0, 1.0]) based on the joint evidence amassed by all agents.
  • Test Case Minimizer: Further reduces the input to a minimal form that still reproduces the divergence, enhancing clarity and reportability.
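The Duplicate Analyzer's top-k similarity search can be illustrated with a small sketch. The source does not say which similarity measure or threshold is used, so the token-level Jaccard score, the `threshold` value, and all function names below are assumptions made for this example.

```python
import heapq
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace-separated tokens of a report summary."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two token sets (assumed measure, not from the source)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def top_k_similar(query: str, known: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k known reports most similar to the query, best first."""
    scored = [(jaccard(query, report), report) for report in known]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def is_duplicate(query: str, known: List[str], k: int = 3, threshold: float = 0.6) -> bool:
    """Treat the query as a duplicate if any top-k neighbour meets the threshold."""
    return any(score >= threshold for score, _ in top_k_similar(query, known, k))


# Hypothetical report summaries, for illustration only.
known_reports = [
    "typeerror not thrown for frozen array push",
    "regexp lastindex differs after failed match",
]
print(is_duplicate("typeerror not thrown for frozen array pop", known_reports))
print(is_duplicate("date parsing accepts invalid month", known_reports))
```

A production system would more plausibly compare embeddings than raw tokens; the retrieve-then-threshold structure is the part this sketch is meant to show.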

Orchestrator Workflow: The orchestrator structures the divergence, sequences sub-agent invocations, aggregates findings, and makes a final report-or-skip decision based on confidence thresholds and formal specification checks (Srinivasan et al., 21 Jan 2026).
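The control flow of such an orchestrator might look like the sketch below. It is a simplification under stated assumptions: sub-agents are plain Python callables rather than LLM calls, confidence is a placeholder vote fraction instead of the Reasoning Confidence Checker's output, and the `Evidence`, `Verdict`, `orchestrate`, and `stub_agent` names and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Evidence:
    agent: str          # which sub-agent produced this finding
    supports_bug: bool  # whether the finding points to a real specification violation
    note: str


@dataclass
class Verdict:
    report: bool
    confidence: float
    evidence: List[Evidence]


SubAgent = Callable[[Dict[str, str]], Evidence]


def vote_confidence(evidence: List[Evidence]) -> float:
    """Placeholder confidence in [0.0, 1.0]: fraction of findings that support a bug."""
    if not evidence:
        return 0.0
    return sum(e.supports_bug for e in evidence) / len(evidence)


def orchestrate(divergence: Dict[str, str], agents: List[SubAgent],
                threshold: float = 0.7) -> Verdict:
    """Invoke sub-agents in sequence, aggregate findings, then report or skip."""
    evidence = [agent(divergence) for agent in agents]
    confidence = vote_confidence(evidence)
    return Verdict(confidence >= threshold, confidence, evidence)


def stub_agent(name: str, supports_bug: bool) -> SubAgent:
    """Build a stand-in sub-agent that always returns the same stance."""
    def run(divergence: Dict[str, str]) -> Evidence:
        return Evidence(name, supports_bug, f"stubbed finding for {divergence['program']}")
    return run


divergence = {"program": "example.js", "outputs": "engine_a: 1 / engine_c: TypeError"}
agents = [
    stub_agent("discrepancy_finder", True),
    stub_agent("specification_checker", True),
    stub_agent("false_positive_critic", False),
]
verdict = orchestrate(divergence, agents)
print(verdict.report, round(verdict.confidence, 2))
```

Keeping each sub-agent behind a uniform callable interface is one way to let individual agents be swapped or re-prompted without touching the aggregation logic.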

4. Empirical Evaluation and Comparative Analysis

Historical Benchmarks

SmartOracle was evaluated on datasets including Park et al. (2021; 44 bugs), Lima et al. (2021; 16 bugs), and 238 manually labeled in-house findings. Performance metrics:

  • Recall (True Bug Detection Rate): 0.84 (Park set), ∼0.75 (Lima set), ∼0.73 (manual set)
  • False Positive Rate: 18% on a held-out set of 136 known non-bugs

Baseline Comparison

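| Method      | Recall | Time/Case (s) | Tokens/Case (×10³) | Cost/Case (USD) |
|-------------|--------|---------------|--------------------|-----------------|
| SmartOracle | 0.84   | 20            | 6.5                | 0.003           |
| LRM + CoT   | 0.68   | 91            | 14.7               | 0.04            |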

SmartOracle achieves 16 percentage points higher recall (0.84 vs. 0.68), roughly 4× faster triage (20 s vs. 91 s per case), and roughly 10× lower cost per case than a chain-of-thought Gemini 2.5 Pro baseline.
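The headline ratios follow directly from the baseline comparison figures. The short calculation below recomputes them from the reported per-case values; only the variable names are introduced here.

```python
# Per-case figures as reported in the baseline comparison.
smart_oracle = {"recall": 0.84, "time_s": 20, "tokens_k": 6.5, "cost_usd": 0.003}
lrm_cot = {"recall": 0.68, "time_s": 91, "tokens_k": 14.7, "cost_usd": 0.04}

# Recall gain in percentage points.
recall_gain_points = round((smart_oracle["recall"] - lrm_cot["recall"]) * 100)

# Ratios of baseline to SmartOracle (higher means SmartOracle is cheaper or faster).
speedup = lrm_cot["time_s"] / smart_oracle["time_s"]
token_ratio = lrm_cot["tokens_k"] / smart_oracle["tokens_k"]
cost_ratio = lrm_cot["cost_usd"] / smart_oracle["cost_usd"]

print(f"recall gain: {recall_gain_points} points")
print(f"speedup: {speedup:.2f}x, tokens: {token_ratio:.2f}x, cost: {cost_ratio:.1f}x")
```

The unrounded ratios come out at about 4.6× for time and about 13× for cost, so the quoted 4× and 10× figures are conservative roundings of the tabulated values.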

Active Fuzzing Campaigns

Using KITTEN + DIE corpus seeds over 48 hours against V8, SpiderMonkey, JavaScriptCore, and GraalJS, SmartOracle's triage surfaced eight unique, previously unreported specification-level issues (four subsequently confirmed and/or fixed). The system demonstrated efficacy in surfacing parsing ambiguities, TypeError enforcement flaws, and undocumented behaviors (Srinivasan et al., 21 Jan 2026).

5. Generalization Potential and Transferability

SmartOracle's agentic orchestration, which tightly integrates specification-grounded analysis with dynamic multi-engine executions, provides a generalizable workflow. Potential transfer targets include:

  • Other dynamic language runtimes (Python, Ruby, PHP, WebAssembly)
  • Compiler toolchains (C/C++, Rust, where undefined behavior complicates rule-based oracles)
  • Database systems, network protocol stacks, cryptographic libraries—essentially any domain where differential fuzz testing's scalability is bottlenecked by manual oracle construction (Srinivasan et al., 21 Jan 2026).

6. Quantitative Performance Metrics

  • Recall: $\mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$
  • False Positive Rate: $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$
  • 4× triage speedup and 10× API cost reduction compared to sequential large-model CoT baselines.
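As a worked instance of the two definitions above, the following sketch computes recall and false positive rate from confusion-matrix counts. The counts are illustrative placeholders, not results from the evaluation, and the function names are introduced here.

```python
def recall(tp: int, fn: int) -> float:
    """recall = TP / (TP + FN): share of true bugs that the oracle reports."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): share of benign divergences that the oracle reports."""
    return fp / (fp + tn)


# Illustrative counts only (assumed for this example).
tp, fn = 8, 2   # 10 true bugs, 8 reported
fp, tn = 1, 9   # 10 benign divergences, 1 wrongly reported

print(recall(tp, fn), false_positive_rate(fp, tn))
```

Recall is measured over the known bugs and FPR over the known non-bugs, which is why the two metrics are evaluated on separate labeled sets.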

These results are achieved through decomposition of reasoning into specialized, less expensive LLM invocations using smaller "Flash" models, synthesized by the orchestrator for final verdicts.

7. Implications, Lessons, and Future Directions

SmartOracle demonstrates that decomposing manual triage into focused, specification-grounded, tool-augmented agentic sub-tasks can substantially mitigate noise and manual effort in high-noise differential oracle settings. By displacing brittle hand-coded logic with orchestrated LLM sub-agents, SmartOracle achieves high recall, reduced false positives, and dramatic improvements in time and cost efficiency—setting a foundation for scalable, maintainable oracle construction across diverse, complex software ecosystems. The agentic orchestration model suggests a research trajectory for applying this workflow to other specification-rich, rapidly evolving domains where differential fuzzing is constrained by oracle scalability (Srinivasan et al., 21 Jan 2026).

References (1)
