
PoCEvaluator Protocol for DeFi Exploits

Updated 4 February 2026
  • PoCEvaluator Protocol is an independent evaluation framework for verifying the correctness and quality of PoC tests reproducing DeFi exploits on forked EVM chains.
  • It employs multi-agent checks and an aggregator-guided negotiation to enforce deterministic execution, semantic validation, and audit-ready verdicts.
  • Empirical metrics, including reproduction rates and code quality measures, drive continuous refinement and robust postmortem analyses.

PoCEvaluator is an agentic, independent evaluation protocol designed to assess the correctness and quality of executable Proof-of-Concept (PoC) tests for decentralized finance (DeFi) exploits, as described in the context of the TxRay postmortem system for ACT (Anyone-Can-Take) exploits. After a candidate PoC—typically a Foundry test that replays a blockchain exploit on a forked EVM chain—is synthesized, PoCEvaluator executes a series of automated, multi-agent checks to ensure the PoC is not only functionally correct but also adheres to rigorous quality criteria, with verdicts that can drive further PoC refinement. The protocol explicitly addresses reproducibility, semantic validation, code artifact hygiene, and auditability on incident-specific, forked EVM states (Wang et al., 1 Feb 2026).

1. Conceptual Overview and Objectives

PoCEvaluator exists to enforce rigorous standards for validating and auditing PoCs that reproduce DeFi exploits. Its primary goals are threefold: (i) independently verify that PoCs deterministically reproduce the underlying exploit via semantic oracles on a live chain fork, (ii) statically and dynamically inspect code for quality features such as self-containment and absence of hard-coded attacker artifacts, and (iii) provide adjudicated, auditable verdicts that mitigate individual agent bias or hallucination. As situated within the TxRay system, PoCEvaluator operates as an agentic test-and-review harness. Its verdicts are critical: they determine pass/fail outcomes and supply structured feedback to downstream refinement loops.

2. Architecture: Agents, Toolchain, and Execution Environment

The PoCEvaluator protocol leverages agentic redundancy and execution isolation to maximize rigor:

  • Sub-evaluator Agents (N=3 by default): Each is an LLM-backed agent instantiated in an isolated environment with access to (a) the full PoC code directory (including test/Exploit.sol and foundry.toml), (b) a local Foundry toolchain (forge test), and (c) on-chain data via Etherscan v2, QuickNode RPC endpoints (archive/debug_traceTransaction, eth_call, storage reads), and the cast suite for targeted state queries.
  • Aggregator Agent: An LLM that collects individual agents' structured (JSON) reports, identifies metrics with disagreement, and orchestrates focused, multi-round negotiation by prompting only disagreeing evaluators to reconsider their verdicts with contextualized evidence (forge logs, code snippets, and API results).
  • Sandboxed Execution Environment: Each agent assesses candidate PoCs against a pinned EVM chain fork at the incident's block height. All assessments operate on this live on-chain fork to eliminate model hallucination and guarantee that oracle checks inspect authentic state-delta effects.

This ensemble is purpose-built to reduce evaluator bias, eliminate single points of failure, and ensure each metric draws from authoritative, tool-mediated evidence.
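The ensemble structure above can be sketched as a small set of data types. This is a hypothetical illustration only: the class and field names (SubEvaluator, Ensemble, build_ensemble) are assumptions, not identifiers from the TxRay codebase.

```python
from dataclasses import dataclass, field

@dataclass
class SubEvaluator:
    """An isolated, LLM-backed evaluator with its own toolchain access."""
    agent_id: int
    poc_dir: str     # full PoC code directory (test/Exploit.sol, foundry.toml)
    rpc_url: str     # archive RPC endpoint for eth_call / debug traces
    fork_block: int  # incident block height the chain fork is pinned to

@dataclass
class Ensemble:
    evaluators: list = field(default_factory=list)

def build_ensemble(poc_dir, rpc_url, fork_block, n=3):
    """Instantiate N independent sub-evaluators (N=3 is the protocol default)."""
    return Ensemble([SubEvaluator(i, poc_dir, rpc_url, fork_block)
                     for i in range(n)])
```

Each sub-evaluator would then run the Foundry toolchain inside its own sandbox against the same pinned fork, so that disagreements between agents reflect judgment differences rather than environment drift.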

3. Evaluation Dimensions: Correctness Oracles and Quality Metrics

PoCEvaluator applies a dual-axis assessment: correctness and quality.

  • Correctness Oracles:
    • C1: The PoC must compile under Foundry.
    • C2: The PoC must execute on the forked chain without unexpected reverts and must satisfy all incident-specific semantic assertions (semantic oracles), corresponding to exact exploit success predicates such as invariant violations or profit thresholds.
    • C3: The PoC must demonstrably execute on a deterministic fork of live on-chain state, eschewing pure local mocks.
  • Quality Metrics:
    • Q1–Q6: These span self-containment, code readability, elimination of attacker artifacts (addresses or transaction hashes), avoidance of exploit-specific hard-coded constants, and presence of explicit success predicates. These criteria are statically and dynamically evaluated through LLM-driven analysis and targeted on-chain queries (e.g., using Etherscan APIs to check address provenance).

Correctness oracles function as hard gating criteria for exploit reproduction, while quality metrics foster best practices in PoC engineering and ensure outputs serve as robust postmortem artifacts.
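The per-metric verdicts above can be pictured as a structured report, with the correctness oracles C1–C3 acting as a hard gate. The field names below are assumptions for illustration, not the published report format:

```python
import json

# Metric IDs follow the text above; the JSON shape is a hypothetical sketch.
METRICS = ["C1", "C2", "C3", "Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]

def make_report(agent_id, verdicts):
    """verdicts: {metric: (bool, reason)} -> structured JSON report string."""
    assert set(verdicts) == set(METRICS), "every metric needs a verdict"
    return json.dumps({
        "agent": agent_id,
        "metrics": {m: {"pass": ok, "reason": why}
                    for m, (ok, why) in verdicts.items()},
        # Correctness oracles gate hard: all of C1-C3 must pass.
        "reproduced": all(verdicts[m][0] for m in ("C1", "C2", "C3")),
    }, indent=2)
```

A quality failure (say Q3, a hard-coded exploit constant) would leave `reproduced` true while still flagging the artifact for refinement, matching the gating/quality split described above.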

4. Execution Flow: Turn-by-Turn Evaluation and Multi-Agent Negotiation

The PoCEvaluator workflow is a two-stage pipeline:

  • Stage i: Independent Agentic Evaluation

    1. The orchestrator provides the PoC directory and semantic oracle definitions to each sub-evaluator agent.
    2. Each agent compiles the PoC (forge test), monitors execution, tracks pass/fail status for all oracles, confirms fork targeting, and applies static analysis for quality heuristics.
    3. Each agent emits a structured per-metric (C1–C3, Q1–Q6) JSON report, providing a Boolean outcome and a concise reason for each metric.
  • Stage ii: Aggregator-Guided Multi-Round Negotiation

    1. The aggregator consolidates initial reports and identifies any metric-level disagreements.
    2. For each point of contention, only the disagreeing agent(s) receive contextual evidence and are prompted to "Maintain" or "Change" their evaluation.
    3. This negotiation process iterates until all metric verdicts are unanimous or a fixed negotiation budget (e.g., five rounds) is exhausted.
    4. The outcome is a consolidated JSON verdict log, including a lineage of negotiation rounds for audit traceability.

This process ensures robustness by requiring explicit agent consensus and providing a granular, explorable record of evaluation rationale.
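The two-stage flow above can be sketched as a negotiation loop. This is a minimal sketch under stated assumptions: verdicts are reduced to Booleans, a majority heuristic decides which agents count as dissenters, and the `reconsider` callback stands in for the LLM re-prompt with contextual evidence; none of these names come from the source.

```python
def disagreements(reports):
    """Metrics on which the ensemble's Boolean verdicts are not unanimous."""
    return [m for m in reports[0] if len({r[m] for r in reports}) > 1]

def negotiate(reports, reconsider, max_rounds=5):
    """Aggregator loop: re-prompt only disagreeing agents until unanimity
    or the round budget (five by default) is exhausted.

    reports:    list of {metric: bool}, one per agent
    reconsider: callback(agent_idx, metric, evidence) -> bool
    """
    log = []
    for rnd in range(max_rounds):
        contested = disagreements(reports)
        if not contested:
            break
        for m in contested:
            majority = sum(r[m] for r in reports) > len(reports) / 2
            for i, r in enumerate(reports):
                if r[m] != majority:  # only dissenters are re-prompted
                    r[m] = reconsider(i, m, evidence=f"forge logs for {m}")
        log.append({"round": rnd, "contested": contested})
    return reports, log
```

The returned `log` mirrors the lineage of negotiation rounds that the protocol retains for audit traceability.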

5. Formal Metrics, Reproducibility, and Auditing

Empirical evaluation is based on the following variables:

  • $N_{\text{total}}$: Total number of evaluated PoCs/incidents.
  • $n_{\text{repro}}$: Number of PoCs passing all correctness oracles (C1 ∧ C2 ∧ C3).
  • $n_{\text{avoid\_addr}}$: Number passing absence of real attacker addresses (Q2).
  • $n_{\text{avoid\_hcode}}$: Number passing avoidance of exploit-specific hard-coded constants (Q3).
  • $coverage_{\text{baseline}}$: Incidents processed by a comparative baseline tool.
  • $coverage_{\text{txray}}$: Incidents processed by TxRay.

Key rates:

  • Reproduction Rate: $R_{\text{rep}} = n_{\text{repro}} / N_{\text{total}}$
  • Address-Hardcoding Avoidance Rate: $R_{\text{addr}} = n_{\text{avoid\_addr}} / N_{\text{total}}$
  • Hard-Code Avoidance Rate: $R_{\text{hcode}} = n_{\text{avoid\_hcode}} / N_{\text{total}}$
  • Coverage Improvement: $\Delta_{\text{cov}} = (coverage_{\text{txray}} - coverage_{\text{baseline}}) / coverage_{\text{baseline}}$
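As a quick arithmetic check, the rates above can be computed directly. Only $N_{\text{total}} = 114$ and $n_{\text{repro}} = 105$ in the example call come from the reported dataset; the remaining counts are placeholders.

```python
def rates(n_total, n_repro, n_avoid_addr, n_avoid_hcode,
          cov_txray, cov_baseline):
    """Compute the evaluation rates defined above (values as fractions)."""
    return {
        "R_rep": n_repro / n_total,
        "R_addr": n_avoid_addr / n_total,
        "R_hcode": n_avoid_hcode / n_total,
        "delta_cov": (cov_txray - cov_baseline) / cov_baseline,
    }

# Reproduction rate on the reported dataset (105 of 114 incidents):
print(round(rates(114, 105, 103, 100, 90, 60)["R_rep"] * 100, 2))  # → 92.11
```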

Empirically, on a dataset of 114 ACT incidents from DeFiHackLabs, PoCEvaluator in the TxRay pipeline achieved $R_{\text{rep}} = 92.11\%$ (105/114), $R_{\text{addr}} = 98.1\%$ (103/105), and substantial improvement in avoidance of hard-coded constants and explicit success predicates over baselines. Each sub-evaluator's test evaluation median was 1.2s, with overall wall-clock time for a full evaluation under a minute.

6. Correctness Guarantees, Limitations, and Auditing

PoCEvaluator's multi-agent negotiation reduces the risk of false positives and negatives inherent to single-pass or single-agent evaluation. By executing PoCs on real on-chain forks and validating semantic oracles, the system enforces correspondence between the PoC and actual exploit mechanics, not merely superficial signals such as clean runtime completion. The final consolidated JSON, including negotiation history, creates an auditable corpus for manual or programmatic review. This pipeline matched expert human assessments with 100% concordance on a stratified 20-sample subset.

Limitations include dependency on EVM chains with robust fork/tracing infrastructure. Exploits involving off-chain socio-technical components cannot be validated by PoCEvaluator. Toolchain bottlenecks (notably RPC/explorer calls) are a latency constraint, though caching and batching offer optimization opportunities. There exists some risk that such automations could be misused by attackers to refine exploits more rapidly.

7. Context, Empirical Performance, and Integration in Postmortem Systems

PoCEvaluator represents a key innovation in the automation of DeFi exploit postmortems, providing measurable improvements in PoC quality and reproducibility over prior art such as DeFiHackLabs, STING, and APE. In the TxRay deployment, PoCEvaluator delivered validated root causes in a median of 40 minutes and PoCs in 59 minutes, demonstrating the protocol's practical viability at scale. Its semantic oracle-based approach not only improves response accuracy for incident triage but also broadens coverage and enables robust imitation harnesses for security research and post-incident auditing (Wang et al., 1 Feb 2026).
