PoCEvaluator: Agentic PoC Assessment
- PoCEvaluator is an independent, agentic framework that validates executable blockchain PoCs using multi-agent negotiation and consensus protocols.
- It integrates functional validation with quality assessment through multi-round sub-agent evaluations, ensuring deterministic reproducibility and rigorous forensic analysis.
- The framework is extensible to domains like logical reasoning and multiple-choice QA, demonstrating its versatility in structured, transparent evaluation.
PoCEvaluator is an independent, agentic execution-and-review framework for automated, high-fidelity assessment of executable Proof-of-Concepts (PoCs) in the context of blockchain exploit forensics, with known extensions to multiple-choice question answering evaluation. Serving principally within the TxRay postmortem pipeline for Decentralized Finance (DeFi) incident analysis, PoCEvaluator delivers deterministic, reproducible validation of PoCs, simultaneously enforcing correctness on real blockchain forks and formalizing code quality via structured metrics. The system is architected as a multi-agent negotiation protocol, with autonomous sub-evaluators and a consensus-forming aggregator, and is extensible to domains requiring robust, transparent evaluation traces, as demonstrated in logical reasoning tasks using the Process of Elimination (PoE) methodology (Wang et al., 1 Feb 2026, Ma et al., 2023).
1. System Overview and Core Objectives
PoCEvaluator operates as a standalone, LLM-agentic execution-and-review mechanism designed for two tasks: (a) functional validation—compilation and exploit reproduction on a forked mainnet at a precise block height using PoC artifacts; and (b) quality assessment—enforcing self-containment, avoidance of hard-coded attack-specific artifacts, and explicit assertion coverage within PoC codebases. Downstream of PoC synthesis (notably by TxRay), PoCEvaluator provides feedback for reject-and-refine cycles, ensuring convergence towards high-fidelity, reproducible forensic artifacts certified against explicit, incident-derived semantic oracles (Wang et al., 1 Feb 2026).
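The reject-and-refine cycle described above can be sketched as a simple loop. The names `synthesize_poc` and `evaluate_poc` below are hypothetical stand-ins for the TxRay synthesizer and the PoCEvaluator verdict interface, and the verdict schema is illustrative:

```python
# Sketch of the reject-and-refine feedback loop around PoCEvaluator.
# `synthesize_poc` and `evaluate_poc` are hypothetical stand-ins; the
# verdict dict shape ({"accepted": bool, "failures": [...]}) is assumed.

def refine_until_accepted(incident, synthesize_poc, evaluate_poc, max_iters=5):
    """Resynthesize a PoC until the evaluator accepts it or the budget runs out."""
    feedback = None
    for _ in range(max_iters):
        poc = synthesize_poc(incident, feedback)
        verdict = evaluate_poc(poc)
        if verdict["accepted"]:
            return poc, verdict
        feedback = verdict["failures"]  # failed metrics drive the next attempt
    return None, verdict
```

The evaluator's failure list is fed back verbatim as synthesis feedback, which is the convergence mechanism the pipeline relies on.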
2. Architecture and Agent Negotiation Protocol
PoCEvaluator is partitioned into two principal stages, each agent-driven:
Stage (i): Agentic Independent Evaluation
A set of N sub-evaluator agents (typically N = 3, the default in the pseudocode below) independently reviews the candidate PoC directory. Each agent executes a full workflow: compilation (`forge build`), execution against forked RPCs (`forge test --fork-url` at the incident block), static inspection of the codebase, and heuristic/rule-based annotation capture. Each agent emits a JSON-structured report logging pass/fail outcomes per metric, with associated rationales and detected anomalies.
Stage (ii): Multi-round Consensus via Aggregator
The aggregator agent ingests all sub-evaluator reports. For any metric with a divergence in judgments, the aggregator triggers round-robin prompts focused on the points of disagreement. Sub-evaluators may revisit their assessments against aggregated evidence. The negotiation proceeds until either all metrics converge or a maximum round-depth is reached, culminating in a consolidated decision trace. This design ensures reproducibility, enhanced rigor, and resistance to single-agent failure or hallucination (Wang et al., 1 Feb 2026).
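A minimal sketch of the aggregator's disagreement scan, assuming each sub-evaluator report is a flat dict from metric IDs (C1–C3, Q1–Q6) to pass/fail verdicts; this report schema is an assumption, not the paper's exact format:

```python
# Disagreement detection across sub-evaluator reports (schema assumed:
# dict mapping metric ID -> "pass"/"fail").

def find_disagreements(reports):
    """Return metric IDs on which the sub-evaluators' verdicts diverge."""
    metrics = reports[0].keys()
    return sorted(m for m in metrics if len({r[m] for r in reports}) > 1)

reports = [
    {"C1": "pass", "Q3": "fail", "Q4": "pass"},
    {"C1": "pass", "Q3": "pass", "Q4": "pass"},
    {"C1": "pass", "Q3": "fail", "Q4": "pass"},
]
```

Only the diverging metrics (here Q3) would trigger round-robin re-prompts, which keeps negotiation cost proportional to disagreement rather than to rubric size.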
3. Execution Workflow and Pseudocode Outline
The execution loop orchestrates independent review and systematic negotiation. The workflow can be summarized in the following pseudocode (abridged for conciseness):
```python
def PoCEvaluator(poc_dir, N=3):
    # Stage (i): Independent Evaluation
    sub_reports = []
    for i in range(N):
        report_i = SubEvaluator(poc_dir)
        sub_reports.append(report_i)

    # Stage (ii): Multi-round Negotiation
    consensus = False
    rounds = 0
    while not consensus and rounds < MAX_ROUNDS:
        aggregated = AggregateReports(sub_reports)
        disagreements = FindDisagreements(aggregated)
        if not disagreements:
            consensus = True
        else:
            for metric in disagreements:
                for j in indices(metric):
                    sub_reports[j] = PromptReevaluation(
                        sub_reports[j], metric, aggregated.evidence
                    )
        rounds += 1
    return consolidated_final_report(sub_reports)
```
Each sub-evaluator proceeds through:
- Compilation: `forge build` (C1).
- On-chain test execution at the incident block: `forge test --fork-url` (C2, C3).
- Static inspection for attacker artifacts and hard-coded parameters (Q1–Q3).
- AST/regex search for explicit assertions (Q4).
- Heuristic discovery of comments and functional address labeling (Q5–Q6).

Outputs are recorded as structured JSON per metric per round (Wang et al., 1 Feb 2026).
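The Q4 assertion search can be approximated with a regex pass over the Solidity source; the paper also uses AST-level search, which this sketch omits:

```python
import re

# Heuristic Q4 check: scan Solidity PoC source for explicit assertions.
# A regex-only sketch; it covers bare assert/require plus common Foundry
# assertion helpers, but does not replicate AST-level analysis.

ASSERTION_RE = re.compile(r"\b(assert|require|assertEq|assertGe|assertTrue)\s*\(")

def has_explicit_assertions(source: str) -> bool:
    """True if the PoC contains at least one explicit assertion call."""
    return bool(ASSERTION_RE.search(source))
```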
4. Metric Taxonomy and Formal Evaluation Criteria
PoCEvaluator operationalizes a formal, six-dimensional metric rubric comprising both correctness and quality axes:
| Metric | Description | Pass Criterion |
|---|---|---|
| C1 | Compilation under Foundry | forge build exit code 0 |
| C2 | Execution and oracle satisfaction on fork | All forge test --fork-url assertions pass |
| C3 | On-chain execution at exact incident block | Tests run on a real forked RPC, correct block height |
| Q1 | Avoidance of attacker/incident artifacts | No imports/interactions with attacker/helper contracts |
| Q2 | Decoupling from real attacker addresses | Use of makeAddr(...) or deterministic test EOAs |
| Q3 | Hard-coding avoidance | Absence of attacker-supplied calldata, magic numbers, or incident parameters |
| Q4 | Assertion coverage | Presence of explicit assert or require statements for success predicates |
| Q5 | Commentary | Code comments annotating non-obvious calls/parameters |
| Q6 | Address labeling | Explicit declaration of address roles (e.g., Attacker, Victim) |
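The rubric can be encoded as a small data structure. The acceptance policy below (C1–C3 mandatory, Q1–Q6 averaged into a quality score) is an illustrative reading of the table, not a rule stated in the source:

```python
# Six-plus-three metric rubric as data. Acceptance policy is assumed:
# correctness metrics (C1-C3) are mandatory, quality metrics (Q1-Q6)
# are averaged into a score.

CORRECTNESS = ("C1", "C2", "C3")
QUALITY = ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6")

def summarize(verdicts):
    """verdicts: dict metric -> bool. Returns (accepted, quality_score)."""
    accepted = all(verdicts[m] for m in CORRECTNESS)
    quality = sum(verdicts[m] for m in QUALITY) / len(QUALITY)
    return accepted, quality
```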
Key derived formulas:
- Reproduction Accuracy: $\mathrm{RA} = \frac{N_{\text{reproduced}}}{N_{\text{total}}} \times 100\%$ (e.g., $105/114 = 92.11\%$).
- Hard-Coding Avoidance Ratio: $\mathrm{HAR} = \frac{N_{\text{Q3 pass}}}{N_{\text{evaluated}}}$.
- Generalized Coverage Gain: $\Delta_{\text{cov}} = \mathrm{Cov}_{\text{pipeline}} - \mathrm{Cov}_{\text{baseline}}$, reported in percentage points.
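A worked check of these derived metrics against the figures reported in Section 6 (114 incidents, 105 reproduced; STING and APE baseline coverage of 84.4% and 34.5%):

```python
# Derived-metric arithmetic, using the figures quoted in Section 6.

def reproduction_accuracy(n_reproduced: int, n_total: int) -> float:
    """Percentage of incidents whose PoC passes C1-C3 end to end."""
    return 100.0 * n_reproduced / n_total

def coverage_gain(cov_pipeline_pct: float, cov_baseline_pct: float) -> float:
    """Coverage improvement over a baseline, in percentage points."""
    return cov_pipeline_pct - cov_baseline_pct
```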
Each metric is deterministically checked and supported by logs, AST trace, or runtime evaluation against incident-specific semantic oracles (Wang et al., 1 Feb 2026).
5. Agentic Tooling, Oracles, and Code Assertions
Each sub-evaluator agent is equipped with backend tool calls and runtime assertion validation capabilities:
- Build and test: `forge build`, `forge test --fork-url <rpc> --fork-block-number BLOCK`.
- Static and runtime assertion validation: AST parsing, regex search for `assert`/`require`, and evaluation of profit/invariant oracles (e.g., `assert(attacker.balanceAfter - attacker.balanceBefore >= observedProfit)`).
- Oracle definitions are extracted directly from the PoC (`oracle_definition.json`), ensuring semantic alignment between synthesized artifacts and their assessment. Both hard and soft oracles are checked at runtime, delivering factual and semantic coverage of the exploit scenario (Wang et al., 1 Feb 2026).
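A sketch of runtime oracle checking; the JSON schema used here (a `min_profit` field) is an illustrative assumption, since the source specifies only the `oracle_definition.json` file name:

```python
import json

# Runtime profit-oracle check. The oracle schema (name + min_profit,
# in wei) is assumed for illustration only.

def check_profit_oracle(oracle_json: str, balance_before: int, balance_after: int) -> bool:
    """True if the attacker's observed balance delta satisfies the oracle."""
    oracle = json.loads(oracle_json)
    return balance_after - balance_before >= oracle["min_profit"]

oracle = json.dumps({"name": "attacker_profit", "min_profit": 10**18})
```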
6. Performance Benchmarks and Comparative Analysis
In empirical evaluation on 114 DeFiHackLabs incidents, PoCEvaluator, in the context of TxRay-generated PoCs, achieved:
- 92.11% end-to-end reproduction accuracy (105/114 incidents).
- On the aligned 105-incident subset:
- C1–C3: 100% pass
- Q1: 99.05%, Q2: 98.1%, Q3: 50.5%, Q4: 100%, Q5: 88.6%, Q6: 100%.
- Relative metric improvements over DeFiHackLabs baseline:
- Q2 (hard-coded address avoidance): +24.8 pp
- Q3 (hard-coding avoidance): +36.2 pp
- Q4 (assertion coverage): +97.1 pp (from 2.9% to 100%)
- In generalized imitation coverage (32 incidents), the TxRay+PoCEvaluator pipeline covered 100%, compared to STING (84.4%) and APE (34.5%), yielding coverage gains of +15.6 pp and +65.5 pp respectively (Wang et al., 1 Feb 2026).
7. Limitations and Future Development Directions
Several current limitations arise from the reliance on LLM-based agentic reasoning:
- PoCEvaluator’s sub-evaluators and aggregator depend on GPT-style LLMs for negotiation and rubric interpretation; inaccuracies can persist when a model misinterprets code or context.
- The negotiation protocol, while robust, can introduce latency for large PoC directories; batching or segmentation via metric-specific micro-agents offers a prospective latency reduction pathway.
- The present rubric, while comprehensive on self-containment and readability, omits security-directed static analyses (e.g., reentrancy, unchecked external calls); integration with formal program analysis or symbolic execution engines is a future avenue.
- The system’s architecture is modular, supporting the extension of the metric set, and is adaptable to structured code review tasks beyond blockchain incident analysis (Wang et al., 1 Feb 2026).
8. PoCEvaluator Extension to Logical Reasoning: PoE Methodology
In the multiple-choice QA domain, the Process of Elimination (PoE) method has been explored as a PoCEvaluator-style protocol for logical and commonsense reasoning with language models (Ma et al., 2023). PoE operates in two stages, optionwise elimination followed by masked final prediction, mirroring the independent-evaluation and consensus steps of PoCEvaluator. PoE enables:
- Plausibility filtering: Assigns confusability probability to distractors.
- Confidence calibration: Uses score margins and survivor sets for interpretability.
- Transparent reasoning paths: Retains traceable elimination/selection steps amenable to audit.
- Adaptive error diagnosis: Dissects whether errors stem from distractor confusion or misranking among survivors.
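The two-stage PoE flow can be sketched as follows, assuming per-option plausibility scores from a language model are already available; the second, masked prediction pass is approximated here by an argmax over the surviving scores rather than a second LM call:

```python
# Two-stage Process of Elimination sketch. Stage 1 eliminates options
# scoring below the mean plausibility; stage 2 (here simplified to an
# argmax instead of a masked second LM pass) selects among survivors.

def poe_predict(scores):
    """scores: dict option -> float plausibility. Returns (survivors, prediction)."""
    mean = sum(scores.values()) / len(scores)
    survivors = {o: s for o, s in scores.items() if s >= mean}
    prediction = max(survivors, key=survivors.get)
    return sorted(survivors), prediction
```

The survivor set doubles as the interpretability trace: it records which distractors were filtered out before the final choice.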
Empirically, PoE achieves high alignment with MCP baselines and shows substantial gains on logical-deduction tasks, indicating the generality of agentic, checklist-driven evaluation protocols for structured output verification (Ma et al., 2023).
PoCEvaluator thus provides a rigorous, agentic, multi-metric adjudication framework for executable artifact assessment, yielding advances in deterministic reproducibility, artifact self-containment, and incident coverage that are extendable to QA and logical reasoning domains (Wang et al., 1 Feb 2026, Ma et al., 2023).