PoCEvaluator: Agentic PoC Assessment
- PoCEvaluator is an independent, agentic framework that validates executable blockchain PoCs using multi-agent negotiation and consensus protocols.
- It integrates functional validation with quality assessment through multi-round sub-agent evaluations, ensuring deterministic reproducibility and rigorous forensic analysis.
- The framework is extensible to domains like logical reasoning and multiple-choice QA, demonstrating its versatility in structured, transparent evaluation.
PoCEvaluator is an independent, agentic execution-and-review framework for automated, high-fidelity assessment of executable Proof-of-Concepts (PoCs) in the context of blockchain exploit forensics, with known extensions to multiple-choice question answering evaluation. Serving principally within the TxRay postmortem pipeline for Decentralized Finance (DeFi) incident analysis, PoCEvaluator delivers deterministic, reproducible validation of PoCs, simultaneously enforcing correctness on real blockchain forks and formalizing code quality via structured metrics. The system is architected as a multi-agent negotiation protocol, with autonomous sub-evaluators and a consensus-forming aggregator, and is extensible to domains requiring robust, transparent evaluation traces, as demonstrated in logical reasoning tasks using the Process of Elimination (PoE) methodology (Wang et al., 1 Feb 2026, Ma et al., 2023).
1. System Overview and Core Objectives
PoCEvaluator operates as a standalone, LLM-agentic execution-and-review mechanism designed for two tasks: (a) functional validation—compilation and exploit reproduction on a forked mainnet at a precise block height using PoC artifacts; and (b) quality assessment—enforcing self-containment, avoidance of hard-coded attack-specific artifacts, and explicit assertion coverage within PoC codebases. Downstream of PoC synthesis (notably by TxRay), PoCEvaluator provides feedback for reject-and-refine cycles, ensuring convergence towards high-fidelity, reproducible forensic artifacts certified against explicit, incident-derived semantic oracles (Wang et al., 1 Feb 2026).
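The reject-and-refine cycle described above can be sketched as a simple loop. The names `synthesize_poc` and `evaluate_poc` below are hypothetical stand-ins for the TxRay synthesizer and the PoCEvaluator verdict interface, and the verdict schema is illustrative:

```python
# Sketch of the reject-and-refine feedback loop around PoCEvaluator.
# `synthesize_poc` and `evaluate_poc` are hypothetical stand-ins; the
# verdict dict shape ({"accepted": bool, "failures": [...]}) is assumed.

def refine_until_accepted(incident, synthesize_poc, evaluate_poc, max_iters=5):
    """Resynthesize a PoC until the evaluator accepts it or the budget runs out."""
    feedback = None
    for _ in range(max_iters):
        poc = synthesize_poc(incident, feedback)
        verdict = evaluate_poc(poc)
        if verdict["accepted"]:
            return poc, verdict
        feedback = verdict["failures"]  # failed metrics drive the next attempt
    return None, verdict
```

The evaluator's failure list is fed back verbatim as synthesis feedback, which is the convergence mechanism the pipeline relies on.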
2. Architecture and Agent Negotiation Protocol
PoCEvaluator is partitioned into two principal stages, each agent-driven:
Stage (i): Agentic Independent Evaluation
A set of N sub-evaluator agents (typically N = 3, the default in the pseudocode below) independently reviews the candidate PoC directory. Each agent executes a full workflow: compilation (`forge build`), execution against forked RPCs (`forge test --fork-url` at the incident block), static inspection of the codebase, and heuristic/rule-based annotation capture. Each agent emits a JSON-structured report logging pass/fail outcomes per metric, with associated rationales and detected anomalies.
Stage (ii): Multi-round Consensus via Aggregator
The aggregator agent ingests all sub-evaluator reports. For any metric with a divergence in judgments, the aggregator triggers round-robin prompts focused on the points of disagreement. Sub-evaluators may revisit their assessments against aggregated evidence. The negotiation proceeds until either all metrics converge or a maximum round-depth is reached, culminating in a consolidated decision trace. This design ensures reproducibility, enhanced rigor, and resistance to single-agent failure or hallucination (Wang et al., 1 Feb 2026).
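A minimal sketch of the aggregator's disagreement scan, assuming each sub-evaluator report is a flat dict from metric IDs (C1–C3, Q1–Q6) to pass/fail verdicts; this report schema is an assumption, not the paper's exact format:

```python
# Disagreement detection across sub-evaluator reports (schema assumed:
# dict mapping metric ID -> "pass"/"fail").

def find_disagreements(reports):
    """Return metric IDs on which the sub-evaluators' verdicts diverge."""
    metrics = reports[0].keys()
    return sorted(m for m in metrics if len({r[m] for r in reports}) > 1)

reports = [
    {"C1": "pass", "Q3": "fail", "Q4": "pass"},
    {"C1": "pass", "Q3": "pass", "Q4": "pass"},
    {"C1": "pass", "Q3": "fail", "Q4": "pass"},
]
```

Only the diverging metrics (here Q3) would trigger round-robin re-prompts, which keeps negotiation cost proportional to disagreement rather than to rubric size.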
3. Execution Workflow and Pseudocode Outline
The execution loop orchestrates independent review and systematic negotiation. The workflow can be summarized in the following pseudocode (abridged for conciseness):
```python
def PoCEvaluator(poc_dir, N=3):
    # Stage (i): Independent Evaluation
    sub_reports = []
    for i in range(N):
        report_i = SubEvaluator(poc_dir)
        sub_reports.append(report_i)

    # Stage (ii): Multi-round Negotiation
    consensus = False
    rounds = 0
    while not consensus and rounds < MAX_ROUNDS:
        aggregated = AggregateReports(sub_reports)
        disagreements = FindDisagreements(aggregated)
        if not disagreements:
            consensus = True
        else:
            for metric in disagreements:
                for j in indices(metric):
                    sub_reports[j] = PromptReevaluation(
                        sub_reports[j], metric, aggregated.evidence
                    )
        rounds += 1
    return consolidated_final_report(sub_reports)
```
Each sub-evaluator proceeds through:
- Compilation: `forge build` (C1).
- On-chain test execution at the incident block: `forge test --fork-url` (C2, C3).
- Static inspection for attacker artifacts and hard-coded parameters (Q1–Q3).
- AST/regex search for explicit assertions (Q4).
- Heuristic discovery of comments and functional address labeling (Q5–Q6).

Outputs are recorded as structured JSON per metric per round (Wang et al., 1 Feb 2026).
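The Q4 assertion search can be approximated with a regex pass over the Solidity source; the paper also uses AST-level search, which this sketch omits:

```python
import re

# Heuristic Q4 check: scan Solidity PoC source for explicit assertions.
# A regex-only sketch; it covers bare assert/require plus common Foundry
# assertion helpers, but does not replicate AST-level analysis.

ASSERTION_RE = re.compile(r"\b(assert|require|assertEq|assertGe|assertTrue)\s*\(")

def has_explicit_assertions(source: str) -> bool:
    """True if the PoC contains at least one explicit assertion call."""
    return bool(ASSERTION_RE.search(source))
```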
4. Metric Taxonomy and Formal Evaluation Criteria
PoCEvaluator operationalizes a formal, six-dimensional metric rubric comprising both correctness and quality axes:
| Metric | Description | Pass Criterion |
|---|---|---|
| C1 | Compilation under Foundry | forge build exit code 0 |
| C2 | Execution and oracle satisfaction on fork | All forge test --fork-url assertions pass |
| C3 | On-chain execution at exact incident block | Tests run on a real forked RPC, correct block height |
| Q1 | Avoidance of attacker/incident artifacts | No imports/interactions with attacker/helper contracts |
| Q2 | Decoupling from real attacker addresses | Use of makeAddr(...) or deterministic test EOAs |
| Q3 | Hard-coding avoidance | Absence of attacker-supplied calldata, magic numbers, or incident parameters |
| Q4 | Assertion coverage | Presence of explicit assert or require statements for success predicates |
| Q5 | Commentary | Code comments annotating non-obvious calls/parameters |
| Q6 | Address labeling | Explicit declaration of address roles (e.g., Attacker, Victim) |
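The rubric can be encoded as a small data structure. The acceptance policy below (C1–C3 mandatory, Q1–Q6 averaged into a quality score) is an illustrative reading of the table, not a rule stated in the source:

```python
# Six-plus-three metric rubric as data. Acceptance policy is assumed:
# correctness metrics (C1-C3) are mandatory, quality metrics (Q1-Q6)
# are averaged into a score.

CORRECTNESS = ("C1", "C2", "C3")
QUALITY = ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6")

def summarize(verdicts):
    """verdicts: dict metric -> bool. Returns (accepted, quality_score)."""
    accepted = all(verdicts[m] for m in CORRECTNESS)
    quality = sum(verdicts[m] for m in QUALITY) / len(QUALITY)
    return accepted, quality
```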
Key derived formulas:
- Reproduction Accuracy: $\mathrm{RA} = \frac{N_{\text{reproduced}}}{N_{\text{total}}} \times 100\%$ (e.g., $105/114 = 92.11\%$).
- Hard-Coding Avoidance Ratio: $\mathrm{HAR} = \frac{N_{\text{Q3 pass}}}{N_{\text{evaluated}}}$.
- Generalized Coverage Gain: $\Delta_{\text{cov}} = \mathrm{Cov}_{\text{pipeline}} - \mathrm{Cov}_{\text{baseline}}$, reported in percentage points.
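A worked check of these derived metrics against the figures reported in Section 6 (114 incidents, 105 reproduced; STING and APE baseline coverage of 84.4% and 34.5%):

```python
# Derived-metric arithmetic, using the figures quoted in Section 6.

def reproduction_accuracy(n_reproduced: int, n_total: int) -> float:
    """Percentage of incidents whose PoC passes C1-C3 end to end."""
    return 100.0 * n_reproduced / n_total

def coverage_gain(cov_pipeline_pct: float, cov_baseline_pct: float) -> float:
    """Coverage improvement over a baseline, in percentage points."""
    return cov_pipeline_pct - cov_baseline_pct
```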
Each metric is deterministically checked and supported by logs, AST trace, or runtime evaluation against incident-specific semantic oracles (Wang et al., 1 Feb 2026).
5. Agentic Tooling, Oracles, and Code Assertions
Each sub-evaluator agent is equipped with backend tool calls and runtime assertion validation capabilities:
- Build and test: `forge build`, `forge test --fork-url <rpc> --fork-block-number BLOCK`.
- Static and runtime assertion validation: AST parsing, regex search for `assert`/`require`, and evaluation of profit/invariant oracles (e.g., `assert(attacker.balanceAfter - attacker.balanceBefore >= observedProfit)`).
- Oracle definitions are extracted directly from the PoC (`oracle_definition.json`), ensuring semantic alignment between synthesized artifacts and their assessment. Both hard and soft oracles are checked at runtime, delivering factual and semantic coverage of the exploit scenario (Wang et al., 1 Feb 2026).
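A sketch of runtime oracle checking; the JSON schema used here (a `min_profit` field) is an illustrative assumption, since the source specifies only the `oracle_definition.json` file name:

```python
import json

# Runtime profit-oracle check. The oracle schema (name + min_profit,
# in wei) is assumed for illustration only.

def check_profit_oracle(oracle_json: str, balance_before: int, balance_after: int) -> bool:
    """True if the attacker's observed balance delta satisfies the oracle."""
    oracle = json.loads(oracle_json)
    return balance_after - balance_before >= oracle["min_profit"]

oracle = json.dumps({"name": "attacker_profit", "min_profit": 10**18})
```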
6. Performance Benchmarks and Comparative Analysis
In empirical evaluation on 114 DeFiHackLabs incidents, PoCEvaluator, in the context of TxRay-generated PoCs, achieved:
- 92.11% end-to-end reproduction accuracy (105/114 incidents).
- On the aligned 105-incident subset:
- C1–C3: 100% pass
- Q1: 99.05%, Q2: 98.1%, Q3: 50.5%, Q4: 100%, Q5: 88.6%, Q6: 100%.
- Relative metric improvements over DeFiHackLabs baseline:
- Q2 (hard-coded address avoidance): +24.8 pp
- Q3 (hard-coding avoidance): +36.2 pp
- Q4 (assertion coverage): +97.1 pp (from 2.9% to 100%)
- In generalized imitation coverage (32 incidents), the TxRay+PoCEvaluator pipeline covered 100%, compared to STING (84.4%) and APE (34.5%), yielding coverage gains of +15.6 pp and +65.5 pp respectively (Wang et al., 1 Feb 2026).
7. Limitations and Future Development Directions
Several current limitations arise from the reliance on LLM-based agentic reasoning:
- PoCEvaluator’s sub-evaluators and aggregator depend on GPT-style LLMs for negotiation and rubric interpretation; inaccuracies can persist when a model misinterprets code or context.
- The negotiation protocol, while robust, can introduce latency for large PoC directories; batching or segmentation via metric-specific micro-agents offers a prospective latency reduction pathway.
- The present rubric, while comprehensive on self-containment and readability, omits security-directed static analyses (e.g., reentrancy, unchecked external calls); integration with formal program analysis or symbolic execution engines is a future avenue.
- The system’s architecture is modular, supporting the extension of the metric set, and is adaptable to structured code review tasks beyond blockchain incident analysis (Wang et al., 1 Feb 2026).
8. PoCEvaluator Extension to Logical Reasoning: PoE Methodology
In the multiple-choice QA domain, the Process of Elimination (PoE) method has been explored as a PoCEvaluator-style protocol for logical and commonsense reasoning with language models (Ma et al., 2023). PoE operates in two stages, optionwise elimination followed by masked final prediction, mirroring the independent-evaluation and consensus steps of PoCEvaluator. PoE enables:
- Plausibility filtering: Assigns confusability probability to distractors.
- Confidence calibration: Uses score margins and survivor sets for interpretability.
- Transparent reasoning paths: Retains traceable elimination/selection steps amenable to audit.
- Adaptive error diagnosis: Dissects whether errors stem from distractor confusion or misranking among survivors.
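The two-stage PoE flow can be sketched as follows, assuming per-option plausibility scores from a language model are already available; the second, masked prediction pass is approximated here by an argmax over the surviving scores rather than a second LM call:

```python
# Two-stage Process of Elimination sketch. Stage 1 eliminates options
# scoring below the mean plausibility; stage 2 (here simplified to an
# argmax instead of a masked second LM pass) selects among survivors.

def poe_predict(scores):
    """scores: dict option -> float plausibility. Returns (survivors, prediction)."""
    mean = sum(scores.values()) / len(scores)
    survivors = {o: s for o, s in scores.items() if s >= mean}
    prediction = max(survivors, key=survivors.get)
    return sorted(survivors), prediction
```

The survivor set doubles as the interpretability trace: it records which distractors were filtered out before the final choice.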
Empirically, PoE achieves high alignment with MCP baselines and shows substantial gains on logical-deduction tasks, indicating the generality of agentic, checklist-driven evaluation protocols for structured output verification (Ma et al., 2023).
PoCEvaluator thus provides a rigorous, agentic, multi-metric adjudication framework for executable artifact assessment, yielding advances in deterministic reproducibility, artifact self-containment, and incident coverage that are extendable to QA and logical reasoning domains (Wang et al., 1 Feb 2026, Ma et al., 2023).