TxRay: Agentic Postmortem of Live Blockchain Attacks

Published 1 Feb 2026 in cs.CR and cs.AI | (2602.01317v1)

Abstract: Decentralized Finance (DeFi) has turned blockchains into financial infrastructure, allowing anyone to trade, lend, and build protocols without intermediaries, but this openness exposes pools of value controlled by code. Within five years, the DeFi ecosystem has lost over 15.75B USD to reported exploits. Many exploits arise from permissionless opportunities that any participant can trigger using only public state and standard interfaces, which we call Anyone-Can-Take (ACT) opportunities. Despite on-chain transparency, postmortem analysis remains slow and manual: investigations start from limited evidence, sometimes only a single transaction hash, and must reconstruct the exploit lifecycle by recovering related transactions, contract code, and state dependencies. We present TxRay, a LLM agentic postmortem system that uses tool calls to reconstruct live ACT attacks from limited evidence. Starting from one or more seed transactions, TxRay recovers the exploit lifecycle, derives an evidence-backed root cause, and generates a runnable, self-contained Proof of Concept (PoC) that deterministically reproduces the incident. TxRay self-checks postmortems by encoding incident-specific semantic oracles as executable assertions. To evaluate PoC correctness and quality, we develop PoCEvaluator, an independent agentic execution-and-review evaluator. On 114 incidents from DeFiHackLabs, TxRay produces an expert-aligned root cause and an executable PoC for 105 incidents, achieving 92.11% end-to-end reproduction. Under PoCEvaluator, 98.1% of TxRay PoCs avoid hard-coding attacker addresses, a +24.8pp lift over DeFiHackLabs. In a live deployment, TxRay delivers validated root causes in 40 minutes and PoCs in 59 minutes at median latency. TxRay's oracle-validated PoCs enable attack imitation, improving coverage by 15.6% and 65.5% over STING and APE.


Summary

The paper introduces TxRay, an LLM-driven system that automates the reconstruction of live blockchain ACT exploits in DeFi ecosystems.
The methodology employs multi-agent orchestration for evidence collection, root-cause analysis, and deterministic PoC generation, achieving a 92.11% reproduction rate.
The system reduces analysis latency and enhances forensic accuracy, demonstrating up to 24.8pp improvement in self-containment over manual methods.

Agentic Reconstruction of Blockchain ATT&CKs: An Academic Review of "TxRay: Agentic Postmortem of Live Blockchain Attacks"

Introduction

The paper "TxRay: Agentic Postmortem of Live Blockchain Attacks" (2602.01317) introduces TxRay, an agentic LLM-driven postmortem system for reconstructing live blockchain exploits, specifically the "Anyone-Can-Take" (ACT) opportunities prevalent in DeFi ecosystems. TxRay uses tool calls to collect evidence, recover the exploit lifecycle from seed transactions, derive evidence-backed root causes, and synthesize deterministic, self-contained Proofs of Concept (PoCs) that reproduce the exploit. PoC correctness and quality are assessed by PoCEvaluator, an independent agentic execution-and-review evaluator.

Problem Statement and Motivation

Permissionless smart contract ecosystems, such as DeFi protocols, remain vulnerable to ACT exploits—sequences of on-chain actions executable by any participant via public interfaces and state. Despite blockchain transparency, post-incident analysis and reproduction are largely manual, slow, and fragmented due to the absence of standardized, executable, incident-level datasets. TxRay aims to automate and systematize postmortem workflows, bridging the latency and reproducibility gap left by manual efforts and heuristics-dependent community trackers (e.g., DeFiHackLabs, DefiLlama).

TxRay System Design

TxRay orchestrates six specialized subagents across a centralized session workspace. The workflow comprises iterative on-chain evidence collection, root cause analysis, validation via semantic oracles, PoC synthesis, and quality scoring. Each agent operates through tool-call interfaces, supporting modularity and coordinated artifact handoff.

Figure 1: TxRay’s orchestrator coordinates six subagents, collectively implementing evidence collection, reasoning, semantic oracle derivation, PoC generation, and validation.
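The handoff between subagents can be pictured with a minimal sketch. The stage names follow the six roles described in this review; the linear ordering, the dictionary-based workspace, and every stub body are assumptions made for illustration, not the paper's implementation (which is described as iterative).

```python
from typing import Any, Callable, Dict, List

# Hypothetical shared session workspace: a plain dict that stages read and write.
Workspace = Dict[str, Any]


def data_collector(ws: Workspace) -> None:
    # Placeholder for fetching traces, diffs, and contract artifacts.
    ws["evidence"] = f"traces for {ws['seed_tx']}"


def analyzer(ws: Workspace) -> None:
    ws["root_cause"] = "hypothesis grounded in " + ws["evidence"]


def challenger(ws: Workspace) -> None:
    # Placeholder acceptance test for the root-cause hypothesis.
    ws["challenge_passed"] = "root_cause" in ws


def oracle_generator(ws: Workspace) -> None:
    ws["oracles"] = ["profit > 0"]


def reproducer(ws: Workspace) -> None:
    ws["poc"] = "Foundry project (placeholder)"


def validator(ws: Workspace) -> None:
    ws["validated"] = bool(ws.get("poc")) and bool(ws.get("oracles")) and ws.get("challenge_passed") is True


# Assumed ordering: root-cause stages first, then PoC stages.
PIPELINE: List[Callable[[Workspace], None]] = [
    data_collector, analyzer, challenger, oracle_generator, reproducer, validator,
]


def run_pipeline(seed_tx: str) -> Workspace:
    ws: Workspace = {"seed_tx": seed_tx, "log": []}
    for stage in PIPELINE:
        stage(ws)                        # each stage hands artifacts off via the workspace
        ws["log"].append(stage.__name__)
    return ws
```

Keeping all artifacts in one workspace means a later stage never needs to call an earlier one directly, which is one plausible reading of "coordinated artifact handoff".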

Root Cause Analysis

Data Collector: Fetches transaction metadata, traces, diffs, and contract artifacts via explorer and RPC endpoints. Supports both verified and bytecode-only contract cases.
Analyzer: Expands seed transactions into the full exploit lifecycle, localizes mechanism, and emits grounded root-cause explanations referencing code paths and invariant violations.
Challenger: Validates hypotheses for evidence coverage, causal completeness, and actionable, code-level breakpoints.
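As a rough illustration of how collected traces feed lifecycle expansion, the sketch below walks a nested call trace and gathers every address it touches. It assumes the trace has the nested `from`/`to`/`calls` shape produced by Geth's `callTracer`; the function name and the sample addresses are hypothetical, and the paper does not state that TxRay works this way.

```python
from typing import Any, Dict, Set

# Hypothetical 20-byte addresses used only for this example.
SENDER = "0x" + "11" * 20
ROUTER = "0x" + "22" * 20
POOL = "0x" + "33" * 20


def touched_addresses(trace: Dict[str, Any]) -> Set[str]:
    """Collect every caller and callee address in a nested call trace."""
    found: Set[str] = set()
    stack = [trace]
    while stack:
        frame = stack.pop()
        for key in ("from", "to"):
            addr = frame.get(key)       # "to" can be missing, e.g. on a failed CREATE
            if addr:
                found.add(addr.lower())
        stack.extend(frame.get("calls", []))  # leaf frames have no "calls" key
    return found


SAMPLE_TRACE = {
    "type": "CALL",
    "from": SENDER,
    "to": ROUTER,
    "calls": [
        {"type": "CALL", "from": ROUTER, "to": POOL},
        {"type": "STATICCALL", "from": ROUTER, "to": POOL, "calls": []},
    ],
}
```

The resulting address set is the kind of starting point from which related transactions and contract artifacts could then be fetched.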

PoC Generation and Validation

Oracle Generator: Specifies incident-specific semantic predicates (profit/invariant) for validity checks, decoupling reproduction from attacker artifacts.
Reproducer: Creates Foundry projects for replay on forked chain state using fresh EOA roles; enforces parameter-free, attacker-agnostic conditions.
Validator: Executes PoCs and enforces rubric compliance: success predicates, self-containment (no magic numbers, no attacker addresses/contracts), readability, and explicit lifecycle documentation.
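To make the idea of a semantic oracle concrete, here is a minimal Python sketch of profit and ownership predicates evaluated against a replay result. TxRay's PoCs are Foundry projects, so this is an analogy rather than the paper's code; `ReplayResult`, its fields, and the threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ReplayResult:
    """Hypothetical summary of state before and after replaying a PoC."""
    profit_token_before: int
    profit_token_after: int
    owner_before: str
    owner_after: str


Oracle = Callable[[ReplayResult], bool]


def profit_oracle(min_profit: int) -> Oracle:
    """Monetary predicate: the fresh account must gain at least min_profit."""
    def check(result: ReplayResult) -> bool:
        return result.profit_token_after - result.profit_token_before >= min_profit
    return check


def ownership_changed(result: ReplayResult) -> bool:
    """Non-monetary predicate: a privileged role changed hands."""
    return result.owner_before != result.owner_after


def failed_oracles(result: ReplayResult, oracles: Dict[str, Oracle]) -> List[str]:
    """Return the names of all oracles the replay does not satisfy."""
    return [name for name, oracle in oracles.items() if not oracle(result)]
```

Because the predicates refer only to observable state, they hold for any account that replays the sequence, which is what decouples the check from the original attacker's artifacts.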

PoCEvaluator for Agentic Reproduction Assessment

PoC correctness and quality are systematically measured via PoCEvaluator, an agentic multi-round negotiation and execution protocol.

Figure 2: PoCEvaluator decomposes evaluation into independent agentic execution, report aggregation, and consensus-driven metrics negotiation.

Metrics include:

Correctness (C1–C3): Foundry compilation, runtime execution on forked state, and oracle satisfaction.
Quality (Q1–Q6): Self-containment (no attacker-side artifacts/addresses/parameters), explicit assertions, readability/documentation.
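One of the quality criteria, avoiding hard-coded attacker addresses, lends itself to a simple static check. The sketch below is an assumption about how such a check could look, not PoCEvaluator's actual logic (the review describes the evaluator as agentic); the function name and sample inputs are hypothetical.

```python
import re
from typing import Iterable, List

# Matches a 20-byte hex address; the word boundaries skip longer hex strings such as tx hashes.
ADDRESS_RE = re.compile(r"\b0x[0-9a-fA-F]{40}\b")


def hardcoded_attacker_addresses(poc_source: str, attacker_addresses: Iterable[str]) -> List[str]:
    """Return known attacker addresses that appear literally in the PoC source."""
    known = {addr.lower() for addr in attacker_addresses}
    present = {match.lower() for match in ADDRESS_RE.findall(poc_source)}
    return sorted(known & present)
```

An empty result means the PoC source passes this one check; protocol addresses that are not on the attacker list are deliberately ignored.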

Empirical results show a 92.11% end-to-end PoC reproduction rate (on 114 incidents), with TxRay PoCs achieving 98.1% address decoupling, 100% assertion coverage, and up to +24.8pp improvement in self-containment over manual baselines (DeFiHackLabs).
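The headline figures can be sanity-checked with a few lines of arithmetic. The inputs are the numbers quoted above; the DeFiHackLabs baseline is not stated directly and is only implied by subtracting the reported lift.

```python
def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# 105 of 114 incidents reproduced end to end.
reproduction_rate = round(percent(105, 114), 2)

# 98.1% address decoupling with a +24.8pp lift implies this baseline share.
implied_baseline = round(98.1 - 24.8, 1)

print(reproduction_rate)  # 92.11
print(implied_baseline)   # 73.3
```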

Figure 3: PoCEvaluator pass counts highlight consistent coverage gains of TxRay over prior baselines in self-containment and success predicate encoding.

Empirical Results: Latency, Cost, and Component Bottlenecks

TxRay exhibits median times of 29.39 minutes for root-cause diagnosis and 10.65 minutes for PoC generation, with API costs as low as $3.16 per analysis under gpt-5.1 pricing.

Figure 4: Cost and runtime distribution demonstrate the practical feasibility of deploying TxRay in forensic/security operations.

Latency breakdown attributes most delay to reasoning-centric agent steps, with supporting tooling (data collection, validation) contributing minor wall-clock overhead.

Figure 5: Component-level latency outlines reasoning and synthesis as primary sources of pipeline delay.

Figure 6: Downstream tool-call latency distributions validate efficient utilization of RPCs, APIs, and local executions.

Live Pipeline and Ablation Analysis

A prospective deployment over a 62-day period shows TxRay delivering an aligned root cause faster than public human expert responses in 62% of eligible incidents, with a correct PoC generated in all aligned cases at a median latency of 59 minutes. Ablation studies indicate that the explicit modular workflow improves root-cause accuracy and reproducibility rates over monolithic, single-agent approaches.

Generalized Attack Imitation and Dataset Curation

TxRay's oracle-validated, self-contained PoCs facilitate generalized exploit imitation and benchmarking, achieving higher incident coverage than the prior state of the art (STING, APE). TxRay covers all 32 attacks in the historical benchmark window used for comparison, improving coverage by 15.6% over STING and 65.5% over APE.

Implications and Future Directions

TxRay formalizes and automates postmortem reproduction for ACT exploits in permissionless blockchains, drastically narrowing analysis latency and reproducibility gaps. The standardized dataset curation enables downstream tasks: benchmarking IDS, training/fine-tuning forensic LLMs, and automated exploit/fuzzing pipelines. Practical deployment exposes opportunities for further performance optimization (sub-block latency, parallel agent orchestration) and evidence closure in black-box contract scenarios.

Future research should explore:

Extending threat models to non-ACT exploits and probabilistic attacks (private-key, social engineering).
Enhancing reasoning reliability in protocol-level invariants via retrieval-augmented LLM frameworks.
Porting the multi-agent design to alternative LLM stacks and open-source agentic toolchains.
Investigating dual-use concerns around exploit automation.

Conclusion

TxRay presents a comprehensive, agentic, evidence-driven pipeline for automated postmortem reconstruction of live blockchain exploits. Its grounding in executable semantics, oracle validation, and modular artifact workflow establishes a new paradigm for scalable, verifiable DeFi incident forensics and dataset curation. This approach materially advances reproducibility, coverage, and response latency, supporting robust practitioner and research applications within smart contract security.

Figure 7: TxRay pipeline overview: incidents sourced, transaction(s) seeded, agentic root-cause analysis, PoC synthesis, and benchmarking integration.

Explain it Like I'm 14

What is this paper about?

This paper introduces TxRay, a smart assistant that investigates hacks on blockchains (especially DeFi, or decentralized finance) right after they happen. Think of TxRay like a digital crime-scene team: it watches a suspicious on-chain transaction, figures out what went wrong, explains the true cause, and builds a small, safe “replay” of the attack in a test environment to prove it.

The big idea is to make postmortems (after-the-fact explanations) fast, accurate, and reproducible—so people can fix problems quickly and learn from them.

The big questions the paper asks

The authors focus on three simple questions:

Can an AI assistant turn a tiny clue (like one transaction) into a complete, correct explanation of a DeFi attack that anyone can verify?
Can it also build a working “replay” program (called a PoC, short for Proof of Concept) that shows the attack really works, without depending on the original attacker’s accounts or special setup?
Is this fast and reliable enough to help during real, live incidents?

How did the researchers approach it?

TxRay is built around an LLM, a smart program that plans steps, uses tools, and checks its own work. The team designed TxRay to do three big jobs, each with its own specialized helpers:

Collect evidence: Like pulling security camera footage, TxRay gathers public on-chain data about the suspicious transaction—what was called, which contracts were involved, and what changed.
Explain the cause: It then writes a clear, evidence-backed story of how the attack worked and which rule or code path broke. It also expands beyond the first transaction to find other related steps (like approvals or contract deployments) that were part of the attack.
Prove it with a replay: Finally, it writes a small program in a popular testing toolkit (Foundry) that replays the attack on a forked copy of the blockchain at the moment it happened. This replay uses fresh, “clean” addresses instead of any attacker’s real addresses.

To keep itself honest, TxRay adds “semantic oracles”—simple, exact tests in the replay that must be true if the explanation is right. For example:

Did the attacker end up with more of a certain token?
Did a safety rule get broken?
Did ownership or permissions change in a way they shouldn’t?

If these checks fail, TxRay revises its work until they pass, or it rejects the case as not fitting the “anyone-can-take” pattern.
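The revise-until-it-passes behavior described above can be sketched as a bounded loop. This is a simplified illustration: the function names, the feedback string, and the `max_rounds` budget are all assumptions, since this summary does not state how many revisions TxRay allows.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def refine_until_valid(
    propose: Callable[[Optional[str]], T],
    check: Callable[[T], Tuple[bool, str]],
    max_rounds: int = 3,  # hypothetical iteration budget
) -> Optional[T]:
    """Propose a candidate, check it, and feed failures back until it passes."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = propose(feedback)     # the first round has no feedback yet
        passed, feedback = check(candidate)
        if passed:
            return candidate
    return None                           # budget exhausted: reject the case
```

Returning `None` models the rejection path, where a case is set aside rather than reported with an explanation that failed its own checks.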

What is an “anyone-can-take” (ACT) attack?

It’s like a big red button left in public. Anyone can press it without special permissions or secret info. In blockchain terms, it means a sequence of normal, public calls that any user could make to cause profit or break a rule. TxRay focuses on these ACT cases so its replays are public, fair tests—not relying on private keys or insider tricks.

How does TxRay check quality?

The authors built a separate reviewer called PoCEvaluator. It independently runs each replay and grades it for:
- Correctness (does it compile, run, and use a real block-state fork?)
- Quality (is it self-contained, readable, and free of attacker-specific shortcuts like hard-coded attacker addresses?)

What did they find?

In short: TxRay works well, quickly, and cleanly.

Here are the key results, written in everyday terms:

Strong accuracy and reproducibility:
- On 114 real incidents from a public repo (DeFiHackLabs), TxRay produced correct explanations and working replays for 105 cases (about 92% success end-to-end).
- Its replays almost never “cheat”: 98% avoided hard-coding attacker addresses and 99% avoided relying on attacker-deployed helper contracts. This means the replays stand on their own.
Fast during live attacks:
- In real-time tests over two months, TxRay typically delivered a validated root-cause explanation in about 40 minutes (median) and a working replay in about 59 minutes (median).
- When there were public expert responses to compare against, TxRay was faster with the correct cause in 62% of those cases, beating experts by a median of a little over 3 hours.
Better coverage than prior tools for attack imitation:
- On a known comparison period, TxRay uncovered additional opportunities and improved coverage by 15.6% over STING and 65.5% over APE (two earlier systems for automatically imitating attacks).
A cleaner, standardized dataset:
- The team compiled 196 high-quality, executable incidents across 9 chains into one dataset, with consistent labels and replays. This helps researchers test and compare tools fairly.
Practical and cost-conscious:
- The median API costs they estimate per run are low (a few dollars) for both analysis and replay generation.

Why do these results matter?

Speed and clarity reduce panic and guesswork. During an attack, teams need to know exactly what’s happening. Early wrong guesses can mislead defenders and leave other projects exposed.
Self-contained replays provide proof. Anyone can run them to verify the explanation or learn from it.
A standard dataset advances research. With consistent, reusable cases, the community can build better defenses and compare methods fairly.

Why it matters and what could happen next

Faster, safer DeFi: TxRay can help projects identify true root causes quickly, so they can pause, patch, or warn users before similar attacks spread.
Better training and benchmarking: High-quality, executable cases help researchers and tools learn real patterns of failure, leading to stronger detectors and audit tools.
Less misinformation: Clear, evidence-backed postmortems prevent the spread of incorrect “hot takes” during stressful incidents.
Responsible testing: By replaying attacks in a safe, forked environment, teams can verify risks and practice responses without harming real funds.

In short, TxRay turns blockchain incidents into clear, testable lessons. That makes the ecosystem more understandable for people today, and much safer for everyone tomorrow.


Knowledge Gaps

Below is a concise, actionable list of the paper’s unresolved knowledge gaps, limitations, and open questions that could guide future research.

ACT-only scope: How to extend beyond Anyone-Can-Take opportunities to incidents requiring privileged access, private orderflow, admin key compromise, governance capture, or off-chain/social-engineering vectors that are currently excluded.
Non-EVM ecosystems: Feasibility and design changes needed to support non-EVM chains (e.g., Solana, Cosmos SDK, Move-based chains), where tracing, simulation, and tooling differ fundamentally.
Cross-domain/bridging attacks: Systematic handling of cross-chain and cross-rollup exploits (bridge-specific semantics, message finality, delayed/layered state), which the current system model does not explicitly support.
Private orderflow and builder ecosystems: Addressing attacks/opportunities that depend on private mempools, orderflow auctions, or builder-specific features that are not publicly observable under the current system model.
ABI/data incompleteness: Robust semantic decoding when ABIs are missing, incorrect, or partially available; principled fallbacks for calldata/bytecode lifting beyond heuristic decompilation.
Bytecode lifting fidelity: Measuring and improving the accuracy of open-source disassembly/decompilation (e.g., Heimdall) used in place of proprietary tools; quantifying its impact on root-cause precision.
Lifecycle mining robustness: Algorithms to reliably recover full exploit lifecycles from limited seeds, especially when seeds reflect profit extraction or downstream effects rather than vulnerability triggers.
Protocol-intent reasoning: Methods to better map low-level execution to protocol-level invariants and intended behavior (e.g., formal specifications, invariant mining) to reduce misattribution and ACT boundary confusion.
Oracle design completeness: Detecting and preventing oracle under-specification or overfitting to the hypothesized mechanism (e.g., cross-checks against alternative explanations, negative tests, mutation testing of oracles).
Confirmation bias in validation: Independent, non-LLM validation of LLM-generated root-cause reports and oracles to mitigate self-confirmation within the same agentic pipeline.
Evaluator circularity: PoCEvaluator is LLM-based; need human-scale, blinded, and statistically powered validation to establish reliability, calibration, and robustness of the evaluator itself across diverse exploit types.
Inter-rater reliability: Reporting formal agreement (e.g., Cohen’s κ) and adjudication protocols among expert reviewers for ACT identification and root-cause alignment to strengthen validity claims.
Benchmark representativeness: Assessing sampling bias from relying on DeFiHackLabs seeds and a 426-day window; quantifying coverage of the broader incident universe and unseen classes of exploits.
Baseline fairness: Evaluating whether quality metrics (e.g., penalizing reuse of attacker addresses or parameters) favor TxRay’s design choices; complement with baselines optimized for these metrics to ensure fair comparisons.
Ablation studies: Quantifying contributions of each subagent (collector, analyzer, challenger, oracle generator, reproducer, validator) and iteration budgets to overall performance, latency, and cost.
Failure-mode analytics: A taxonomy and measurable remediation strategies for the four identified failure classes (ABI decoding, lifecycle reconstruction, protocol-intent reasoning, ACT boundary confusion) with targeted fixes and re-evaluation.
Scalability and throughput: Performance under concurrent incidents, large multi-tx lifecycles, deep call graphs, and chains with heavy tracing overhead; queueing, caching, and parallelization strategies.
RPC/Explorer dependency risk: Robustness to rate limits, outages, or API changes (QuickNode, Etherscan v2); fallbacks, redundancy, and self-hosted archival infrastructure to reduce centralized dependencies.
Deterministic replay portability: Reproducibility across different node clients, RPC providers, and tracer implementations; documenting divergences and establishing replay conformance tests.
Time-dependent mechanisms: Handling incidents dependent on timing/TWAP updates, sequencer behavior, or block-specific randomness; general methods to reproduce such conditions without brittle hard-coding.
Cost stability and model dependence: Sensitivity of performance and cost to LLM model drift, pricing changes, and availability; strategies for on-prem or open-weight models and hybrid symbolic-ML systems.
Live pipeline recall: Measuring recall and false positives of the live alert pipeline (Twitter/X sourcing) vs. a comprehensive on-chain incident oracle; quantifying missed incidents and alerting bias.
Imitation in-the-wild: Moving from simulated profitable bundles to live, competitive MEV-time imitation with realistic inclusion constraints, private orderflow competition, and builder policies.
Ethical and dual-use risks: Policies for responsible disclosure, embargo timing, and redaction to reduce copycat risk when releasing runnable PoCs and generalized imitation tooling.
Dataset governance: Clear licensing, versioning, provenance, and continuous quality control (e.g., automated replay checks) for the released dataset; guarantees against self-selection bias from TxRay-accepted cases only.
Non-monetary predicate rigor: Formalizing and standardizing non-monetary exploit predicates and their measurement to avoid subjective oracle definitions and improve comparability across incidents.
Economic-grounding checks: Systematic differentiation between protocol-intended value flows (e.g., fee-on-transfer, incentive distributions) and exploit-induced flows to reduce ACT boundary misclassification.
Defense impact metrics: Measuring operational value for defenders (time-to-understanding, loss averted), downstream mitigations (patches, pausings), and how faster, more accurate postmortems change outcomes.
Integration with IDS: End-to-end studies integrating TxRay with real-time detectors, prioritization logic, and human-in-the-loop triage; effects on alert fatigue and response workflows.
Generalization across chains: Empirically validating TxRay on less-indexed EVM chains with weaker explorer support and different gas/fee markets; quantifying portability and failure patterns.
Security of execution environment: Threat modeling the local Foundry-based reproduction environment (e.g., malicious on-chain bytecode); sandboxing and resource isolation to prevent off-chain harm.
Parameter-free guarantee stress tests: Systematic audits to ensure PoCs truly avoid attacker-side artifacts and magic constants across diverse exploit categories, not only on the aligned subset.
Transparency of soft-constraint thresholds: Methods to set, justify, and adapt profit/depletion tolerances in oracles; sensitivity analyses showing robustness to threshold choices.


Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging TxRay’s agentic postmortem pipeline, oracle-validated PoCs, PoCEvaluator, and standardized dataset. Each bullet notes sector alignment, potential tools/products/workflows, and assumptions or dependencies that affect feasibility.

Finance and Web3 Security

Automated incident postmortem and triage in Security Operations Centers (SOC)
- Sector: Finance (DeFi), Cybersecurity
- What: Integrate TxRay with existing IDS/monitoring (e.g., Phalcon, Hypernative) to convert on-chain alerts into evidence-backed root-cause reports and self-contained, executable PoCs within ~40–60 minutes median.
- Tools/Workflows: “Postmortem-as-a-Service” runbooks; Foundry PoC bundle handoffs; JIRA ticket enrichment with oracle-validated tests; SIEM connectors.
- Assumptions/Dependencies: EVM-compatible chains; archive RPC with debug tracing; explorer APIs; Foundry fork execution; availability of public ABIs or robust bytecode tooling; stable canonical chain state.
Rapid cross-chain vulnerability exposure assessment
- Sector: Finance (DeFi risk), Asset management
- What: Use TxRay’s standardized taxonomy and dataset (196 incidents across 9 chains) to map protocol exposures, prioritize patching/pausing, and estimate incident similarity (e.g., Balancer-like precision loss).
- Tools/Workflows: Exposure dashboards; similarity scores to known root causes; governance runbooks; “hot-fix” playbooks for parameter updates.
- Assumptions/Dependencies: Reliable incident labeling and taxonomy; up-to-date dataset snapshots; organizational processes for rapid parameter changes.
Insurance claims validation and underwriting analytics
- Sector: Finance (crypto insurance), Regtech
- What: Validate claims with oracle-validated PoCs; benchmark protocol risk using standardized attack predicates (profit/non-monetary), improving actuarial modeling and pricing.
- Tools/Workflows: Claims adjudication pipelines; underwriting scoring that references reproducible artifacts; portfolio-level risk models leveraging PoCEvaluator quality signals.
- Assumptions/Dependencies: Access to incident artifacts; agreement on standardized exploit predicates; insurer/regulator acceptance of on-chain reproducibility evidence.
Defender MEV imitation for fund rescue and loss mitigation
- Sector: Finance (MEV/ops), Security engineering
- What: Reuse TxRay’s PoCs and oracle constraints to synthesize generalized counter-bundles that imitate and preempt exploit flows (validated coverage gains over APE/STING).
- Tools/Workflows: Bundle synthesis integrated with builder/relay feeds; deterministic assertions guiding counter-bundle composition; on-call “rescue bot” operators.
- Assumptions/Dependencies: Mempool/builder visibility; deterministic replay on forks; transaction inclusion guarantees; ethical and legal review of countermeasures.

Software Engineering and DevOps

CI/CD invariant testing with semantic oracles
- Sector: Software (smart contracts), DevOps
- What: Encode protocol-specific invariants as TxRay-style semantic oracles in Foundry tests; catch ACT opportunity regressions pre-deployment.
- Tools/Workflows: GitHub Actions with forge test; oracle libraries for common DeFi invariants; red-team test suites derived from the dataset.
- Assumptions/Dependencies: Foundry in the build pipeline; developer adoption of invariants; disciplined test maintenance across upgrades.
PoC quality assurance and submission review
- Sector: Cybersecurity, Bug bounty platforms
- What: Use PoCEvaluator to triage external PoCs (audit submissions, bounty reports) for correctness and self-containment (avoid attacker-side artifacts, real attacker addresses, magic constants).
- Tools/Workflows: Multi-agent evaluation gate; standardized checklists (C1–C3, Q1–Q6); automated feedback loops to authors.
- Assumptions/Dependencies: Execution on pinned forks; access to RPC/explorer data; consistency with bounty program standards.

Academia and Education

Reproducible DeFi security lab modules
- Sector: Education, Research
- What: Teach exploit classes with runnable Foundry projects and oracle assertions; guide students through seed-to-lifecycle reconstruction and evidence-based reasoning.
- Tools/Workflows: Course lab bundles; annotated PoCs; Valinity-style walkthrough artifacts; exam exercises on ACT formalization.
- Assumptions/Dependencies: Course infrastructure for Foundry; dataset availability; proper sandboxing and ethical guidance.
Benchmarking IDS and LLM agents on standardized incidents
- Sector: Academia (systems/security), AI
- What: Use the curated dataset to measure detection coverage, postmortem lag, and agentic reasoning quality; compare systems with reproducible artifacts.
- Tools/Workflows: Shared benchmarks; leaderboards; ablation studies on tooling (ABI vs bytecode); replication packages.
- Assumptions/Dependencies: Open access to datasets; stable evaluation protocols; community buy-in.

Policy, Compliance, and Governance

Evidence-backed incident reporting standards
- Sector: Policy, Compliance
- What: Establish regulator-facing reports with root-cause narratives and runnable PoCs; improve transparency and reduce misclassification.
- Tools/Workflows: Standardized JSON schemas; artifact bundles attached to disclosures; oracle predicates for non-monetary harms (e.g., asset freeze).
- Assumptions/Dependencies: Regulator acceptance; secure sharing of artifacts; chain-specific legal context.

End-User and Community Tools

Wallet and dApp warnings powered by postmortem signals
- Sector: Consumer finance (wallets), UX
- What: Surface recent validated incidents and affected contract addresses; warn users before interactions with vulnerable deployments.
- Tools/Workflows: Lightweight risk APIs; in-wallet banners/tooltips; opt-in “vulnerability watchlists.”
- Assumptions/Dependencies: Timely ingestion of TxRay outputs; accurate address resolution (proxy/implementation); low false-positive rates.
Investor due diligence and protocol “risk snapshots”
- Sector: Retail/Institutional crypto investing
- What: Provide pre-deposit checks against standardized incident classes; run oracle tests on forks to sanity-check protocol behavior.
- Tools/Workflows: Risk dashboards; “Run PoC on fork” buttons for analysts; portfolio exposure summaries.
- Assumptions/Dependencies: Access to forks; expertise to interpret outcomes; responsible disclosure coordination.

Long-Term Applications

These applications require further research, scaling, or development (e.g., broader chain coverage, stronger models, policy alignment, or operational maturity).

Autonomic Real-Time Defense and Governance

Closed-loop “detect → explain → defend” systems
- Sector: Cybersecurity, Protocol governance
- What: Chain monitors trigger TxRay-like postmortems; validated oracles auto-propose mitigations (parameter changes, circuit-breakers, pausing), subject to human-in-the-loop approval or emergency guardians.
- Tools/Workflows: Governance bots; safe-guard rails for emergency actions; simulation of proposed changes on forks prior to execution.
- Assumptions/Dependencies: Protocol support for pause/parameter updates; robust false-positive handling; governance legitimacy and fail-safes.
Autonomous counter-bundles and fund rescues
- Sector: MEV infrastructure, Security engineering
- What: Real-time synthesis and submission of rescue bundles with ethical constraints and protocol coordination (e.g., whitehat rescues, “first responder” networks).
- Tools/Workflows: Builder/relay integrations; shared whitehat coordination channels; escrow and return mechanisms.
- Assumptions/Dependencies: Reliable detection, fast search/simulation; legal clarity; community norms preventing perverse incentives.

Cross-Ecosystem Generalization

Extension beyond EVM to non-EVM chains and privacy-preserving L2s
- Sector: Multi-chain infrastructure
- What: Generalize ACT formalization, postmortem pipelines, and oracle tests to Solana, Cosmos SDK chains, Move-based systems, and rollups with private orderflow.
- Tools/Workflows: Cross-chain tracing adapters; chain-specific oracle libraries; abstract execution semantics modules.
- Assumptions/Dependencies: Availability of archival traces and deterministic replay; explorer equivalents; differences in transaction visibility.

Formal Methods and Program Analysis Integration

Oracle-to-formal-spec compilers for verification
- Sector: Software (formal verification), Security
- What: Translate semantic oracles into formal properties (e.g., Coq/Isabelle, SMT) and integrate with analyzers (Slither, Echidna, Certora) to check ACT surfaces pre-deployment.
- Tools/Workflows: Spec generation pipelines; counterexample-guided test synthesis; invariant regression tracking.
- Assumptions/Dependencies: Adequate property expressiveness; automated alignment between code and specs; developer adoption.
Bytecode-first semantics learning for unverified contracts
- Sector: AI for Code, Security
- What: Train domain LLMs on opcode-level traces and decompilation artifacts to improve semantic decoding when ABIs/sources are missing.
- Tools/Workflows: Curated low-level datasets; synthetic labeling via on-chain diffs; multi-agent reasoning over traces.
- Assumptions/Dependencies: High-quality trace corpora; compute budgets; careful evaluation to avoid hallucinations.

Market Infrastructure, Insurance, and Risk

Dynamic insurance pricing and capital requirements based on ACT exposure
- Sector: Finance (insurance), Policy
- What: Use real-time ACT opportunity metrics to adjust premiums/reserves; regulators adopt ACT-based risk tiers for protocol licensing or disclosures.
- Tools/Workflows: Exposure estimators; periodic regulatory filings; “ACT risk scores” APIs.
- Assumptions/Dependencies: Accepted methodologies; avoidance of pro-cyclical behaviors; privacy and fairness considerations.
Industry-wide incident registries with reproducible artifacts
- Sector: Policy, Standards
- What: Neutral consortium stewardship of standardized incident records (root causes, PoCs, oracles), enabling ecosystem-wide benchmarking and transparency.
- Tools/Workflows: Governance for submission and review; long-term archiving; versioning of artifacts.
- Assumptions/Dependencies: Stakeholder buy-in; funding; safe harbor protections for disclosure.

Education and Workforce Development

Large-scale training of security analysts and AI agents
- Sector: Education, AI
- What: Use the dataset and tooling to upskill analysts; fine-tune specialized LLMs for on-chain reasoning, improving postmortem speed and accuracy.
- Tools/Workflows: Bootcamps; simulated incident drills; benchmark suites for agent evaluation.
- Assumptions/Dependencies: Access to high-quality, continually updated data; responsible use policies; alignment with industry needs.

Global Assumptions and Dependencies

Technical: EVM-compatible chains with archive RPC, trace APIs (e.g., debug_traceTransaction), explorer metadata (ABIs/source), deterministic fork execution via Foundry or equivalents; stable canonical chain state; ability to pin forks at specific blocks/hashes.
Operational: Reliable alerting pipelines; secure storage of API credentials; process integration with SOCs/auditors/governance; time-to-decision requirements.
Model and Tooling: LLM availability and cost constraints; controlled agent loops with validation; robust handling of unverified contracts (bytecode-first analysis).
Scope: ACT exploits are permissionless and publicly verifiable; social engineering and phishing are excluded; mempool or builder visibility may be required for real-time imitation/defense.
Policy/Ethics: Clear frameworks for responsible disclosure, rescue operations, and automated mitigations; regulator acceptance of reproducible artifacts; stakeholder coordination to avoid unintended market impacts.
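
The technical assumptions above depend on trace APIs such as debug_traceTransaction. As a minimal sketch, the request body for a prestate trace could be built as follows. The snippet only constructs the JSON-RPC payload and sends nothing over the network. The helper name `build_prestate_trace_request`, the placeholder hash, and the `diffMode` option (a Geth-style tracer setting that a given RPC provider may or may not support) are illustrative assumptions, not details taken from the paper.

```python
import json


def build_prestate_trace_request(tx_hash: str, diff_mode: bool = True, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 body for debug_traceTransaction with prestateTracer."""
    # A transaction hash is 32 bytes: "0x" followed by 64 hex characters.
    if not (tx_hash.startswith("0x") and len(tx_hash) == 66):
        raise ValueError("expected a 32-byte transaction hash in 0x-prefixed hex")
    int(tx_hash, 16)  # raises ValueError if the hash contains non-hex characters
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "debug_traceTransaction",
        # diffMode (assumed to be supported) asks for pre/post state of touched accounts.
        "params": [tx_hash, {"tracer": "prestateTracer", "tracerConfig": {"diffMode": diff_mode}}],
    }


if __name__ == "__main__":
    placeholder_hash = "0x" + "ab" * 32  # hypothetical hash, not a real incident
    print(json.dumps(build_prestate_trace_request(placeholder_hash), indent=2))
```

The resulting dictionary can be posted to an archive RPC endpoint with any HTTP client; whether the endpoint exposes the debug namespace depends on the provider.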


Glossary

ABI (Application Binary Interface): The standardized interface that defines how to encode/decode calls and data between off-chain clients and on-chain contracts. "When contract source code and ABIs are published via public explorers (e.g., Etherscan-style services), we assume they are accurate and publicly visible."
Account-based state machine: A blockchain model where accounts and contracts share global state updated deterministically by transactions. "Public permissionless blockchains in our setting implement a deterministic, account-based state machine."
ACT (Anyone-Can-Take) opportunity: A permissionless, publicly verifiable exploit that any unprivileged actor can reproduce using on-chain data and standard interfaces. "Many exploits arise from permissionless opportunities that any participant can trigger using only public state and standard interfaces, which we call ACT opportunities."
Adversarial interactions: Unintended, harmful cross-protocol behaviors that lead to violations of safety or liveness properties. "DeFi attacks are sequences of on-chain actions that violate protocol safety or liveness properties, e.g., by exploiting logic bugs, misconfigured parameters, or adversarial interactions between protocols."
Arbitrage: A class of MEV opportunities that extract value by exploiting price differences across markets. "Some are classical MEV (arbitrage, liquidations, generalized front-/back-runs)."
Archive RPC endpoints: Specialized RPC services that provide historical state and trace data required for analysis and replay. "Each run uses a session directory containing a .env file with credentials: (i) OPENAI KEY; (ii) ETHERSCAN KEY; and (iii) QUICKNODE KEY for archive RPC endpoints."
Attack lifecycle: The full sequence of actions and transactions composing an exploit, from setup to execution and extraction. "We design TxRay, an agentic LLM-based postmortem system that turns individual ACT incidents into structured artifacts: a root-cause report, a parameter-free executable PoC, and a standardized attack lifecycle."
Back-/front-run: Transaction ordering strategies that exploit visibility of pending transactions to capture value by inserting before or after target transactions. "sequences of transactions that extract value by reordering, inserting, or copying interactions (e.g., arbitrage, liquidations, sandwiches, and generalized back-/front-runs)."
Builder/relay feeds: Public channels where transaction bundles are broadcast before inclusion, observable to adversaries. "observable to the adversary via public infrastructure (e.g., the mempool or builder/relay feeds)"
Bytecode: The low-level compiled representation of smart contracts that the EVM executes; it is the only artifact available for analysis when source code is unverified. "For unverified contracts, we assume only bytecode is available and ground analysis in call interfaces, traces, and state diffs."
Call graph: A representation of contract-to-contract calls and their structure during transaction execution. "including call graphs and per-opcode effects"
Calldata: The input bytes supplied to a contract call that encode function selectors and parameters. "It avoids using the real attacker EOAs, attacker-deployed contract addresses, incident calldata, or attacker-side artifacts."
Canonical chain: The chain history selected by consensus, used as the authoritative reference for state and blocks. "We assume an EVM-compatible blockchain with (i) a canonical chain chosen by the underlying consensus protocol"
Collateralization invariant: A safety property ensuring positions remain adequately collateralized. "a collateralization invariant is broken"
Composability: The ability of contracts and protocols to interoperate and build on each other, enabling complex interactions. "Their composability enables complex cross-protocol interactions"
Consensus protocol: The mechanism by which nodes agree on the canonical chain and block ordering. "a canonical chain chosen by the underlying consensus protocol"
debug_traceTransaction: An RPC method that returns detailed execution traces for a transaction. "public RPC endpoints (e.g., eth_getTransactionReceipt and debug_traceTransaction with prestateTracer)"
Decompilation: Recovering higher-level code structure from bytecode to aid analysis when source is unavailable. "and optional decompilation with tools such as Heimdall~\cite{heimdall_rs}"
Decentralized Finance (DeFi): On-chain financial protocols and products implemented via smart contracts. "DeFi has turned blockchains into financial infrastructure, allowing anyone to trade, lend, and build protocols without intermediaries"
EOA (Externally Owned Account): A user-controlled account identified by a private key, as opposed to a contract account. "The adversary controls one or more unprivileged EOAs, can deploy contracts, and observes the information available under our system model"
EVM (Ethereum Virtual Machine): The execution environment and semantics for smart contracts on Ethereum-like chains. "On EVM-compatible chains, this behavior is specified by the EVM execution semantics"
Execution semantics: The formal rules defining how the EVM processes opcodes and state changes. "this behavior is specified by the EVM execution semantics"
Execution trace: A detailed record of the steps (calls, opcodes, effects) taken during a transaction’s execution. "including transaction receipts, logs, and detailed execution traces."
Flash loan: A loan taken and repaid within a single transaction, often used to manipulate on-chain conditions. "a flash-loan-driven swap sequence manipulates prices"
Forked chain state: A local simulation environment that replicates chain state at a specific block for deterministic replay. "a self-contained Foundry PoC that reproduces the incident on a forked chain state"
Foundry: A smart contract development and testing framework used to build and run PoCs on forked state. "implements a self-contained Foundry project that forks chain state at the incident block"
Generalized attack imitation: Synthesizing semantically similar bundles to replicate or counter observed attacks. "TxRay is the first LLM-based imitation tool and synthesizes profitable ACT MEV bundles that imitate historical attacks."
Governance threshold: A protocol-defined limit whose crossing triggers governance changes or actions. "a governance threshold is crossed"
Heimdall: A bytecode analysis tool used for disassembly/decompilation when contract sources are unavailable. "public disassembly and optional decompilation with tools such as Heimdall~\cite{heimdall_rs}"
IDS (Intrusion-Detection System): Monitoring systems that detect anomalies or attacks on-chain. "First, intrusion-detection systems (IDS) (e.g., protocol-specific detectors, on-chain monitoring platforms, and incident dashboards) signal that an incident is in progress."
Liquidations: Mechanisms where undercollateralized positions are closed by selling collateral to repay debt. "Some are classical MEV (arbitrage, liquidations, generalized front-/back-runs)"
Liveness properties: Protocol guarantees that desired actions eventually happen (e.g., withdrawals remain possible). "DeFi attacks are sequences of on-chain actions that violate protocol safety or liveness properties"
LLM (Large Language Model): A neural language model used here as an agent to analyze and synthesize exploit reconstructions. "We present TxRay, a LLM agentic postmortem system that uses tool calls to reconstruct live ACT attacks"
MEV (Maximal Extractable Value): Value captured by optimizing transaction ordering and inclusion. "value-redistribution opportunities commonly referred to as MEV"
Mempool: The set of pending transactions visible before inclusion in a block. "observable to the adversary via public infrastructure (e.g., the mempool or builder/relay feeds)"
Non-monetary predicate: A publicly checkable exploit effect that isn’t direct profit, such as breaking invariants or freezing assets. "a non-monetary predicate, where there exists a deterministic, publicly checkable safety/liveness predicate $O$ such that $O(\sigma_B,\sigma') = 1$"
Opcode: The low-level EVM instruction executed by the virtual machine. "including call graphs and per-opcode effects"
Oracle (semantic oracle): Executable assertions that encode incident-specific correctness conditions for PoCs. "TxRay self-checks postmortems by encoding incident-specific semantic oracles as executable assertions that the synthesized PoC must satisfy."
Parameter-free: Designed without hard-coded attacker-specific parameters or addresses, enabling self-contained reproduction. "TxRay analyzes the root cause and generates parameter-free PoC tests in Foundry"
Phishing: Off-chain deception used to trick users into sending funds or signing transactions; excluded from ACT. "We exclude phishing and social-engineering attacks (e.g., address poisoning \cite{guan2024characterizing,ye2024interface,tsuchiya2025blockchain})"
Pinned on-chain fork: A deterministic fork set to a specific block height or hash to ensure reproducibility. "Does the PoC run on a pinned on-chain fork (not only local mocks)?"
PoC (Proof of Concept): An executable artifact that reproduces the exploit deterministically, validating the root cause. "and generates a runnable, self-contained PoC that deterministically reproduces the incident."
Postmortem: A reconstruction and analysis of an incident to explain root cause and reproduce effects. "We present TxRay, an LLM-based postmortem system"
Pre-state: The reconstructible on-chain state at a given block from which an opportunity is executed. "let $\sigma_B$ denote the pre-state reconstructible from canonical on-chain data"
Profit predicate: A formal condition declaring that an adversary’s net value increased after fees in a reference asset. "a profit predicate, where the adversary's net portfolio value in a fixed reference asset increases after fees"
Private orderflow: Transactions submitted through private channels not visible to the public mempool until inclusion. "without private keys, privileged orderflow, or off-chain agreements."
Replayable artifacts: The transaction-level steps, invariants, and code needed to reproduce and verify an exploit. "their entries are short summaries and do not include the transaction level steps, invariants, or replayable artifacts needed for verification."
RPC (Remote Procedure Call): The interface used by clients to query chain state, traces, and simulate transactions. "clients via public RPC endpoints and archive or tracing infrastructure"
Root cause analysis: Evidence-backed reasoning that localizes the vulnerability and mechanism behind an incident. "TxRay performs root cause analysis and synthesizes an executable Foundry-based PoC."
Sandwiches: MEV strategies that place transactions before and after a victim’s trade to extract value via price impact. "sandwiches"
State diff: Changes in storage and balances derived from receipts/logs and tracing that show net effects of execution. "balance and storage diffs derived from receipts/logs and state-diff tracing."
Taint analysis: Static/dynamic analysis that tracks data-flow influence to identify attacker-controlled effects. "APE relies on program execution traces and taint analysis."
Transition function (EVM): The formal function mapping pre-state plus transactions to post-state deterministically. "executing $b$ from $\sigma_B$ under the EVM transition function deterministically yields a post-state $\sigma'$"
Unprivileged adversary: An attacker without special permissions or keys who uses only public data/interfaces. "We use “anyone-can-take” (ACT) for permissionless on-chain exploits an unprivileged adversary can realize using public data and standard interfaces."
Verified source code: Contract source published and validated by explorers, improving transparency and analysis. "When contract source code and ABIs are published via public explorers (e.g., Etherscan-style services), we assume they are accurate and publicly visible."
Victim transaction: A legitimate user-initiated on-chain action observed and potentially exploited by an adversary. "when we refer to a “victim transaction” below, we mean an intended on-chain interaction (e.g., a trade, borrow, or liquidation) rather than a transaction induced by off-chain deception (phishing)."
prestateTracer: A specific tracer used with debug_traceTransaction to obtain pre-state-aware execution details. "public RPC endpoints (e.g., eth_getTransactionReceipt and debug_traceTransaction with prestateTracer)"
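
The profit predicate defined in this glossary can be illustrated with a small sketch. It values the adversary's balances before and after execution in a single reference asset and checks that the value increased after fees. The function names, token symbols, fixed price map, and the choice to pass fees as a separate amount are assumptions made for illustration; the paper encodes its oracles as executable assertions in Foundry PoCs, not in Python.

```python
from fractions import Fraction
from typing import Mapping


def portfolio_value(balances: Mapping[str, int], prices: Mapping[str, Fraction]) -> Fraction:
    """Value a token portfolio in the reference asset using fixed, assumed prices."""
    return sum(
        (Fraction(amount) * prices[token] for token, amount in balances.items()),
        Fraction(0),
    )


def profit_predicate(pre, post, prices, fees) -> bool:
    """Return True when the post-state value, net of fees, exceeds the pre-state value."""
    return portfolio_value(post, prices) - Fraction(fees) > portfolio_value(pre, prices)


if __name__ == "__main__":
    # Hypothetical prices and balances, expressed in units of the reference asset.
    prices = {"WETH": Fraction(2000), "USDC": Fraction(1)}
    pre = {"WETH": 1, "USDC": 0}
    post = {"WETH": 1, "USDC": 500}
    print(profit_predicate(pre, post, prices, fees=Fraction(20)))
```

Exact rational arithmetic avoids floating-point rounding when the gain is close to the fee amount. A non-monetary predicate would replace the value comparison with a check on the pre-state and post-state, for example that a collateralization invariant no longer holds.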



