End-to-End LLM Decompilation
- End-to-end LLM decompilation is a neural translation process that converts binary or assembly code into human-readable high-level source code.
- It employs transformer-based models to recover control-flow, variable names, and type information, improving readability and re-executability over traditional decompilers.
- Recent methods integrate structure-augmented pipelines and in-context learning to tackle challenges such as compiler optimizations and semantic gaps.
End-to-end LLM decompilation is the task of directly translating raw binary or disassembled assembly code into high-level source representations, such as C or Solidity, using transformer-based neural models. Unlike traditional decompilers, which rely on deterministic heuristic pipelines, static analyses, or handcrafted pattern matching, LLM-powered decompilation is posed as a conditional generation problem, in which an LLM receives a linearized view of the low-level code and emits human-readable, compilable source. This paradigm is now at the center of a rapidly expanding research field, encompassing architectures, data and preprocessing pipelines, ground-truth recovery, static-dynamic hybridization, and benchmark methodology. The following sections enumerate key methodologies, challenges, tools, pipelines, quantitative findings, and forward-looking limits according to the arXiv literature.
1. Problem Formulation and Core Challenges
End-to-end LLM decompilation is defined as a mapping f : B → S, where B is a binary artifact (bytecode, assembly, or stripped code) and S is the reconstructed high-level equivalent in the target language. Unlike prior tools, end-to-end LLM decompilers use neural text generation without tight hand-engineered constraints (Tan et al., 2024).
Key challenges include:
- Semantic gap: Compilation is a lossy process, erasing control/data-flow cues, variable names, type annotations, and user-defined type definitions.
- Control-flow ambiguity: Direct assembly streams encode branches and loops as address-relative jumps, which are challenging to reverse into structured constructs without additional context (Feng et al., 17 Feb 2025).
- Variable and literal recovery: Data and constant values are frequently encoded outside the code (.rodata), removed, or merged, impeding their accurate inference (Feng et al., 17 Feb 2025).
- Compiler optimization artifacts: Aggressive optimizations (inlining, loop unrolling, coalescing of temporaries) flatten or obliterate high-level constructs, increasing the complexity of semantic recovery (Wang et al., 3 Nov 2025).
- Identifier and type resolution: Determining user-defined types, composite data structures, and function/field naming is highly nontrivial when reconstructing from binary-only views (Dramko et al., 6 Feb 2025, Tan et al., 26 Sep 2025).
2. Model Architectures, Data Pipelines, and Input Processing
A typical end-to-end LLM decompilation pipeline consists of data collection, binary/assembly preprocessing, prompt construction, and neural decoding. Methodologies differ chiefly in how they integrate control/data-flow information, symbolic recovery, context, or staged architectures.
2.1 Raw Translation Approaches
LLM4Decompile exemplifies straight-through, sequence-to-sequence translation models that map linearized assembly tokens (from objdump or similar) directly to C via autoregressive decoding (Tan et al., 2024). The model is trained on pairs of assembly and source at multiple optimization levels, relying on prompt templates that indicate, e.g., the specific optimization flag for each example.
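The prompt construction for such raw-translation models can be sketched as follows; the template wording and the tiny assembly snippet are illustrative assumptions, not the exact published LLM4Decompile template.

```python
# Sketch of raw-translation prompt construction. The template wording and
# the assembly snippet are illustrative assumptions, not the exact
# LLM4Decompile prompt.

def build_prompt(asm_text: str, opt_level: str) -> str:
    """Wrap linearized (objdump-style) assembly in a conditional-generation prompt."""
    return (
        f"# This is the assembly code compiled with {opt_level}:\n"
        f"{asm_text}\n"
        "# What is the source code?\n"
    )

asm = "push rbp\nmov rbp, rsp\nmov eax, edi\nimul eax, esi\npop rbp\nret"
prompt = build_prompt(asm, "-O2")
```

The resulting string would be fed to the autoregressive decoder, which emits the C candidate token by token.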
2.2 Structure-Augmented Methods
Methods such as ReF Decompile inject symbolic information through (a) relabeling (substituting all jump and data addresses with symbolic labels to preserve control-flow relations), and (b) a structured function call interface for explicit data extraction (literal recovery from .rodata guided by on-the-fly tool queries) (Feng et al., 17 Feb 2025).
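The relabeling idea can be sketched in a few lines; the simplified x86 listing and address format below are illustrative assumptions rather than ReF Decompile's actual implementation.

```python
import re

# Minimal sketch of relabeling: replace raw jump-target addresses with
# symbolic labels so control-flow relations survive linearization. The
# simplified "addr: instruction" listing format is an assumption.

def relabel(lines):
    targets = {}
    out = []
    # First pass: collect every address that is the target of a jump.
    for line in lines:
        m = re.search(r"\bj\w+\s+0x([0-9a-f]+)", line)
        if m and m.group(1) not in targets:
            targets[m.group(1)] = f"L{len(targets)}"
    # Second pass: emit a label at each target and rewrite jump operands.
    for line in lines:
        addr = line.split(":")[0].strip()
        if addr in targets:
            out.append(f"{targets[addr]}:")
        for a, lab in targets.items():
            line = line.replace(f"0x{a}", lab)
        out.append(line.split(":", 1)[-1].strip())
    return out

listing = [
    "1000: cmp edi, 0",
    "1003: jle 0x100a",
    "1005: mov eax, 1",
    "100a: ret",
]
result = relabel(listing)
print("\n".join(result))
```

After relabeling, the branch reads `jle L0` and the target carries an `L0:` label, so the model no longer has to resolve address arithmetic to see the loop/branch structure.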
SALT4Decompile constructs a Source-level Abstract Logic Tree (SALT), abstracting the CFG into a rooted, ordered tree to explicitly model structured logic (loops, conditionals); instruction normalization and nested loop detection are core steps before serializing SALT as the prompt for the LLM (Wang et al., 18 Sep 2025).
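A minimal sketch of serializing such a logic tree into the prompt, assuming a simple tuple-based node schema (the actual SALT representation is richer):

```python
# Illustrative serialization of a rooted, ordered logic tree in the spirit
# of SALT. Nodes are (kind, payload) tuples: "block" holds a statement
# string; other kinds hold an ordered list of children. The node schema is
# an assumption for illustration, not the paper's format.

def serialize(node, depth=0):
    pad = "  " * depth
    kind, payload = node
    if kind == "block":
        return pad + payload
    lines = [pad + kind.upper() + " {"]
    lines += [serialize(child, depth + 1) for child in payload]
    lines.append(pad + "}")
    return "\n".join(lines)

tree = ("loop", [
    ("block", "i = i + 1"),
    ("if", [("block", "break")]),
])
print(serialize(tree))
```

The nesting makes loop bodies and conditional arms explicit in the token stream, rather than leaving them implicit in jump offsets.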
2.3 Two-Phase Decoupled Pipelines
SK²Decompile introduces a split architecture: (1) a Structure Recovery model produces an IR capturing control-flow and data layouts with anonymized placeholders, and (2) an Identifier Naming stage that maps IR to source code with semantically meaningful names (Tan et al., 26 Sep 2025). Each model is optimized for structure or readability separately with RL-style objectives, and inference proceeds via explicit IR factorization.
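The factorized inference can be sketched with stubbed stages; in practice each stub would be a trained LLM, and the placeholder scheme below (FUNC_0, VAR_0) plus the hard-coded IR are illustrative assumptions showing only the data flow.

```python
# Stubbed sketch of a two-phase (structure, then naming) factorization.
# Each stage is a trained LLM in practice; these stubs show the hand-off only.

def structure_recovery(asm: str) -> str:
    # Stage 1: recover control flow and data layout with anonymized names.
    return "int FUNC_0(int VAR_0) { return VAR_0 * VAR_0; }"

def identifier_naming(ir: str, names: dict) -> str:
    # Stage 2: substitute semantically meaningful identifiers into the IR.
    for placeholder, name in names.items():
        ir = ir.replace(placeholder, name)
    return ir

ir = structure_recovery("mov eax, edi\nimul eax, edi\nret")
src = identifier_naming(ir, {"FUNC_0": "square", "VAR_0": "x"})
print(src)  # int square(int x) { return x * x; }
```

Separating the stages lets structure be optimized for correctness and naming for readability, as the paper's RL-style objectives do.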
2.4 Joint Code and Type Recovery
Idioms proposes joint sequence generation, where the LLM emits both the reconstructed source code and all referenced user-defined types, ensuring consistency in field, type, and function definitions (Dramko et al., 6 Feb 2025). Neighboring function traces are included in the prompt to address the scattered-evidence problem in type reconstruction.
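A minimal sketch of assembling such a neighbor-context prompt, assuming a hypothetical call-graph dictionary and window size:

```python
# Illustrative neighbor-context window: concatenate the target function's
# callers and callees so scattered type evidence is visible in one prompt.
# The call-graph dict, bodies dict, and window size k are assumptions.

def context_window(target, call_graph, bodies, k=2):
    edges = call_graph.get(target, {})
    neighbors = (edges.get("callers", []) + edges.get("callees", []))[:k]
    parts = [f"// neighbor: {n}\n{bodies[n]}" for n in neighbors]
    parts.append(f"// target: {target}\n{bodies[target]}")
    return "\n\n".join(parts)

graph = {"f": {"callers": ["main"], "callees": ["g"]}}
bodies = {"main": "int main() { return f(2); }",
          "g": "int g(int n) { return n + 1; }",
          "f": "int f(int n) { return g(n) * 2; }"}
prompt = context_window("f", graph, bodies)
```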
2.5 Modular Context-Augmented and Hybrid Models
ICL4Decomp integrates in-context learning with retrieval-based few-shot exemplars or optimization-aware rule descriptors injected into the prompt (Wang et al., 3 Nov 2025). The context may consist of retrieved assembly/source pairs from a large database (retrieval-based ICL), or natural-language summaries of compiler flags and their likely effects (rule-based ICL).
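Retrieval-based ICL can be sketched as follows; the lexical SequenceMatcher similarity and the two-entry database are stand-ins for the dense retrieval over a large exemplar corpus that a real system would use.

```python
from difflib import SequenceMatcher

# Sketch of retrieval-based in-context learning: fetch the most similar
# assembly/source exemplars and prepend them as few-shot context. The tiny
# database and lexical similarity measure are illustrative stand-ins.

DB = [
    ("mov eax, edi\nadd eax, esi\nret", "int add(int a, int b) { return a + b; }"),
    ("mov eax, edi\nimul eax, esi\nret", "int mul(int a, int b) { return a * b; }"),
]

def retrieve(query_asm, k=1):
    # Rank exemplars by similarity of their assembly to the query.
    ranked = sorted(DB, key=lambda p: SequenceMatcher(None, query_asm, p[0]).ratio(),
                    reverse=True)
    return ranked[:k]

def build_icl_prompt(query_asm):
    shots = "".join(f"Assembly:\n{a}\nSource:\n{s}\n\n" for a, s in retrieve(query_asm))
    return shots + f"Assembly:\n{query_asm}\nSource:\n"

prompt = build_icl_prompt("mov eax, edi\nimul eax, esi\nret")
```

Rule-based ICL would instead (or additionally) prepend natural-language descriptions of the suspected compiler flags and their typical effects.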
3. Datasets, Ground-truth Construction, and Evaluation Metrics
Advances in LLM decompilers are inseparable from dataset creation and measurement standards.
3.1 Scale and Quality Benchmarks
- Decompile-Bench delivers 2M binary–source pairs cleaned from 100M function extractions across permissively licensed C/C++ repos, spanning all major optimization levels and deduplicated for maximum ground-truth fidelity (Tan et al., 19 May 2025).
- Realtype focuses on realistic type complexity and nested user-defined data layouts to expand the code–type mapping challenge (Dramko et al., 6 Feb 2025).
- DecompileBench involves 23,400 functions from real OSS-Fuzz projects recompiled into function-level shared objects for robust, runtime-aware validation (Gao et al., 16 May 2025).
3.2 Evaluation and Metrics
Key metrics include:
- Re-executability Rate (RER): Fraction of decompiled programs that can be recompiled and pass all original tests, providing a functional measure: RER = (number of functions that recompile and pass all tests) / (total number of functions).
- Relative Readability Index (R2I): A structural readability score in [0, 1] derived from AST features.
- Edit Similarity: Textual edit similarity between the original source and the decompiled code.
- Coverage Equivalence Rate (CER): Branch-coverage equivalence for runtime-correctness (Gao et al., 16 May 2025).
- Elo-based code quality: Pairwise ranking of decompiler output by LLM-judge with 12 criteria (control flow, naming, memory layout).
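Two of these metrics can be sketched directly; difflib's ratio is a stand-in for whatever token-level edit measure a given paper uses, and re-executability is shown as a simple pass rate over per-function test outcomes (the recompile-and-run harness itself is out of scope here).

```python
from difflib import SequenceMatcher

# Sketch of two metrics: edit similarity (difflib ratio over whitespace
# tokens, a stand-in for per-paper token measures) and re-executability
# (pass rate over per-function recompile-and-test outcomes).

def edit_similarity(src, decompiled):
    return SequenceMatcher(None, src.split(), decompiled.split()).ratio()

def re_executability(passed):
    # Fraction of decompiled functions that recompiled and passed all tests.
    return sum(passed) / len(passed)

sim = edit_similarity("int add(int a, int b) { return a + b; }",
                      "int add(int x, int y) { return x + y; }")
rate = re_executability([True, True, False, True])
print(round(rate, 2))  # 0.75
```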
4. Methodological Innovations and Comparative Results
4.1 Control-flow and Data-flow Augmentation
- Label injection (ReF Decompile, SALT4Decompile) increases clarity and recoverability of structured constructs, boosting re-executability by up to 15 percentage points over plain assembly input (Feng et al., 17 Feb 2025, Wang et al., 18 Sep 2025).
- Logic tree abstraction (SALT) allows the LLM to internalize nested block structure and precise jump relationships, translating to robust performance under obfuscation and complex branching (Wang et al., 18 Sep 2025).
4.2 Type, Struct, and Symbol Naming
- Joint code and type prediction (Idioms) and decoupled identifier recovery (SK²Decompile) empirically outperform flat function-only baselines. SK²Decompile achieves a 69.0% average re-executability on HumanEval (O0–O3) and a 29.4% relative R2I gain over Idioms on GitHub2025 (Tan et al., 26 Sep 2025).
- Neighbor-context windows in Idioms, incorporating interprocedural callers and callees, yield a 63% increase in UDT composition accuracy (Dramko et al., 6 Feb 2025).
4.3 Prompt Engineering and In-Context Learning
- Rule-based guidance, as in ICL4Decomp, injects explicit summaries of optimization flags and their compilation impact, which, combined with semantically matched exemplars, increases re-executability rates by 40% over plain LLM decoding, especially at high optimization levels (O2, O3) (Wang et al., 3 Nov 2025).
4.4 Multi-Stage Postprocessing
SALT4Decompile and others apply a cascade of postprocessing stages:
- Compilation-error repair (feeding error logs into small LLMs for iterative fixing).
- Boundary correction (fixing off-by-one loop errors).
- Symbol renaming and comment insertion based on external LLMs (e.g., Claude-3.5-Sonnet).
These stages further raise executability, readability, and semantic fidelity (Wang et al., 18 Sep 2025).
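The compilation-error repair stage can be sketched as a tool-agnostic loop; both `compile_fn` (e.g., a gcc wrapper returning its error log) and `repair_model` (the small repair LLM) are hypothetical callables injected here for illustration.

```python
# Sketch of iterative compilation-error repair. Both compile_fn and
# repair_model are hypothetical injected callables: compile_fn(code)
# returns None on success or an error-log string; repair_model(code, log)
# returns a revised candidate.

def repair_loop(code, compile_fn, repair_model, max_rounds=3):
    for _ in range(max_rounds):
        errors = compile_fn(code)
        if errors is None:
            return code                    # compiles cleanly: done
        code = repair_model(code, errors)  # one repair round on the log
    return code

# Toy demonstration with stubbed tools: the "compiler" rejects a missing
# semicolon and the "model" appends one.
fake_compile = lambda c: None if c.endswith(";") else "error: expected ';'"
fake_repair = lambda c, e: c + ";"
fixed = repair_loop("return 0", fake_compile, fake_repair)
print(fixed)  # return 0;
```

Bounding the loop keeps cost predictable; in published pipelines the error log, not just the code, is what gives the repair model traction.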
5. Cross-Domain and Specialized Decompilation
LLM decompilers have been extended from native C/C++ binaries to new domains:
- WebAssembly: WaDec and StackSight employ loop-slicing, stack-symbolic analysis, and chain-of-thought prompting to generate high-fidelity C or C++ code from Wasm. WaDec achieves a recompilability rate of 52.11% and re-execution rate of 43.55%, outperforming Ghidra by close to two orders of magnitude (She et al., 2024, Fang et al., 2024).
- Smart Contracts: SmartHalo augments EVM-to-Solidity decompilation by building fine-grained dependency graphs, synthesizing tailored prompts for LLM-based recovery of method boundaries, types, and attributes, verified symbolically (precision: method boundaries 87.39%, variable types 90.39%) (Liao et al., 15 Jan 2025).
6. Limitations, Comparative Tradeoffs, and Future Directions
Current LLM-based end-to-end decompilers, while advancing readability and partial semantic recovery, often trail in functional correctness versus commercial or static tools; for instance, DecompileBench finds LLM-generated code is more understandable for analysts, but with 52.2% lower functional correctness rates than rule-based decompilers (Gao et al., 16 May 2025). Common failures include incorrect type inference, incomplete struct recovery, and logic errors in high-optimization targets (Tan et al., 19 May 2025, Tan et al., 26 Sep 2025).
Major open challenges:
- Scaling to aggressive optimizations and diverse platforms (ARM, embedded ISAs).
- Integrating dynamic analysis, formal verification, or self-corrective feedback.
- Extending multi-function and interprocedural recovery, especially for whole-program binaries (Gao et al., 16 May 2025, Tan et al., 26 Sep 2025).
- Automatic error localization and iterative repair loops via compiler feedback (Wang et al., 18 Sep 2025).
- Closing the semantic gap between neural plausibility and ground-truth logic fidelity (Gao et al., 16 May 2025).
7. Representative Pipelines and Results: Summary Table
| Approach | Control-Flow Recovery | Type/Identifier Recovery | Re-executability (%) | Domain |
|---|---|---|---|---|
| LLM4Decompile (Tan et al., 2024) | None | No | 21.4 (O0–O3 avg) | Native C |
| ReF Decompile (Feng et al., 17 Feb 2025) | Relabeling | Tool-based (rodata) | 61.43 | Native C |
| Idioms (Dramko et al., 6 Feb 2025) | Call graph context | Joint code+types | 54.4 (ExeBench) | Native C |
| SK²Decompile (Tan et al., 26 Sep 2025) | Two-phase IR | RL identifier naming | 69.0 (HumanEval) | Native C |
| SALT4Decompile (Wang et al., 18 Sep 2025) | SALT logic tree | Postproc symbol recovery | 58.7 (Decompile-Eval) | Native C |
| WaDec (She et al., 2024) | Loop slicing | String/var renaming | 43.55 (Wasm) | WebAssembly |
| StackSight (Fang et al., 2024) | Stack tracking | CoT prompting | 31.4 (HumanEval-X) | WebAssembly |
| SmartHalo (Liao et al., 15 Jan 2025) | Dependency graph | LLM semantic recovery | 60.2 | Solidity/EVM |
| ICL4Decomp (Wang et al., 3 Nov 2025) | In-context retrieval | Prompt-based | 40.2–54.3 | Native C |
This table highlights the diversity of strategies, which focus on control-flow clarity, identifier fidelity, or advanced contextualization; each yields quantifiable improvements over prior art on the metrics it targets.
In summary, end-to-end LLM decompilation has evolved to adopt hybrid, pipeline, and prompt-augmented architectures that explicitly encode structural, data-flow, and contextual cues for neural translation. While recent methods such as SK²Decompile and ReF Decompile push re-executability rates above 60% on challenging suites, full semantic correctness and generalizability remain open research problems, suggesting future advances through richer datasets, error-corrective loops, and further integration of symbolic and neural methods.