End-to-End LLM Decompilation
- End-to-end LLM decompilation is a neural translation process that converts binary or assembly code into human-readable high-level source code.
- It employs transformer-based models to recover control-flow, variable names, and type information, improving readability and re-executability over traditional decompilers.
- Recent methods integrate structure-augmented pipelines and in-context learning to tackle challenges such as compiler optimizations and semantic gaps.
End-to-end LLM decompilation is the task of directly translating raw binary or disassembled assembly code into high-level source representations, such as C or Solidity, using transformer-based neural models. Unlike traditional decompilers, which rely on deterministic heuristic pipelines, static analyses, or handcrafted pattern matching, LLM-powered decompilation is posed as a conditional generation problem, in which an LLM receives a linearized view of the low-level code and emits human-readable, compilable source. This paradigm is now at the center of a rapidly expanding research field, encompassing architectures, data and preprocessing pipelines, ground-truth recovery, static-dynamic hybridization, and benchmark methodology. The following sections enumerate key methodologies, challenges, tools, pipelines, quantitative findings, and forward-looking limits according to the arXiv literature.
1. Problem Formulation and Core Challenges
End-to-end LLM decompilation is defined as a mapping f : B → S, where B is a binary artifact (bytecode, assembly, or stripped code) and S is the reconstructed high-level equivalent in the target language. Unlike prior tools, end-to-end LLM decompilers use neural text generation without tight hand-engineered constraints (Tan et al., 2024).
Key challenges include:
- Semantic gap: Compilation is a lossy process, erasing control/data-flow cues, variable names, type annotations, and user-defined type definitions.
- Control-flow ambiguity: Direct assembly streams encode branches and loops as address-relative jumps, which are challenging to reverse into structured constructs without additional context (Feng et al., 17 Feb 2025).
- Variable and literal recovery: Data and constant values are frequently encoded outside the code (.rodata), removed, or merged, impeding their accurate inference (Feng et al., 17 Feb 2025).
- Compiler optimization artifacts: Aggressive optimizations (inlining, loop unrolling, coalescing of temporaries) flatten or obliterate high-level constructs, increasing the complexity of semantic recovery (Wang et al., 3 Nov 2025).
- Identifier and type resolution: Determining user-defined types, composite data structures, and function/field naming is highly nontrivial when reconstructing from binary-only views (Dramko et al., 6 Feb 2025, Tan et al., 26 Sep 2025).
2. Model Architectures, Data Pipelines, and Input Processing
A typical end-to-end LLM decompilation pipeline consists of data collection, binary/assembly preprocessing, prompt construction, and neural decoding. Methodologies differ chiefly in how they integrate control/data-flow information, symbolic recovery, context, or staged architectures.
2.1 Raw Translation Approaches
LLM4Decompile exemplifies straight-through, sequence-to-sequence translation models that map linearized assembly tokens (from objdump or similar) directly to C via autoregressive decoding (Tan et al., 2024). The model is trained on pairs of assembly and source at multiple optimization levels, relying on prompt templates that indicate, e.g., the specific optimization flag for each example.
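The prompt construction for such raw-translation models can be sketched as follows; the template wording and the tiny assembly snippet are illustrative assumptions, not the exact published LLM4Decompile template.

```python
# Sketch of raw-translation prompt construction. The template wording and
# the assembly snippet are illustrative assumptions, not the exact
# LLM4Decompile prompt.

def build_prompt(asm_text: str, opt_level: str) -> str:
    """Wrap linearized (objdump-style) assembly in a conditional-generation prompt."""
    return (
        f"# This is the assembly code compiled with {opt_level}:\n"
        f"{asm_text}\n"
        "# What is the source code?\n"
    )

asm = "push rbp\nmov rbp, rsp\nmov eax, edi\nimul eax, esi\npop rbp\nret"
prompt = build_prompt(asm, "-O2")
```

The resulting string would be fed to the autoregressive decoder, which emits the C candidate token by token.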
2.2 Structure-Augmented Methods
Methods such as ReF Decompile inject symbolic information through (a) relabeling (substituting all jump and data addresses with symbolic labels to preserve control-flow relations), and (b) a structured function call interface for explicit data extraction (literal recovery from .rodata guided by on-the-fly tool queries) (Feng et al., 17 Feb 2025).
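The relabeling idea can be sketched in a few lines; the simplified x86 listing and address format below are illustrative assumptions rather than ReF Decompile's actual implementation.

```python
import re

# Minimal sketch of relabeling: replace raw jump-target addresses with
# symbolic labels so control-flow relations survive linearization. The
# simplified "addr: instruction" listing format is an assumption.

def relabel(lines):
    targets = {}
    out = []
    # First pass: collect every address that is the target of a jump.
    for line in lines:
        m = re.search(r"\bj\w+\s+0x([0-9a-f]+)", line)
        if m and m.group(1) not in targets:
            targets[m.group(1)] = f"L{len(targets)}"
    # Second pass: emit a label at each target and rewrite jump operands.
    for line in lines:
        addr = line.split(":")[0].strip()
        if addr in targets:
            out.append(f"{targets[addr]}:")
        for a, lab in targets.items():
            line = line.replace(f"0x{a}", lab)
        out.append(line.split(":", 1)[-1].strip())
    return out

listing = [
    "1000: cmp edi, 0",
    "1003: jle 0x100a",
    "1005: mov eax, 1",
    "100a: ret",
]
result = relabel(listing)
print("\n".join(result))
```

After relabeling, the branch reads `jle L0` and the target carries an `L0:` label, so the model no longer has to resolve address arithmetic to see the loop/branch structure.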
SALT4Decompile constructs a Source-level Abstract Logic Tree (SALT), abstracting the CFG into a rooted, ordered tree to explicitly model structured logic (loops, conditionals); instruction normalization and nested loop detection are core steps before serializing SALT as the prompt for the LLM (Wang et al., 18 Sep 2025).
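A minimal sketch of serializing such a logic tree into the prompt, assuming a simple tuple-based node schema (the actual SALT representation is richer):

```python
# Illustrative serialization of a rooted, ordered logic tree in the spirit
# of SALT. Nodes are (kind, payload) tuples: "block" holds a statement
# string; other kinds hold an ordered list of children. The node schema is
# an assumption for illustration, not the paper's format.

def serialize(node, depth=0):
    pad = "  " * depth
    kind, payload = node
    if kind == "block":
        return pad + payload
    lines = [pad + kind.upper() + " {"]
    lines += [serialize(child, depth + 1) for child in payload]
    lines.append(pad + "}")
    return "\n".join(lines)

tree = ("loop", [
    ("block", "i = i + 1"),
    ("if", [("block", "break")]),
])
print(serialize(tree))
```

The nesting makes loop bodies and conditional arms explicit in the token stream, rather than leaving them implicit in jump offsets.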
2.3 Two-Phase Decoupled Pipelines
SK²Decompile introduces a split architecture: (1) a Structure Recovery model produces an IR capturing control-flow and data layouts with anonymized placeholders, and (2) an Identifier Naming stage that maps IR to source code with semantically meaningful names (Tan et al., 26 Sep 2025). Each model is optimized for structure or readability separately with RL-style objectives, and inference proceeds via explicit IR factorization.
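The factorized inference can be sketched with stubbed stages; in practice each stub would be a trained LLM, and the placeholder scheme below (FUNC_0, VAR_0) plus the hard-coded IR are illustrative assumptions showing only the data flow.

```python
# Stubbed sketch of a two-phase (structure, then naming) factorization.
# Each stage is a trained LLM in practice; these stubs show the hand-off only.

def structure_recovery(asm: str) -> str:
    # Stage 1: recover control flow and data layout with anonymized names.
    return "int FUNC_0(int VAR_0) { return VAR_0 * VAR_0; }"

def identifier_naming(ir: str, names: dict) -> str:
    # Stage 2: substitute semantically meaningful identifiers into the IR.
    for placeholder, name in names.items():
        ir = ir.replace(placeholder, name)
    return ir

ir = structure_recovery("mov eax, edi\nimul eax, edi\nret")
src = identifier_naming(ir, {"FUNC_0": "square", "VAR_0": "x"})
print(src)  # int square(int x) { return x * x; }
```

Separating the stages lets structure be optimized for correctness and naming for readability, as the paper's RL-style objectives do.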
2.4 Joint Code and Type Recovery
Idioms proposes joint sequence generation, where the LLM emits both the reconstructed source code and all referenced user-defined types, ensuring consistency in field, type, and function definitions (Dramko et al., 6 Feb 2025). Neighboring function traces are included in the prompt to address the scattered-evidence problem in type reconstruction.
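A minimal sketch of assembling such a neighbor-context prompt, assuming a hypothetical call-graph dictionary and window size:

```python
# Illustrative neighbor-context window: concatenate the target function's
# callers and callees so scattered type evidence is visible in one prompt.
# The call-graph dict, bodies dict, and window size k are assumptions.

def context_window(target, call_graph, bodies, k=2):
    edges = call_graph.get(target, {})
    neighbors = (edges.get("callers", []) + edges.get("callees", []))[:k]
    parts = [f"// neighbor: {n}\n{bodies[n]}" for n in neighbors]
    parts.append(f"// target: {target}\n{bodies[target]}")
    return "\n\n".join(parts)

graph = {"f": {"callers": ["main"], "callees": ["g"]}}
bodies = {"main": "int main() { return f(2); }",
          "g": "int g(int n) { return n + 1; }",
          "f": "int f(int n) { return g(n) * 2; }"}
prompt = context_window("f", graph, bodies)
```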
2.5 Modular Context-Augmented and Hybrid Models
ICL4Decomp integrates in-context learning with retrieval-based few-shot exemplars or optimization-aware rule descriptors injected into the prompt (Wang et al., 3 Nov 2025). The context may consist of retrieved assembly/source pairs from a large database (retrieval-based ICL), or natural-language summaries of compiler flags and their likely effects (rule-based ICL).
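Retrieval-based ICL can be sketched as follows; the lexical SequenceMatcher similarity and the two-entry database are stand-ins for the dense retrieval over a large exemplar corpus that a real system would use.

```python
from difflib import SequenceMatcher

# Sketch of retrieval-based in-context learning: fetch the most similar
# assembly/source exemplars and prepend them as few-shot context. The tiny
# database and lexical similarity measure are illustrative stand-ins.

DB = [
    ("mov eax, edi\nadd eax, esi\nret", "int add(int a, int b) { return a + b; }"),
    ("mov eax, edi\nimul eax, esi\nret", "int mul(int a, int b) { return a * b; }"),
]

def retrieve(query_asm, k=1):
    # Rank exemplars by similarity of their assembly to the query.
    ranked = sorted(DB, key=lambda p: SequenceMatcher(None, query_asm, p[0]).ratio(),
                    reverse=True)
    return ranked[:k]

def build_icl_prompt(query_asm):
    shots = "".join(f"Assembly:\n{a}\nSource:\n{s}\n\n" for a, s in retrieve(query_asm))
    return shots + f"Assembly:\n{query_asm}\nSource:\n"

prompt = build_icl_prompt("mov eax, edi\nimul eax, esi\nret")
```

Rule-based ICL would instead (or additionally) prepend natural-language descriptions of the suspected compiler flags and their typical effects.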
3. Datasets, Ground-truth Construction, and Evaluation Metrics
Advances in LLM decompilers are inseparable from dataset creation and measurement standards.
3.1 Scale and Quality Benchmarks
- Decompile-Bench delivers 2M binary–source pairs cleaned from 100M function extractions across permissively licensed C/C++ repos, spanning all major optimization levels and deduplicated for maximum ground-truth fidelity (Tan et al., 19 May 2025).
- Realtype focuses on realistic type complexity and nested user-defined data layouts to expand the code–type mapping challenge (Dramko et al., 6 Feb 2025).
- DecompileBench involves 23,400 functions from real OSS-Fuzz projects recompiled into function-level shared objects for robust, runtime-aware validation (Gao et al., 16 May 2025).
3.2 Evaluation and Metrics
Key metrics include:
- Re-executability Rate (RER): Fraction of decompiled programs that can be recompiled and pass all original tests, providing a functional measure: RER = (number of functions that recompile and pass all tests) / (total number of functions).
- Relative Readability Index (R2I): A structural readability score in [0, 1] derived from AST features.
- Edit Similarity: Textual edit similarity between the original source and the decompiled code.
- Coverage Equivalence Rate (CER): Branch-coverage equivalence for runtime-correctness (Gao et al., 16 May 2025).
- Elo-based code quality: Pairwise ranking of decompiler output by LLM-judge with 12 criteria (control flow, naming, memory layout).
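Two of these metrics can be sketched directly; difflib's ratio is a stand-in for whatever token-level edit measure a given paper uses, and re-executability is shown as a simple pass rate over per-function test outcomes (the recompile-and-run harness itself is out of scope here).

```python
from difflib import SequenceMatcher

# Sketch of two metrics: edit similarity (difflib ratio over whitespace
# tokens, a stand-in for per-paper token measures) and re-executability
# (pass rate over per-function recompile-and-test outcomes).

def edit_similarity(src, decompiled):
    return SequenceMatcher(None, src.split(), decompiled.split()).ratio()

def re_executability(passed):
    # Fraction of decompiled functions that recompiled and passed all tests.
    return sum(passed) / len(passed)

sim = edit_similarity("int add(int a, int b) { return a + b; }",
                      "int add(int x, int y) { return x + y; }")
rate = re_executability([True, True, False, True])
print(round(rate, 2))  # 0.75
```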
4. Methodological Innovations and Comparative Results
4.1 Control-flow and Data-flow Augmentation
- Label injection (ReF Decompile, SALT4Decompile) increases clarity and recoverability of structured constructs, boosting re-executability by up to 15 percentage points over plain assembly input (Feng et al., 17 Feb 2025, Wang et al., 18 Sep 2025).
- Logic tree abstraction (SALT) allows the LLM to internalize nested block structure and precise jump relationships, translating to robust performance under obfuscation and complex branching (Wang et al., 18 Sep 2025).
4.2 Type, Struct, and Symbol Naming
- Joint code and type prediction (Idioms) and decoupled identifier recovery (SK²Decompile) empirically outperform flat function-only baselines. SK²Decompile achieves a 69.0% average re-executability on HumanEval (O0–O3) and a 29.4% relative R2I gain over Idioms on GitHub2025 (Tan et al., 26 Sep 2025).
- Neighbor-context windows in Idioms, incorporating interprocedural callers and callees, yield a 63% increase in UDT composition accuracy (Dramko et al., 6 Feb 2025).
4.3 Prompt Engineering and In-Context Learning
- Rule-based guidance, as in ICL4Decomp, injects explicit summaries of optimization flags and their compilation impact, which, combined with semantically matched exemplars, increases re-executability rates by 40% over plain LLM decoding, especially at high optimization levels (O2, O3) (Wang et al., 3 Nov 2025).
4.4 Multi-Stage Postprocessing
SALT4Decompile and others apply a cascade of postprocessing stages:
- Compilation-error repair (feeding error logs into small LLMs for iterative fixing).
- Boundary correction (fixing off-by-one loop errors).
- Symbol renaming and comment insertion based on external LLMs (e.g., Claude-3.5-Sonnet).
These stages further raise executability, readability, and semantic fidelity (Wang et al., 18 Sep 2025).
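The compilation-error repair stage can be sketched as a tool-agnostic loop; both `compile_fn` (e.g., a gcc wrapper returning its error log) and `repair_model` (the small repair LLM) are hypothetical callables injected here for illustration.

```python
# Sketch of iterative compilation-error repair. Both compile_fn and
# repair_model are hypothetical injected callables: compile_fn(code)
# returns None on success or an error-log string; repair_model(code, log)
# returns a revised candidate.

def repair_loop(code, compile_fn, repair_model, max_rounds=3):
    for _ in range(max_rounds):
        errors = compile_fn(code)
        if errors is None:
            return code                    # compiles cleanly: done
        code = repair_model(code, errors)  # one repair round on the log
    return code

# Toy demonstration with stubbed tools: the "compiler" rejects a missing
# semicolon and the "model" appends one.
fake_compile = lambda c: None if c.endswith(";") else "error: expected ';'"
fake_repair = lambda c, e: c + ";"
fixed = repair_loop("return 0", fake_compile, fake_repair)
print(fixed)  # return 0;
```

Bounding the loop keeps cost predictable; in published pipelines the error log, not just the code, is what gives the repair model traction.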
5. Cross-Domain and Specialized Decompilation
LLM decompilers have been extended from native C/C++ binaries to new domains:
- WebAssembly: WaDec and StackSight employ loop-slicing, stack-symbolic analysis, and chain-of-thought prompting to generate high-fidelity C or C++ code from Wasm. WaDec achieves a recompilability rate of 52.11% and re-execution rate of 43.55%, outperforming Ghidra by close to two orders of magnitude (She et al., 2024, Fang et al., 2024).
- Smart Contracts: SmartHalo augments EVM-to-Solidity decompilation by building fine-grained dependency graphs, synthesizing tailored prompts for LLM-based recovery of method boundaries, types, and attributes, verified symbolically (precision: method boundaries 87.39%, variable types 90.39%) (Liao et al., 15 Jan 2025).
6. Limitations, Comparative Tradeoffs, and Future Directions
Current LLM-based end-to-end decompilers, while advancing readability and partial semantic recovery, often trail in functional correctness versus commercial or static tools; for instance, DecompileBench finds LLM-generated code is more understandable for analysts, but with 52.2% lower functional correctness rates than rule-based decompilers (Gao et al., 16 May 2025). Common failures include incorrect type inference, incomplete struct recovery, and logic errors in high-optimization targets (Tan et al., 19 May 2025, Tan et al., 26 Sep 2025).
Major open challenges:
- Scaling to aggressive optimizations and diverse platforms (ARM, embedded ISAs).
- Integrating dynamic analysis, formal verification, or self-corrective feedback.
- Extending multi-function and interprocedural recovery, especially for whole-program binaries (Gao et al., 16 May 2025, Tan et al., 26 Sep 2025).
- Automatic error localization and iterative repair loops via compiler feedback (Wang et al., 18 Sep 2025).
- Closing the semantic gap between neural plausibility and ground-truth logic fidelity (Gao et al., 16 May 2025).
7. Representative Pipelines and Results: Summary Table
| Approach | Control-Flow Recovery | Type/Identifier Recovery | Re-executability (%) | Domain |
|---|---|---|---|---|
| LLM4Decompile (Tan et al., 2024) | None | No | 21.4 (O0–O3 avg) | Native C |
| ReF Decompile (Feng et al., 17 Feb 2025) | Relabeling | Tool-based (rodata) | 61.43 | Native C |
| Idioms (Dramko et al., 6 Feb 2025) | Call graph context | Joint code+types | 54.4 (ExeBench) | Native C |
| SK²Decompile (Tan et al., 26 Sep 2025) | Two-phase IR | RL identifier naming | 69.0 (HumanEval) | Native C |
| SALT4Decompile (Wang et al., 18 Sep 2025) | SALT logic tree | Postproc symbol recovery | 58.7 (Decompile-Eval) | Native C |
| WaDec (She et al., 2024) | Loop slicing | String/var renaming | 43.55 (Wasm) | WebAssembly |
| StackSight (Fang et al., 2024) | Stack tracking | CoT prompting | 31.4 (HumanEval-X) | WebAssembly |
| SmartHalo (Liao et al., 15 Jan 2025) | Dependency graph | LLM semantic recovery | 60.2 | Solidity/EVM |
| ICL4Decomp (Wang et al., 3 Nov 2025) | In-context retrieval | Prompt-based | 40.2–54.3 | Native C |
This table highlights the diversity of strategies, which focus on control-flow clarity, identifier fidelity, or advanced contextualization; each yields quantifiable improvements over prior art on the metrics it targets.
In summary, end-to-end LLM decompilation has evolved to adopt hybrid, pipeline, and prompt-augmented architectures that explicitly encode structural, data-flow, and contextual cues for neural translation. While recent methods such as SK²Decompile and ReF Decompile push re-executability rates above 60% on challenging suites, full semantic correctness and generalizability remain open research problems, suggesting future advances through richer datasets, error-corrective loops, and further integration of symbolic and neural methods.