Decompile-Bench: Empirical Binary Decompilation
- Decompile-Bench is a large-scale, leakage-resistant dataset paired with an evaluation framework, providing two million binary–source function pairs from real-world C/C++ repositories.
- It employs a robust Compile-Trace-Filter pipeline to accurately align binary functions with their source counterparts despite compiler optimizations.
- The accompanying evaluation suite, Decompile-Bench-Eval, enables reproducible benchmarking for LLM-based, statistical, and rule-based decompilers while mitigating data leakage.
Decompile-Bench is a large-scale, leakage-resistant, and highly curated dataset and evaluation framework that establishes a new standard for empirical research in binary decompilation and reverse engineering. Developed to overcome the limitations of synthetic or contest-style decompilation benchmarks, it provides two million real-world binary–source function pairs derived from permissively licensed, production-grade C/C++ GitHub repositories, together with a robust evaluation suite ("Decompile-Bench-Eval") designed to mitigate data leakage and reflect realistic software artifacts. By enabling principled, scalable benchmarking of statistical, LLM-based, and traditional rule-based decompilers, Decompile-Bench supports functionally meaningful, semantically rich, and reproducible research in a field critical to software security, malware analysis, and program comprehension (Tan et al., 19 May 2025).
1. Dataset Construction and Filtering Pipeline
Decompile-Bench is built from 3,961 open-source repositories, each with at least one GitHub star and a valid CMakeLists.txt, to ensure both project diversity and minimal build friction. These repositories are compiled at four optimization levels (O0–O3) with debug information enforced, resulting in approximately 450 GB of binaries and an initial pool of 100 million binary functions.
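As an illustration of the compile stage, the sketch below generates one clang invocation per optimization level with debug information forced. The exact driver flags and dependency-resolution logic of the published pipeline are not reproduced here, so the command shape is an assumption.

```python
# One build command per optimization level; "-g" forces DWARF debug
# info so the line tables needed for tracing survive compilation.
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

def build_commands(source_file, out_stem):
    """Return one illustrative clang command per optimization level."""
    cmds = []
    for opt in OPT_LEVELS:
        cmds.append([
            "clang", opt, "-g",        # debug info is mandatory for tracing
            "-c", source_file,
            "-o", f"{out_stem}{opt}.o",
        ])
    return cmds

cmds = build_commands("foo.c", "foo")
print(len(cmds))    # 4 (one binary per optimization level)
```

In the real pipeline, patched Clang drivers enforce these flags project-wide rather than per-file.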
The central challenge is function-level alignment between binaries and their originating source functions, complicated by common compiler transformations such as inlining, dead-code elimination, and reordering. To address this, Decompile-Bench introduces the CTF (Compile-Trace-Filter) pipeline:
- Compile: Projects are compiled under patched Clang drivers enforcing the desired flags. Dependencies are resolved automatically using a CMake parser combined with GPT-synthesized CLI commands, enabling highly automated mass compilation.
- Trace: Each binary function is mapped to candidate source functions via DWARF debugging information. For each binary function $f_b$ in a binary, the set of source-code lines attributed to $f_b$ (func_segment) is identified. Using Tree-sitter, each line is traced to its enclosing source function $f_s$. The final source match is the candidate maximizing line overlap with func_segment, as formalized by
$f_s^* = \arg\max_{f_s \in \text{Candidates}} \left| \text{func\_segment} \cap \text{Lines}(f_s) \right|$
This process repairs fragmented and reordered DWARF mappings.
- Filter: Three filters are successively applied:
- Project scope: remove functions defined in headers or external dependencies.
- In-binary deduplication: when a binary function maps to multiple source functions, retain only the one with the greatest line overlap.
- Cross-binary deduplication: employ MinHash–LSH to cluster and remove near-duplicate binary–source pairs across the corpus.
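The Trace step's overlap maximization can be sketched as follows; function names and line ranges are hypothetical, and real DWARF parsing is replaced by precomputed line sets.

```python
def trace_source_function(func_segment, candidates):
    """Pick the source function whose lines overlap func_segment most.

    func_segment : set of source line numbers that DWARF attributes
                   to one binary function
    candidates   : dict mapping source function name -> set of its
                   source line numbers (as recovered by Tree-sitter)
    """
    best_name, best_overlap = None, 0
    for name, lines in candidates.items():
        overlap = len(func_segment & lines)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name

# An inlined helper contributes lines 10-12, but the enclosing caller
# spans lines 10-30, so overlap maximization attributes the binary
# function to the caller, not the inlined helper.
segment = set(range(10, 31))
cands = {"helper": set(range(10, 13)), "caller": set(range(10, 31))}
print(trace_source_function(segment, cands))  # caller
```

This is also how inlining is handled: inlined lines are simply outvoted by the lines of the enclosing function.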
These stages reduce the raw collection to a high-quality corpus of 2 million unique binary–source function pairs. The released dataset (source + assembly) is compressed to ≈30 GB, while the raw binaries occupy ≈450 GB (Tan et al., 19 May 2025).
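Cross-binary deduplication via MinHash can be illustrated with a self-contained sketch. A production pipeline would additionally bucket signatures with LSH to avoid all-pairs comparison; the tokenization and parameters here are assumptions.

```python
import hashlib

NUM_PERM = 64  # number of simulated hash functions

def minhash(tokens, num_perm=NUM_PERM):
    """Simplified MinHash signature: for each salted hash function,
    keep the minimum hash value over the token set."""
    sig = []
    for i in range(num_perm):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + t.encode(), digest_size=8).digest(),
                "big")
            for t in tokens))
    return tuple(sig)

def est_jaccard(a, b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Near-duplicate token sets collide in most signature slots; unrelated
# functions collide in almost none.
s1 = minhash({"int", "add", "a", "b", "return", "a+b"})
s2 = minhash({"int", "add", "a", "b", "return", "a-b"})
s3 = minhash({"void", "memcpy", "dst", "src", "n"})
print(est_jaccard(s1, s2) > est_jaccard(s1, s3))  # True
```

Pairs whose estimated similarity exceeds a chosen threshold would then be clustered and collapsed to a single representative.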
2. Data Properties and Characteristics
The final dataset emphasizes binary functions of moderate complexity (5–60 lines), optimal for LLM training and evaluation. Comparison of code metrics demonstrates that the filtered data better matches real-world software: average cyclomatic complexity rises from 3.3 (raw) to 4.5 (clean), and Halstead difficulty from 11.9 to 19.7, values slightly above the broader open-source baseline (cyclomatic ≈ 3.6; Halstead ≈ 16.3).
Filtering is aggressive: header filtering removes 45% of raw pairs, in-binary deduplication a further 20%, and cross-binary deduplication another 32%, leaving 2% of pairs in the released corpus. Functions at all four optimization levels are equitably represented. Inlining is not discarded but handled during tracing: if source lines from inlined code appear within another function, overlap maximization ensures correct attribution (Tan et al., 19 May 2025).
3. Leakage-Resistant Evaluation and Held-Out Benchmark
Decompile-Bench-Eval is the evaluation suite derived to provide unbiased, leakage-resistant, and realistic assessment. It contains:
- Manually translated C/C++ ports (plus unit tests) of the HumanEval and MBPP Python programming benchmarks: these are widely used as cross-model task benchmarks, here fully ported and compiled at optimization levels O0–O3.
- The "GitHub2025" branch: all permissively licensed C/C++ repositories created after 2025, with extraneous third-party directories stripped out to eliminate training/test overlap.
This ensures no data overlap between the training corpus and evaluation set, hence mitigating one of the main confounds in LLM decompiler benchmarking (Tan et al., 19 May 2025).
4. Benchmarking Methodology and Metrics
Methodologically, Decompile-Bench supports rigorous multi-axis evaluation, including:
- Re-executability Rate: The fraction of decompiled programs whose behavior matches the reference implementation on all test inputs.
- R2I (Relative Readability Index) [Eom et al. 2024]: Analyzes 31 code features (e.g., naming, parsing, idiomatic constructs) from a function's AST and aggregates them into a normalized score. Higher values indicate output approaching human-written code.
- Edit Similarity: The normalized Levenshtein distance, $1 - \mathrm{Lev}(x, y) / \max(|x|, |y|)$, measuring source-level similarity between decompiled and reference code.
- Auxiliary metrics include CodeSage (embedding similarity) and CodeBLEU.
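A minimal implementation of the edit-similarity metric, assuming the standard definition of normalized Levenshtein distance:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """1 - normalized Levenshtein distance; 1.0 means identical sources."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(edit_similarity("int add(int a,int b)", "int add(int a,int b)"))  # 1.0
```

In practice the metric is computed over normalized (e.g., whitespace-canonicalized) source text.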
For LLM benchmarking, the recommended protocol is fine-tuning on the filtered pairs, using greedy decoding for decompilation, and reporting per-function metrics on the held-out test sets (Tan et al., 19 May 2025).
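Given per-function unit-test outcomes, the re-executability rate reduces to the fraction of functions passing all of their tests. A minimal sketch (the harness that actually recompiles and runs the decompiled code is omitted):

```python
def reexecutability_rate(test_results):
    """Fraction of decompiled functions that pass *all* unit tests.

    test_results: one list of booleans per function, one entry per test;
    a function counts only if every test against the reference passes.
    """
    if not test_results:
        return 0.0
    passed = sum(all(results) for results in test_results)
    return passed / len(test_results)

# 3 functions: the first and third pass all tests, the second fails one.
print(reexecutability_rate([[True, True], [True, False], [True]]))  # 2/3
```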
5. Empirical Results and Comparisons
Fine-tuning LLM4Decompile-End (1.3B parameters) on just 10% of Decompile-Bench yields substantial improvements across all main axes:
| Metric | HumanEval (Base) | HumanEval (Decompile-Bench) | Relative Gain |
|---|---|---|---|
| Re-executability | 16.22% | 20.89% | +28.8% |
| Readability (R2I) | 74.35 | 77.46 | +4.1% |
| Edit similarity | 38.34 | 46.22 | +34.8% |
Similar improvements are seen on MBPP and GitHub2025. For binary-source search (contrastive retrieval), recall@1 ≈ 27% is achieved, competitive with commercial tools.
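The binary-source retrieval metric can be sketched as follows: recall@1 asks whether each binary embedding's nearest source embedding (by cosine similarity) is its own ground-truth match. The embedding values below are toy assumptions.

```python
def recall_at_1(bin_embs, src_embs):
    """recall@1 for binary -> source retrieval: for each binary embedding,
    is the same-index source embedding the nearest by cosine similarity?"""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = sum(x * x for x in u) ** 0.5
        nv = sum(y * y for y in v) ** 0.5
        return dot / (nu * nv)

    hits = 0
    for i, b in enumerate(bin_embs):
        best = max(range(len(src_embs)), key=lambda j: cos(b, src_embs[j]))
        hits += (best == i)
    return hits / len(bin_embs)

bins = [[1.0, 0.1], [0.1, 1.0]]
srcs = [[0.9, 0.2], [0.0, 1.0]]
print(recall_at_1(bins, srcs))  # 1.0
```

A contrastive encoder trained on the aligned pairs would supply the embeddings; the metric itself is model-agnostic.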
Ablation studies confirm that only the carefully filtered Decompile-Bench data improves LLM decompiler performance: using raw (unfiltered) pairs degrades results by up to 11%, and existing datasets like ExeBench yield no gain. This suggests that alignment and deduplication are critical for effective LLM decompiler training (Tan et al., 19 May 2025).
6. Limitations and Recommendations
While Decompile-Bench is the largest public binary–source corpus of its kind, several caveats exist:
- Only 10% of the corpus was used for model fine-tuning due to compute constraints; full-corpus pretraining is currently cost-prohibitive.
- Project-scope, whole-binary decompilation is desirable but remains infeasible due to sequence-length and hardware limits.
- Non-permissive (e.g., GPL-only or proprietary) code is excluded for legal reasons; extension would require careful policy and compliance.
- Some compiler transformations (loop unrolling, tail-call optimizations, aggressive interprocedural rewriting) may still foil the trace algorithm and misalign mappings.
- Semantic equivalence checking using symbolic execution is proposed for harder cases.
For reproducibility and best practice, users are advised to employ the full CTF pipeline, replicate debug symbol policies, apply cross-binary deduplication, and benchmark exclusively on the provided held-out suite (Tan et al., 19 May 2025).
7. Applications, Impact, and Extensions
Decompile-Bench provides a foundational resource for advancing LLM-based and classic decompiler methodologies. It serves as both a large-scale pretraining corpus and an unbiased evaluation reference point, enabling research in binary–source alignment, function retrieval, type and API inference, cross-compilation search, and decompiler robustness analysis. Its leakage-resistant partitioning and real-world program coverage have made it suitable for empirical studies, as in the systematic evaluation of Rust decompilation (Zhou, 24 Jul 2025) and comparative framework analyses featuring DecompileBench and other standards (Gao et al., 16 May 2025).
All data, code, and scripts are publicly released via HuggingFace and GitHub. This open infrastructure is expected to accelerate research in reverse engineering, binary program understanding, and related domains (Tan et al., 19 May 2025).