CRUST-Bench: Automated C-to-safe-Rust Transpilation
- CRUST-Bench is a comprehensive benchmark for evaluating automated transpilation from legacy C code to safe, idiomatic Rust.
- It features 100 curated, multi-file C repositories, each paired with manually crafted Rust interfaces and translated test cases.
- Its rigorous protocol—enforcing strict interface adherence, compilation success, and full test pass—highlights current LLM limitations and potential improvements.
CRUST-Bench is a comprehensive, open benchmark and methodology for evaluating automated C-to-safe-Rust transpilation in the context of legacy software migration, memory safety, and large-scale software modernization. It uniquely targets end-to-end translation of real-world, multi-file C projects into idiomatic, memory-safe Rust—enforced by strict interface adherence and exhaustive test harnesses—offering a rigorous, reproducible framework for assessing both translation systems and the LLMs that increasingly drive them (Khatry et al., 21 Apr 2025).
1. Motivation and Benchmark Scope
Modernizing C code for safety-critical infrastructure requires transforming legacy C programs—frequently error-prone due to unchecked memory abstractions—into equivalents in Rust, a language with compile-time guarantees for memory safety. Prior datasets (e.g., HumanEval, TransCoder) are limited to toy or single-function tasks and often permit unsafe blocks or FFI calls, thus failing to address:
- Project-level dependencies and multi-file abstractions.
- Interface-level constraints enforcing safe Rust idioms.
- Semantic correctness validated by test harnesses.
CRUST-Bench provides 100 real-world C repositories (average 958 LOC, multi-file, each with tests), paired with:
- Explicit, manually authored safe Rust interfaces specifying signatures, types, and ownership semantics.
- Rust test cases translated from their C analogs, exercising and validating the transpiled code under Rust's borrow and type systems.

The framework requires conformance to four constraints: correct interface implementation, successful compilation under Rust's borrow checker, absence of `unsafe` or FFI calls at the interface, and functional correctness as enforced by full test-suite execution.
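To make the constraints concrete, here is a minimal hypothetical illustration (not an actual CRUST-Bench task) of the kind of translation the benchmark demands: a C-style dynamic buffer rewritten as safe, idiomatic Rust with no raw pointers and no `unsafe`.

```rust
// Hypothetical example: the C original
//
//     typedef struct { int *data; size_t len, cap; } IntVec;
//     void intvec_push(IntVec *v, int x);
//
// becomes safe Rust in which ownership is explicit and memory
// management is handled by the type system rather than by hand.

pub struct IntVec {
    data: Vec<i32>, // replaces the manually managed (ptr, len, cap) triple
}

impl IntVec {
    pub fn new() -> Self {
        IntVec { data: Vec::new() }
    }

    /// `&mut self` encodes the exclusive mutable access that the C
    /// version left implicit in its `IntVec *` parameter.
    pub fn push(&mut self, x: i32) {
        self.data.push(x);
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }
}

fn main() {
    let mut v = IntVec::new();
    v.push(1);
    v.push(2);
    println!("{}", v.len()); // prints 2
}
```

The translated code must compile under the borrow checker and pass the ported tests, not merely mirror the C structure.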
2. Repository Curation and Annotation Pipeline
Candidate repositories are drawn from GitHub (2005–2025) and filtered by:
- Language: Pure C, multi-file, with at least one instance of dynamic memory, and buildable cross-platform via Make/CMake.
- Testability: Presence of an original C test suite ensuring code coverage (mean 76.4 tests/repo, 67% average coverage).
- Portability: Successful compilation under GCC 11.4.0 and Clang 14.0.0 (x86/x64).
Each C repo is manually annotated by expert annotators:
- Interface Extraction: All custom types (structs/enums) are mapped into Rust types with explicit ownership/borrowing, and function signatures are provided, each body filled with `unimplemented!()` so that syntactic/semantic form can be validated via `rustc`.
- Test Transcription: C test logic is ported to Rust, targeting the interface and invoking its functions to ensure coverage. Pilot validation on a 20-benchmark subset further stress-tests the pipeline's completeness.
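A sketch of what an extracted interface file might look like (all names here are hypothetical, not taken from a real CRUST-Bench interface): custom C types mapped to owned Rust types, and every function stubbed with `unimplemented!()` so `rustc`/`cargo check` can validate the signatures before any translation is attempted.

```rust
// Hypothetical interface stub in the style the annotation pipeline
// produces: types carry explicit ownership, bodies are placeholders.

pub struct Buffer {
    pub bytes: Vec<u8>, // owned storage replacing `unsigned char *` + length
}

pub enum ParseError {
    UnexpectedEof,
    BadHeader,
}

/// Borrows the buffer immutably; ownership stays with the caller.
pub fn checksum(_buf: &Buffer) -> u32 {
    unimplemented!()
}

/// Takes `&mut` because the C original mutated the buffer in place.
pub fn normalize(_buf: &mut Buffer) -> Result<(), ParseError> {
    unimplemented!()
}

fn main() {
    // The stub compiles and typechecks even though no body is implemented.
    let buf = Buffer { bytes: vec![0xDE, 0xAD] };
    println!("interface stub compiles; buffer holds {} bytes", buf.bytes.len());
}
```

Because the stubs typecheck, a transpiler's output can be diffed against a known-good signature set before any semantic evaluation.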
| Statistic | Average per Repo | Maximum Observed |
|---|---|---|
| LOC | 958 | 25,436 |
| Tests | 76.4 | 952 |
| Functions | 34.6 | 418 |
| Pointer Deref | 264 | 12,664 |
| Interface Files | 3.0 | 21 |
| Interface Functions | 30.9 | 415 |
| Function Arguments | 57.2 | (not given) |
3. Rust Interface Structure and Principles
Formally, for a C repository C, the target is a Rust repository R, and the interface is defined as I = (D, F), with D the custom datatypes (structs/enums) and F the function signatures, covering all project files. Design mandates:
- No `unsafe` blocks or FFI/libc usage in the interface itself.
- Ownership via `&T`, `&mut T`, `Vec<u8>`, idiomatic Rust module pathing, and naming conventions.
- Strict signature and borrow checking, enforced by `cargo check` and a typechecker pass.
Empirically:
- 56% of interface functions receive reference arguments; 30% use `&mut`.
- 44% of arguments and 50% of return values involve custom types.
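The ownership mandates above can be sketched with a hypothetical API (names are illustrative, not from the benchmark), showing how the common C pointer idioms map onto safe Rust signatures:

```rust
// Hypothetical interface fragment showing the pointer-to-ownership mapping.

pub struct Config {
    pub name: String,
}

/// `const config_t *` in C  ->  shared borrow in Rust.
pub fn describe(cfg: &Config) -> String {
    format!("config: {}", cfg.name)
}

/// `config_t *` mutated in place  ->  exclusive `&mut` borrow.
pub fn rename(cfg: &mut Config, new_name: &str) {
    cfg.name = new_name.to_string();
}

/// `unsigned char *buf` + length returned to the caller  ->  owned `Vec<u8>`.
pub fn serialize(cfg: &Config) -> Vec<u8> {
    cfg.name.as_bytes().to_vec()
}

fn main() {
    let mut cfg = Config { name: "demo".to_string() };
    rename(&mut cfg, "renamed");
    println!("{}", describe(&cfg)); // prints "config: renamed"
}
```

Each signature choice (`&T`, `&mut T`, owned value) encodes an aliasing and lifetime contract that the C original expressed only informally.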
4. Test Case Construction and Validation
Functional correctness is established by rigorous Rust test harnesses:
- All interface functions are covered by ported `#[test]` units.
- Tests may introduce `unsafe` blocks for validation, but the interfaces and code under test must be safe.
- Validation sequentially applies `cargo check` (interface), `cargo build` (post-translation), and `cargo test` (test harness execution) for all implementations.
- Continuous integration automates execution and result reporting.
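A ported test might look like the following sketch (the function under test is hypothetical): a C `assert` is transcribed into a Rust `#[test]` that exercises the safe interface.

```rust
// Safe code under test: a saturating add replacing C's UB-prone
// signed-overflow behavior (hypothetical example function).
pub fn add_sat(a: i32, b: i32) -> i32 {
    a.saturating_add(b)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn overflow_saturates() {
        // C original: assert(add_sat(INT_MAX, 1) == INT_MAX);
        assert_eq!(add_sat(i32::MAX, 1), i32::MAX);
    }

    #[test]
    fn plain_addition() {
        assert_eq!(add_sat(2, 3), 5);
    }
}

fn main() {
    println!("{}", add_sat(i32::MAX, 1) == i32::MAX); // prints "true"
}
```

Because the harness targets only public interface functions, it doubles as a check that the transpiler preserved the required signatures.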
5. Evaluation Protocols and Metrics
A transpiled output (whether from a dedicated tool or an LLM) is considered correct only if three criteria are simultaneously met:
- Interface Equivalence: All signatures, types, and module/file structures exactly match the provided interface.
- Compilation: The generated Rust compiles under `rustc` and passes borrow checking.
- Test Pass: All provided tests pass.
LLM prompting scenarios tested:
- Single-Shot (pass@1): Greedy decode of the entire codebase in one pass.
- Repair Loops: Up to 3 rounds of repair, using compiler (`cargo build`) diagnostics or test-failure outputs.
- Agentic Pipeline: Edit-compile-test loops driven by SWE-agent, with resource-budget termination.
Metrics reported:
- Build Success Rate: % of repos where `cargo build` succeeds.
- Test Pass Rate: % of repos with all tests passing (strict pass@1).
- Repair Improvement: Incremental gain post repair-agent intervention.
6. Experimental Results and Error Taxonomy
CRUST-Bench exposes the limitations of current LLM-based transpilation for C→safe-Rust at the repository level:
| Model | Build (pass@1) | Test (pass@1) | Build (Comp. Repair) | Test (Comp. Repair) | Test (Test Repair) |
|---|---|---|---|---|---|
| o1 | 32% | 15% | 69% | 28% | 37% |
| Claude 3.7 | 26% | 13% | 54% | 23% | 32% |
| Claude 3.5 | 26% | 11% | ... | ... | ... |
| o1 mini | 19% | 9% | ... | ... | ... |
| GPT-4o | 18% | 7% | ... | ... | ... |
Pipelined repair agents (SWE-agent) do not outperform test-repair loops. Notably, even the strongest model (o1+Test repair) achieves only a 37% test pass rate.
Major error modes (based on rustc clustering):
- Type mismatches (e.g., passing `Vec<u8>` for `&[u8]`)
- Borrow errors (e.g., mutable/immutable conflicts)
- Missing imports / out-of-scope variables
- Unimplemented bodies (token exhaustion)
- Missing trait derivations (e.g., `Clone`/`Debug`)
- Argument-count mistakes
- Rogue `unsafe` (explicitly forbidden, and rare due to prompt design)
Repair loops are most effective for type/borrow and trait errors, with diminishing returns on incomplete implementations constrained by token limits.
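The two most common error modes can be illustrated with a small hypothetical snippet, together with the fixes a repair loop typically converges on:

```rust
// Hypothetical examples of a type mismatch and a borrow conflict,
// shown in already-fixed form with the error explained in comments.

fn digest(data: &[u8]) -> u32 {
    data.iter().map(|&b| b as u32).sum()
}

fn main() {
    let buf: Vec<u8> = vec![1, 2, 3];

    // Type mismatch: `digest(buf)` passes `Vec<u8>` where `&[u8]` is
    // expected. The fix is to borrow: deref coercion turns `&Vec<u8>`
    // into `&[u8]`.
    let d = digest(&buf);

    let mut v = vec![10u8];
    // Borrow error: holding an immutable borrow of `v` across `v.push(..)`
    // (a mutable borrow) is rejected. Copying the element out first ends
    // the immutable borrow before the mutation.
    let first = v[0]; // `u8` is `Copy`, so no borrow outlives this line
    v.push(20);

    println!("{} {}", d, first); // prints "6 10"
}
```

Both fixes are local and mechanical, which is consistent with repair loops being most effective on this error class.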
7. Insights, Limitations, and Recommendations
CRUST-Bench reveals that current LLMs and automated transpilers fall significantly short of reliable, idiomatic C→safe-Rust translation at the project level. Key insights:
- Explicit, compiler-validated Rust interfaces anchor the code-generation process.
- Token budget and context limitations frequently truncate multi-file implementations; finer-grained or hierarchical prompting appears necessary.
- Integrating static analysis, such as ownership and lifetime reasoning or compiler IR feedback, into repair loops is likely to reduce type-system errors.
- Pipelined agentic workflows are not inherently superior to targeted repair if lacking deep domain-specific strategies.
CRUST-Bench thus establishes a rigorous, objective baseline for progress in code transpilation, directly connecting advances to software security, maintainability, and the acceleration of legacy systems migration (Khatry et al., 21 Apr 2025).