CRUST-Bench: Automated C-to-safe-Rust Transpilation
- CRUST-Bench is a comprehensive benchmark for evaluating automated transpilation from legacy C code to safe, idiomatic Rust.
- It features 100 curated, multi-file C repositories, each paired with manually crafted Rust interfaces and translated test cases.
- Its rigorous protocol—enforcing strict interface adherence, compilation success, and full test pass—highlights current LLM limitations and potential improvements.
CRUST-Bench is a comprehensive, open benchmark and methodology for evaluating automated C-to-safe-Rust transpilation in the context of legacy software migration, memory safety, and large-scale software modernization. It uniquely targets end-to-end translation of real-world, multi-file C projects into idiomatic, memory-safe Rust—enforced by strict interface adherence and exhaustive test harnesses—offering a rigorous, reproducible framework for assessing both translation systems and the LLMs that increasingly drive them (Khatry et al., 21 Apr 2025).
1. Motivation and Benchmark Scope
Modernizing C code for safety-critical infrastructure requires transforming legacy C programs—frequently error-prone due to unchecked memory abstractions—into equivalents in Rust, a language with compile-time guarantees for memory safety. Prior datasets (e.g., HumanEval, TransCoder) are limited to toy or single-function tasks and often permit unsafe blocks or FFI calls, thus failing to address:
- Project-level dependencies and multi-file abstractions.
- Interface-level constraints enforcing safe Rust idioms.
- Semantic correctness validated by test harnesses.
CRUST-Bench provides 100 real-world C repositories (average 958 LOC, multi-file, each with tests), paired with:
- Explicit, manually authored safe Rust interfaces specifying signatures, types, and ownership semantics.
- Rust test cases translated from their C analogs, exercising and validating the transpiled code under Rust's borrow and type systems.

The framework requires conformance to four constraints: correct interface implementation, successful compilation under Rust's borrow checker, absence of `unsafe` or FFI calls at the interface, and functional correctness as enforced by full test-suite execution.
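To make the constraints concrete, here is a minimal hypothetical illustration (not an actual CRUST-Bench task) of the kind of translation the benchmark demands: a C-style dynamic buffer rewritten as safe, idiomatic Rust with no raw pointers and no `unsafe`.

```rust
// Hypothetical example: the C original
//
//     typedef struct { int *data; size_t len, cap; } IntVec;
//     void intvec_push(IntVec *v, int x);
//
// becomes safe Rust in which ownership is explicit and memory
// management is handled by the type system rather than by hand.

pub struct IntVec {
    data: Vec<i32>, // replaces the manually managed (ptr, len, cap) triple
}

impl IntVec {
    pub fn new() -> Self {
        IntVec { data: Vec::new() }
    }

    /// `&mut self` encodes the exclusive mutable access that the C
    /// version left implicit in its `IntVec *` parameter.
    pub fn push(&mut self, x: i32) {
        self.data.push(x);
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }
}

fn main() {
    let mut v = IntVec::new();
    v.push(1);
    v.push(2);
    println!("{}", v.len()); // prints 2
}
```

The translated code must compile under the borrow checker and pass the ported tests, not merely mirror the C structure.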
2. Repository Curation and Annotation Pipeline
Candidate repositories are drawn from GitHub (2005–2025) and filtered by:
- Language: Pure C, multi-file, with at least one instance of dynamic memory, and buildable cross-platform via Make/CMake.
- Testability: Presence of an original C test suite ensuring code coverage (mean 76.4 tests/repo, 67% average coverage).
- Portability: Successful compilation under GCC 11.4.0 and Clang 14.0.0 (x86/x64).
Each C repo is manually annotated by expert annotators:
- Interface Extraction: All custom types (structs/enums) are mapped into Rust types with explicit ownership/borrowing, and function signatures are provided, each body filled with `unimplemented!()` so that syntactic/semantic form can be validated via `rustc`.
- Test Transcription: C test logic is ported to Rust, targeting the interface and invoking its functions to ensure coverage. Pilot validation on a 20-benchmark subset further stress-tests the pipeline's completeness.
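A sketch of what an extracted interface file might look like (all names here are hypothetical, not taken from a real CRUST-Bench interface): custom C types mapped to owned Rust types, and every function stubbed with `unimplemented!()` so `rustc`/`cargo check` can validate the signatures before any translation is attempted.

```rust
// Hypothetical interface stub in the style the annotation pipeline
// produces: types carry explicit ownership, bodies are placeholders.

pub struct Buffer {
    pub bytes: Vec<u8>, // owned storage replacing `unsigned char *` + length
}

pub enum ParseError {
    UnexpectedEof,
    BadHeader,
}

/// Borrows the buffer immutably; ownership stays with the caller.
pub fn checksum(_buf: &Buffer) -> u32 {
    unimplemented!()
}

/// Takes `&mut` because the C original mutated the buffer in place.
pub fn normalize(_buf: &mut Buffer) -> Result<(), ParseError> {
    unimplemented!()
}

fn main() {
    // The stub compiles and typechecks even though no body is implemented.
    let buf = Buffer { bytes: vec![0xDE, 0xAD] };
    println!("interface stub compiles; buffer holds {} bytes", buf.bytes.len());
}
```

Because the stubs typecheck, a transpiler's output can be diffed against a known-good signature set before any semantic evaluation.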
| Statistic | Average per Repo | Maximum Observed |
|---|---|---|
| LOC | 958 | 25,436 |
| Tests | 76.4 | 952 |
| Functions | 34.6 | 418 |
| Pointer Deref | 264 | 12,664 |
| Interface Files | 3.0 | 21 |
| Interface Functions | 30.9 | 415 |
| Function Arguments | 57.2 | (not given) |
3. Rust Interface Structure and Principles
Formally, for a C repository C, the target is a Rust repository R, and the interface is defined as I = (D, F), with D the custom datatypes (structs/enums) and F the function signatures, covering all project files. Design mandates:
- No `unsafe` blocks or FFI/libc usage in the interface itself.
- Ownership via `&T`, `&mut T`, `Vec<u8>`, idiomatic Rust module pathing, and naming conventions.
- Strict signature and borrow checking, enforced by `cargo check` and a typechecker pass.
Empirically:
- 56% of interface functions receive reference arguments; 30% use `&mut`.
- 44% of arguments and 50% of return values involve custom types.
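The ownership mandates above can be sketched with a hypothetical API (names are illustrative, not from the benchmark), showing how the common C pointer idioms map onto safe Rust signatures:

```rust
// Hypothetical interface fragment showing the pointer-to-ownership mapping.

pub struct Config {
    pub name: String,
}

/// `const config_t *` in C  ->  shared borrow in Rust.
pub fn describe(cfg: &Config) -> String {
    format!("config: {}", cfg.name)
}

/// `config_t *` mutated in place  ->  exclusive `&mut` borrow.
pub fn rename(cfg: &mut Config, new_name: &str) {
    cfg.name = new_name.to_string();
}

/// `unsigned char *buf` + length returned to the caller  ->  owned `Vec<u8>`.
pub fn serialize(cfg: &Config) -> Vec<u8> {
    cfg.name.as_bytes().to_vec()
}

fn main() {
    let mut cfg = Config { name: "demo".to_string() };
    rename(&mut cfg, "renamed");
    println!("{}", describe(&cfg)); // prints "config: renamed"
}
```

Each signature choice (`&T`, `&mut T`, owned value) encodes an aliasing and lifetime contract that the C original expressed only informally.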
4. Test Case Construction and Validation
Functional correctness is established by rigorous Rust test harnesses:
- All interface functions are covered by ported `#[test]` units.
- Tests may introduce `unsafe` blocks for validation, but the interfaces and code under test must be safe.
- Validation sequentially applies `cargo check` (interface), `cargo build` (post-translation), and `cargo test` (test harness execution) for all implementations.
- Continuous integration automates execution and result reporting.
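A ported test might look like the following sketch (the function under test is hypothetical): a C `assert` is transcribed into a Rust `#[test]` that exercises the safe interface.

```rust
// Safe code under test: a saturating add replacing C's UB-prone
// signed-overflow behavior (hypothetical example function).
pub fn add_sat(a: i32, b: i32) -> i32 {
    a.saturating_add(b)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn overflow_saturates() {
        // C original: assert(add_sat(INT_MAX, 1) == INT_MAX);
        assert_eq!(add_sat(i32::MAX, 1), i32::MAX);
    }

    #[test]
    fn plain_addition() {
        assert_eq!(add_sat(2, 3), 5);
    }
}

fn main() {
    println!("{}", add_sat(i32::MAX, 1) == i32::MAX); // prints "true"
}
```

Because the harness targets only public interface functions, it doubles as a check that the transpiler preserved the required signatures.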
5. Evaluation Protocols and Metrics
A transpiled output (whether from a dedicated tool or an LLM) is considered correct only if three criteria are simultaneously met:
- Interface Equivalence: All signatures, types, and module/file structures exactly match the provided interface.
- Compilation: The generated Rust compiles under `rustc` and passes borrow checking.
- Test Pass: All provided tests pass.
LLM prompting scenarios tested:
- Single-Shot (pass@1): Greedy decode of the entire codebase in one pass.
- Repair Loops: Up to 3 rounds of repair, using compiler (`cargo build`) diagnostics or test-failure outputs.
- Agentic Pipeline: Edit-compile-test loops driven by SWE-agent, with resource-budget termination.
Metrics reported:
- Build Success Rate: % of repos where `cargo build` succeeds.
- Test Pass Rate: % of repos with all tests passing (strict pass@1).
- Repair Improvement: Incremental gain post repair-agent intervention.
6. Experimental Results and Error Taxonomy
CRUST-Bench exposes the limitations of current LLM-based transpilation for C→safe-Rust at the repository level:
| Model | Build (pass@1) | Test (pass@1) | Build (Comp. Repair) | Test (Comp. Repair) | Test (Test Repair) |
|---|---|---|---|---|---|
| o1 | 32% | 15% | 69% | 28% | 37% |
| Claude 3.7 | 26% | 13% | 54% | 23% | 32% |
| Claude 3.5 | 26% | 11% | ... | ... | ... |
| o1 mini | 19% | 9% | ... | ... | ... |
| GPT-4o | 18% | 7% | ... | ... | ... |
Pipelined repair agents (SWE-agent) do not outperform test-repair loops. Notably, even the strongest model (o1+Test repair) achieves only a 37% test pass rate.
Major error modes (based on rustc clustering):
- Type mismatches (e.g., passing `Vec<u8>` for `&[u8]`)
- Borrow errors (e.g., mutable/immutable conflicts)
- Missing imports / out-of-scope variables
- Unimplemented bodies (token exhaustion)
- Missing trait derivations (e.g., `Clone`/`Debug`)
- Argument-count mistakes
- Rogue `unsafe` (explicitly forbidden, and rare due to prompt design)
Repair loops are most effective for type/borrow and trait errors, with diminishing returns on incomplete implementations constrained by token limits.
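The two most common error modes can be illustrated with a small hypothetical snippet, together with the fixes a repair loop typically converges on:

```rust
// Hypothetical examples of a type mismatch and a borrow conflict,
// shown in already-fixed form with the error explained in comments.

fn digest(data: &[u8]) -> u32 {
    data.iter().map(|&b| b as u32).sum()
}

fn main() {
    let buf: Vec<u8> = vec![1, 2, 3];

    // Type mismatch: `digest(buf)` passes `Vec<u8>` where `&[u8]` is
    // expected. The fix is to borrow: deref coercion turns `&Vec<u8>`
    // into `&[u8]`.
    let d = digest(&buf);

    let mut v = vec![10u8];
    // Borrow error: holding an immutable borrow of `v` across `v.push(..)`
    // (a mutable borrow) is rejected. Copying the element out first ends
    // the immutable borrow before the mutation.
    let first = v[0]; // `u8` is `Copy`, so no borrow outlives this line
    v.push(20);

    println!("{} {}", d, first); // prints "6 10"
}
```

Both fixes are local and mechanical, which is consistent with repair loops being most effective on this error class.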
7. Insights, Limitations, and Recommendations
CRUST-Bench reveals that current LLMs and automated transpilers fall significantly short of reliable, idiomatic C→safe-Rust translation at the project level. Key insights:
- Explicit, compiler-validated Rust interfaces anchor the code-generation process.
- Token budget and context limitations frequently truncate multi-file implementations; finer-grained or hierarchical prompting appears necessary.
- Integrating static analysis, such as ownership and lifetime reasoning or compiler IR feedback, into repair loops is likely to reduce type-system errors.
- Pipelined agentic workflows are not inherently superior to targeted repair if lacking deep domain-specific strategies.
CRUST-Bench thus establishes a rigorous, objective baseline for progress in code transpilation, directly connecting advances to software security, maintainability, and the acceleration of legacy systems migration (Khatry et al., 21 Apr 2025).