PR2: Peephole Raw Pointer Rewriting with LLMs for Translating C to Safer Rust

Published 7 May 2025 in cs.SE, cs.AI, and cs.PL | (2505.04852v2)

Abstract: There has been a growing interest in translating C code to Rust due to Rust's robust memory and thread safety guarantees. Tools such as C2RUST enable syntax-guided transpilation from C to semantically equivalent Rust code. However, the resulting Rust programs often rely heavily on unsafe constructs--particularly raw pointers--which undermines Rust's safety guarantees. This paper aims to improve the memory safety of Rust programs generated by C2RUST by eliminating raw pointers. Specifically, we propose a peephole raw pointer rewriting technique that lifts raw pointers in individual functions to appropriate Rust data structures. Technically, PR2 employs decision-tree-based prompting to guide the pointer lifting process. Additionally, it leverages code change analysis to guide the repair of errors introduced during rewriting, effectively addressing errors encountered during compilation and test case execution. We implement PR2 as a prototype and evaluate it using gpt-4o-mini on 28 real-world C projects. The results show that PR2 successfully eliminates 13.22% of local raw pointers across these projects, significantly enhancing the safety of the translated Rust code. On average, PR2 completes the transformation of a project in 5.44 hours, at an average cost of $1.46.


Summary


The paper "PR²: Peephole Raw Pointer Rewriting with LLMs for Translating C to Safer Rust" targets a specific weakness of automated C-to-Rust translation. Tools such as C2Rust produce Rust that is semantically equivalent to the original C but relies heavily on unsafe constructs, notably raw pointers, which undermines Rust's memory and thread safety guarantees. The paper proposes a peephole raw pointer rewriting technique that lifts raw pointers within individual functions to safer Rust data structures, improving the safety of Rust programs generated from C code.
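To make "lifting" a local raw pointer concrete, here is a minimal sketch of the kind of rewrite described above. Both functions and their names are hypothetical illustrations, not code from the paper or from C2Rust output. The first keeps a nullable local raw pointer, as a transpiler might; the second replaces it with `Option<&i32>` and leaves the function signature unchanged, which is what makes the rewrite local to one function.

```rust
/// Transpiler-style version (hypothetical): a nullable local raw pointer
/// records the first negative element, and reading it needs `unsafe`.
pub fn first_negative_raw(values: &[i32]) -> i32 {
    let mut found: *const i32 = std::ptr::null();
    for v in values {
        if *v < 0 {
            found = v as *const i32;
            break;
        }
    }
    if found.is_null() {
        0
    } else {
        // Sound here because `found` points into `values`, which is still borrowed.
        unsafe { *found }
    }
}

/// Lifted version (hypothetical): the nullable pointer becomes `Option<&i32>`,
/// so the null check is enforced by the type and no `unsafe` is needed.
pub fn first_negative_lifted(values: &[i32]) -> i32 {
    let mut found: Option<&i32> = None;
    for v in values {
        if *v < 0 {
            found = Some(v);
            break;
        }
    }
    match found {
        Some(v) => *v,
        None => 0,
    }
}
```

Both versions return the first negative element, or 0 when there is none, so existing callers and tests of the function do not need to change.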

Motivation and Approach

C is widely used in systems software because of its efficiency and low-level control. However, C has no built-in memory safety, which leads to vulnerabilities such as the infamous Heartbleed bug. Rust, by contrast, offers strong memory safety guarantees through its ownership model and borrow checker, prompting a shift towards Rust in many critical software projects. Translating C to Rust by hand is laborious, so automated transpilers are used for the syntax-level conversion. These tools, however, often fall back on unsafe features that bypass Rust's safety checks. This paper advocates automating the elimination of such unsafe constructs with an approach based on Large Language Models (LLMs).

PR² uses decision-tree-based prompting to guide the translation of raw pointers to safer constructs such as Option, Box, Vec, and slices. The translation is intended to preserve the semantics of the original C code while taking advantage of Rust's safety features. The process involves analyzing aliasing facts, buffer shapes, and memory manipulation patterns associated with raw pointers. Because rewriting can introduce errors, PR² uses code change analysis to guide the repair of errors that surface during compilation and test case execution, relying on the LLM for semantic understanding and code generation.
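The sketch below shows one way a decision tree over pointer facts could pick a target type to suggest in a prompt. It is an illustration under stated assumptions, not the paper's actual decision tree: the `PointerFacts` fields, the `Shape` enum, the `suggest_type` function, and the branching order are all hypothetical, chosen only to reflect the kinds of facts mentioned above (aliasing, buffer shape, ownership of memory).

```rust
/// Whether the pointer refers to one object or to a buffer (hypothetical fact).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Shape {
    Single,
    Buffer,
}

/// Facts about one local raw pointer (all fields are illustrative assumptions).
#[derive(Debug, Clone, Copy)]
pub struct PointerFacts {
    pub shape: Shape,
    /// The function allocates and frees the pointed-to memory itself.
    pub owns_allocation: bool,
    /// The pointer is compared against null or assigned null somewhere.
    pub may_be_null: bool,
    /// Another live pointer may mutate the same memory.
    pub aliased_mutably: bool,
}

/// Walks a small hand-written decision tree and returns a suggested Rust type.
pub fn suggest_type(facts: &PointerFacts) -> String {
    // Mutable aliasing conflicts with Rust's borrowing rules, so this sketch
    // conservatively leaves such pointers untouched.
    if facts.aliased_mutably {
        return "*mut T".to_string();
    }
    let base = match (facts.shape, facts.owns_allocation) {
        (Shape::Single, true) => "Box<T>",
        (Shape::Buffer, true) => "Vec<T>",
        (Shape::Single, false) => "&T",
        (Shape::Buffer, false) => "&[T]",
    };
    // Nullable pointers are wrapped in Option so the null case stays explicit.
    if facts.may_be_null {
        format!("Option<{}>", base)
    } else {
        base.to_string()
    }
}
```

In a pipeline of this kind, the returned type string would be embedded in the prompt that asks the LLM to rewrite the function, narrowing the model's choices to a single candidate instead of leaving the choice open.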

Evaluation

The prototype implementation of PR² was evaluated with gpt-4o-mini on 28 real-world C projects, where it eliminated 13.22% of local raw pointers. The rewriting retains the semantic fidelity of the original code, as demonstrated in controlled user studies. Transforming a project takes 5.44 hours on average at an average cost of $1.46, so the monetary cost is low even though the process is not fast. The paper also compares PR² with existing techniques and reports more comprehensive raw pointer elimination than state-of-the-art tools, which it attributes to the broader range of Rust data structures PR² supports.

Implications and Future Work

This research is relevant to software engineering work on memory-safe programming and secure system development. The successful application of LLMs to a code transformation task shows their potential for automating programming language conversion, particularly from unsafe C to safer Rust. Future work may extend the set of supported data structures and rewrite raw pointers that cross function boundaries, which would raise the level of automation further.

Additionally, incorporating more sophisticated techniques such as LLM-assisted differential testing and relational verification could provide stronger correctness guarantees than test-based validation alone. These extensions would refine the translation process and help produce high-quality, maintainable Rust code. Continued improvements in LLM capabilities and inference speed are expected to reduce costs and running time, which would make industrial adoption more practical.
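As a rough illustration of the differential testing idea mentioned above, the sketch below runs an original and a rewritten function on the same inputs and reports the first input on which they disagree. Everything here is a hypothetical example: `first_mismatch`, `len_raw`, and `len_safe` are not from the paper, and the inputs are assumed to be supplied by the caller, whereas an LLM-assisted setup would presumably generate them.

```rust
/// Returns the index of the first input on which the two functions disagree,
/// or `None` if they agree on every input.
pub fn first_mismatch<I, O: PartialEq>(
    original: impl Fn(&I) -> O,
    rewritten: impl Fn(&I) -> O,
    inputs: &[I],
) -> Option<usize> {
    inputs
        .iter()
        .position(|input| original(input) != rewritten(input))
}

/// Raw-pointer version (hypothetical): counts bytes before the first zero,
/// stopping at the end of the buffer.
pub fn len_raw(buf: &[u8]) -> usize {
    let p = buf.as_ptr();
    let mut n = 0;
    // The bounds check runs first, so the pointer read stays inside `buf`.
    while n < buf.len() && unsafe { *p.add(n) } != 0 {
        n += 1;
    }
    n
}

/// Rewritten version (hypothetical): the same computation on a slice iterator.
pub fn len_safe(buf: &[u8]) -> usize {
    buf.iter().position(|&b| b == 0).unwrap_or(buf.len())
}
```

Calling `first_mismatch(|v: &Vec<u8>| len_raw(v), |v: &Vec<u8>| len_safe(v), &inputs)` on a list of byte vectors returns `None` when the rewrite agrees with the original on all of them. Agreement on sampled inputs is evidence, not proof, which is why the text pairs this idea with relational verification.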