Peephole Raw Pointer Rewriting with LLMs for Translating C to Safer Rust
The paper "PR²: Peephole Raw Pointer Rewriting with LLMs for Translating C to Safer Rust" addresses the problem of converting C code into Rust, with a focus on improving memory safety by eliminating raw pointers. Rust provides strong memory and thread safety guarantees, but existing tools such as C2Rust often produce Rust code that relies on unsafe constructs, notably raw pointers, which undermines those guarantees. The paper proposes a peephole raw pointer rewriting technique that lifts raw pointers within individual functions to safer Rust data structures, improving the safety of Rust programs generated from C code.
Motivation and Approach
C is widely used in systems software because of its efficiency and low-level control. However, C lacks built-in memory safety, which leads to vulnerabilities such as the infamous Heartbleed bug. Rust, by contrast, offers strong memory safety guarantees through its ownership model and borrow checker, which has prompted a shift towards Rust in many critical software projects. Translating C code to Rust is a complex task, so automated tools are used for the syntactic conversion. However, these tools often fall back on unsafe features, bypassing Rust's safety checks. The paper advocates automating the elimination of such unsafe constructs with an approach based on large language models (LLMs).
PR² uses decision-tree-based prompting to guide the translation of raw pointers into safer constructs such as Option, Box, Vec, and slices. The translation is meant to preserve the semantics of the original C code while taking advantage of Rust's safety features. To choose a target construct, the process analyzes the aliasing facts, buffer shapes, and memory manipulation patterns associated with each raw pointer. To check semantic correctness, PR² repairs compilation errors and validates the rewritten code against test cases, relying on LLMs for semantic understanding and code generation.
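The kind of function-local rewrite described above can be illustrated with a minimal sketch. This is not code from the paper or output from PR²: the function names (sum_raw, sum_safe, get_or_default_raw, get_or_default_safe) are hypothetical, and the "before" versions only approximate the style of code that a syntactic translator might emit. The sketch shows a pointer-plus-length buffer lifted to a slice and a nullable pointer lifted to Option.

```rust
// Hypothetical "before": a buffer passed as a raw pointer plus a length,
// in the style a syntactic C-to-Rust translator might produce.
unsafe fn sum_raw(buf: *const i32, len: usize) -> i32 {
    let mut total = 0;
    let mut i = 0;
    while i < len {
        // The caller must guarantee that buf points to at least len elements.
        total += unsafe { *buf.add(i) };
        i += 1;
    }
    total
}

// Hypothetical "after": the slice carries its own length, so the bounds
// are known to the compiler and no unsafe code is needed.
fn sum_safe(buf: &[i32]) -> i32 {
    buf.iter().sum()
}

// Hypothetical "before": a nullable pointer with an explicit null check.
unsafe fn get_or_default_raw(p: *const i32) -> i32 {
    if p.is_null() {
        -1
    } else {
        unsafe { *p }
    }
}

// Hypothetical "after": Option makes the "may be absent" case explicit.
fn get_or_default_safe(p: Option<&i32>) -> i32 {
    match p {
        Some(v) => *v,
        None => -1,
    }
}

fn main() {
    let data = vec![1, 2, 3, 4];

    // Both versions should agree on the same input.
    let raw_sum = unsafe { sum_raw(data.as_ptr(), data.len()) };
    assert_eq!(raw_sum, sum_safe(&data));

    let x = 7;
    let raw_some = unsafe { get_or_default_raw(&x) };
    let raw_none = unsafe { get_or_default_raw(std::ptr::null()) };
    assert_eq!(raw_some, get_or_default_safe(Some(&x)));
    assert_eq!(raw_none, get_or_default_safe(None));
}
```

In both pairs the safe signature encodes a fact (buffer length, possible absence) that the raw-pointer version leaves to the caller, which is why information such as buffer shape matters when choosing the target construct.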
Evaluation
The prototype implementation of PR² was evaluated on 28 real-world C projects, where it eliminated 13.22% of raw pointers. The rewriting retains the semantic fidelity of the original code, as demonstrated in controlled user studies. The average transformation time is approximately 5.44 hours per project, at a cost of $1.46 per project, which underscores the approach's efficiency and economic viability. Compared with existing techniques, PR² significantly outperforms state-of-the-art tools in raw pointer elimination, a result attributed to the broader range of Rust data structures it supports.
Implications and Future Work
This research has substantial implications for areas of software engineering concerned with memory-safe programming and secure system development. The successful application of LLMs to code transformation demonstrates their potential for advancing automated programming language conversion, particularly from unsafe C to safer Rust. Future work may extend the set of supported data structures and rewrite raw pointers that cross function boundaries, further raising the level of automation.
Additionally, more sophisticated techniques such as LLM-assisted differential testing and relational verification could provide stronger correctness guarantees than test-driven validation alone. These extensions would further refine the translation process and help produce high-quality, maintainable Rust code. Continued improvements in LLM capabilities and inference speed are expected to reduce costs and increase efficiency, offering a promising outlook for adoption in industrial applications.
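To make the idea of differential testing concrete, the following is a minimal sketch, not the paper's method: it runs a hypothetical raw-pointer function (find_raw) and its hypothetical safe rewrite (find_safe) on the same generated inputs and reports the first input on which they disagree. The input generator is a small hand-written linear congruential generator, used here only to avoid external dependencies; in an LLM-assisted setting the inputs would presumably come from the model instead.

```rust
// Tiny deterministic pseudo-random generator (illustrative only).
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) as u32
    }
}

// Hypothetical original: returns the index of key, or -1 if it is absent.
unsafe fn find_raw(buf: *const i32, len: usize, key: i32) -> i64 {
    let mut i = 0;
    while i < len {
        if unsafe { *buf.add(i) } == key {
            return i as i64;
        }
        i += 1;
    }
    -1
}

// Hypothetical rewrite: same behavior expressed with a slice and Option.
fn find_safe(buf: &[i32], key: i32) -> Option<usize> {
    buf.iter().position(|&x| x == key)
}

// Runs both versions on generated inputs. Returns the first input on
// which the two versions disagree, if any.
fn differential_test(rounds: usize, seed: u64) -> Result<(), (Vec<i32>, i32)> {
    let mut rng = Lcg(seed);
    for _ in 0..rounds {
        let len = (rng.next_u32() % 8) as usize;
        let input: Vec<i32> = (0..len).map(|_| (rng.next_u32() % 4) as i32).collect();
        let key = (rng.next_u32() % 4) as i32;

        let expected = unsafe { find_raw(input.as_ptr(), input.len(), key) };
        // Map the Option result back to the original -1 convention.
        let actual = find_safe(&input, key).map_or(-1, |i| i as i64);

        if expected != actual {
            return Err((input, key));
        }
    }
    Ok(())
}

fn main() {
    match differential_test(1000, 42) {
        Ok(()) => println!("no divergence found"),
        Err((input, key)) => println!("divergence on input {:?} with key {}", input, key),
    }
}
```

Agreement on sampled inputs is evidence of equivalence, not a proof of it, which is why the text pairs differential testing with relational verification.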