Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models

Published 16 Sep 2024 in cs.SE | (2409.10506v1)

Abstract: There is strong motivation to translate C code into Rust code due to the continuing threat of memory safety vulnerabilities in existing C programs and the significant attention paid to Rust as an alternative to the C language. While LLMs show promise for automating this translation by generating more natural and safer code than rule-based methods, previous studies have shown that LLM-generated Rust code often fails to compile, even for relatively small C programs, due to significant differences between the two languages and context window limitations. We propose an LLM-based translation scheme that improves the success rate of translating large-scale C code into compilable Rust code. Our approach involves three key techniques: (1) pre-processing the C code to better align its structure and expressions with Rust, (2) segmenting the code into optimally sized translation units to avoid exceeding the LLM's context window limits, and (3) iteratively compiling and repairing errors while maintaining consistency between translation units using context-supplementing prompts. Compilation success is an essential first step in achieving functional equivalence, as only compilable code can be further tested. In experiments with 20 benchmark C programs, including those exceeding 4 kilo lines of code, we successfully translated all programs into compilable Rust code without losing corresponding parts of the original code.


Summary

The paper introduces an LLM-based translation scheme that segments C code into suitably sized translation units before converting them to Rust.
It leverages pre-processing and iterative repair to align C code with Rust safety features and overcome context window limitations.
Experimental evaluation on 20 benchmarks shows a 31% increase in compilation line coverage and a 24% improvement in element coverage.

Context-aware Code Segmentation for C-to-Rust Translation using LLMs

The paper presents an approach to automatically translating C code into Rust using LLMs, motivated by the persistent problem of memory safety vulnerabilities in C. Rust's growing reputation as a memory-safe systems programming language is prompting organizations to consider translating existing C codebases into Rust.

Problem Statement and Challenges

Translation from C to Rust is complicated by fundamental syntactic and semantic differences between the languages. Traditional rule-based translation methods often generate non-idiomatic Rust code laden with unsafe constructs. LLMs hold promise for producing more idiomatic and safer Rust code. However, their limited context window is a barrier when handling large codebases, and previous studies have reported that LLM-generated Rust code often fails to compile, even for relatively small C programs.

Proposed Methodology

To tackle these challenges, this paper introduces an LLM-based translation scheme comprising three core techniques:

Pre-processing: C code is restructured to align more closely with Rust semantics. Static analysis tools reposition macros, functions, and module definitions to synthesize a more coherent input for the LLM.
Segmentation: C code is divided into translation units of an optimal size based on empirical analysis of LLM context window limits. This segmentation ensures that the code remains within the processing capability of the LLM without degrading translation accuracy.
Iterative Compilation and Repair: Translated Rust code undergoes compilation, with any errors being rectified through LLM-driven iterative repair. This phase also includes context-supplementing prompts to ensure consistency across translation units by storing metadata about function signatures and dependencies.
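As a rough illustration of the segmentation and context-supplementing steps listed above, the Python sketch below packs top-level C definitions into translation units under a line budget, then builds a prompt that carries forward the Rust signatures already produced. The `Definition` type, the `segment` and `build_prompt` functions, the line-count budget, and the prompt wording are all assumptions made for illustration; this summary does not give the paper's actual sizing criterion or prompt format.

```python
from dataclasses import dataclass


@dataclass
class Definition:
    """One top-level C item (function, struct, macro, ...). Hypothetical type."""
    name: str
    source: str

    @property
    def line_count(self) -> int:
        return len(self.source.splitlines())


def segment(definitions: list[Definition], max_lines: int) -> list[list[Definition]]:
    """Greedily pack definitions, in order, into units of at most max_lines lines.

    A single definition longer than max_lines gets a unit of its own
    rather than being split mid-body.
    """
    units: list[list[Definition]] = []
    current: list[Definition] = []
    used = 0
    for d in definitions:
        # Close the current unit if adding this definition would exceed the budget.
        if current and used + d.line_count > max_lines:
            units.append(current)
            current, used = [], 0
        current.append(d)
        used += d.line_count
    if current:
        units.append(current)
    return units


def build_prompt(unit: list[Definition], known_signatures: dict[str, str]) -> str:
    """Prepend already-translated Rust signatures so later units stay consistent.

    known_signatures maps a C name to the Rust signature chosen for it
    in an earlier unit (assumed metadata format).
    """
    context = "\n".join(known_signatures[n] for n in sorted(known_signatures))
    c_code = "\n\n".join(d.source for d in unit)
    return (
        "Already translated Rust signatures:\n" + context
        + "\n\nTranslate this C code to Rust:\n" + c_code
    )
```

With three three-line functions and `max_lines=6`, for example, the first two definitions share a unit and the third starts a new one. Greedy in-order packing is only one possible policy; grouping by call-graph dependencies would be another.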

Experimental Evaluation

The evaluation involved translating 20 benchmark C programs, with line counts reaching as high as 4,484. Noteworthy outcomes include:

Successful translation of all test programs into compilable Rust code, even those over 4,000 lines, demonstrating the robustness of the proposed segmentation approach.
An average compilation line coverage increase of 31% and element coverage improvement of 24% across different LLMs, most significantly with Claude 3.5 Sonnet.

The findings underscore the potential of integrating pre-processing and context-augmenting techniques in LLM-based translation for handling extensive and complex C codebases.

Implications and Future Directions

This research lays a foundation for safer and more scalable code translation. The implications are both practical, in handling large legacy codebases, and theoretical, in refining LLM capabilities for code translation tasks. The iterative repair process shows that LLMs can use compiler errors as feedback to correct their own output.
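The compile-and-repair cycle can be pictured as a bounded feedback loop. The sketch below is a minimal Python illustration, not the paper's implementation: `translate`, `compile_rust`, and `repair` are hypothetical callables standing in for the LLM translation call, a Rust compiler invocation that returns a list of error messages, and the LLM repair call, and the `max_rounds` limit is an assumed parameter.

```python
from typing import Callable


def translate_with_repair(
    c_code: str,
    translate: Callable[[str], str],
    compile_rust: Callable[[str], list[str]],
    repair: Callable[[str, list[str]], str],
    max_rounds: int = 5,
) -> tuple[str, bool]:
    """Translate one unit, then feed compiler errors back until it compiles.

    Returns the final Rust text and whether it compiled. All three
    callables are placeholders supplied by the caller (assumptions).
    """
    rust = translate(c_code)
    for _ in range(max_rounds):
        errors = compile_rust(rust)
        if not errors:
            return rust, True
        # Hand the current code and the compiler's messages back for repair.
        rust = repair(rust, errors)
    # Budget exhausted: report whether the last repair happened to succeed.
    return rust, not compile_rust(rust)
```

Passing the collaborators in as functions keeps the loop independent of any particular model or build tool, so it can be exercised with simple stubs before being wired to real services.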

However, compilation success is only a first step: it makes the translated code testable, but it does not by itself establish functional equivalence with the original C program. Future research could focus on refining the output to make fuller use of Rust's safety features and on verifying run-time correctness, going beyond mere compilability.

The paper points to a promising direction for automating the migration of legacy C codebases to Rust, with LLMs serving as a practical tool for that work. The approach it outlines is a pragmatic step toward more reliable, scalable, and maintainable code translation.
