AlphaTrans: A Neuro-Symbolic Compositional Approach for Repository-Level Code Translation and Validation
Abstract: Code translation transforms programs from one programming language (PL) to another. Several rule-based transpilers have been designed to automate code translation between different pairs of PLs. However, the rules can become obsolete as the PLs evolve, and they do not generalize to other PLs. Recent studies have explored automating code translation with LLMs. One key observation is that such techniques may work well on crafted benchmarks but fail to generalize to the scale and complexity of real-world projects with dependencies, custom types, PL-specific features, etc. We propose AlphaTrans, a neuro-symbolic approach to automate repository-level code translation. AlphaTrans translates both source and test code, and employs multiple levels of validation to ensure the translation preserves the functionality of the source program. To break down the problem for LLMs, AlphaTrans leverages program analysis to decompose the program into fragments and translates them in reverse call order. We used AlphaTrans to translate ten real-world open-source projects consisting of 836 classes, 8575 methods, and 2719 tests. AlphaTrans breaks these projects down into 17874 fragments and translates the entire repository. Of the translated fragments, 96.40% are syntactically correct, and AlphaTrans validates the translations' runtime behavior and functional correctness for 27.03% and 25.14% of fragments, respectively. On average, the integrated translation and validation take 34 hours per project, showing that the approach scales in practice. For incorrect translations, AlphaTrans generates a report that includes the existing translation, stack trace, test errors, or assertion failures. We provided these artifacts to two developers to fix the translation bugs in four projects. They were able to fix the issues in 20.1 hours on average and get all tests passing.
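To illustrate what "reverse call order" means, the following is a minimal sketch, not the authors' implementation. It assumes the call graph is already available as a dictionary mapping each fragment to the fragments it calls. The function name `reverse_call_order` and the fragment names are hypothetical. The abstract does not say how recursive calls are handled, so breaking cycles at the back edge is an assumption of this sketch.

```python
from typing import Dict, List, Set


def reverse_call_order(call_graph: Dict[str, List[str]]) -> List[str]:
    """Order fragments so that each callee comes before its callers.

    Uses a post-order depth-first traversal. Cycles (recursion) are
    broken at the back edge, which is an assumption of this sketch.
    """
    order: List[str] = []
    visited: Set[str] = set()

    def visit(fragment: str) -> None:
        if fragment in visited:
            return
        visited.add(fragment)
        for callee in call_graph.get(fragment, []):
            visit(callee)
        # All reachable callees are already in `order` at this point.
        order.append(fragment)

    # Sort the roots so the result is deterministic.
    for fragment in sorted(call_graph):
        visit(fragment)
    return order


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "Parser.parse": ["Parser.read_token", "Util.trim"],
    "Parser.read_token": ["Util.trim"],
    "Util.trim": [],
}

print(reverse_call_order(call_graph))
# ['Util.trim', 'Parser.read_token', 'Parser.parse']
```

Translating in this order means that when a fragment reaches the LLM, the fragments it depends on have already been translated. For very deep call chains, an iterative traversal would avoid Python's recursion limit.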