
Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models

Published 19 Jan 2024 in cs.CL | (2401.10716v1)

Abstract: Current LLMs tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures inherent in programming languages. In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures. Specifically, we represent programs as parse trees -- also known as concrete syntax trees (CSTs) -- and adapt pre-trained models on serialized CSTs. Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs without changing the model architecture yields improvements over the baseline approach across various code tasks. The improvements are found to be particularly significant when there are limited training examples, demonstrating the effectiveness of integrating program structures with plain-text representation even when working with backbone models that have not been pre-trained with structures.


Summary

  • The paper's main contribution is incorporating concrete syntax trees (CSTs) into pre-trained code models to improve performance in data-scarce environments.
  • It introduces structured pre-training objectives such as masked subtree prediction and masked node prediction to capture both code grammar and semantics.
  • Experimental results demonstrate significant gains in code translation and summarization tasks, validating the efficiency of structured representations.

Summary of "Structured Code Representations Enable Data-Efficient Adaptation of Code LLMs"

Introduction

The paper presents an approach for adapting pre-trained LLMs to code tasks by incorporating structured representations of code, specifically concrete syntax trees (CSTs). It argues that treating source code as plain text overlooks the inherent structural characteristics of programming languages. By using CSTs, models can exploit these well-defined structures for improved performance on code tasks, especially when training data is scarce.

Methodology

Structured Code Representations

Programming languages feature grammar structures that can be translated into CSTs, allowing models to parse programs unambiguously. While CSTs are more verbose than abstract syntax trees (ASTs), they contain all syntactic details necessary for precise program reconstruction. The serialized CSTs can be processed by sequence models without architectural changes, enabling seamless adaptation to structured inputs.

Figure 1: An example Python program along with its CST (simplified for illustration) in the tree and serialized forms, respectively. Also illustrated are the masked subtree prediction and masked node prediction training objectives for adapting pre-trained models to code structures.
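To make the serialized form concrete, the following minimal sketch linearizes a toy CST with a depth-first traversal. This is an illustrative reconstruction, not the paper's implementation: the `Node` class, node labels, and bracketing scheme are assumptions chosen for clarity.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                       # non-terminal type, or a terminal's surface token
    children: List["Node"] = field(default_factory=list)

def serialize(node: Node) -> str:
    """Depth-first serialization: terminals emit their token,
    non-terminals emit a bracketed span over their children."""
    if not node.children:
        return node.label
    inner = " ".join(serialize(c) for c in node.children)
    return f"({node.label} {inner})"

# Hand-built, simplified CST for the statement `x = 1`
cst = Node("assignment", [
    Node("identifier", [Node("x")]),
    Node("="),
    Node("integer", [Node("1")]),
])

print(serialize(cst))  # -> (assignment (identifier x) = (integer 1))
```

Because the bracketed sequence is just text, a standard sequence model can consume it unchanged, which is what allows the paper's adaptation recipe to work without modifying the architecture.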

Pre-Training and Fine-Tuning Objectives

The paper proposes several objectives to adapt existing code models, including masked subtree prediction (MSP), masked node prediction (MNP), and text-to-tree (TeTr) conversion. These objectives help the models learn the interplay between code grammar and semantics, covering both terminal and non-terminal nodes within the CSTs. For decoder-only models, a causal language modeling objective on serialized CSTs is employed.
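The two masking objectives can be sketched on a token-level view of a serialized CST as follows. This is an illustrative reconstruction rather than the paper's code: the mask token, masking rate, and whitespace tokenization are assumptions.

```python
import random
from typing import List, Tuple

def mask_nodes(tokens: List[str], mask_rate: float = 0.15,
               mask_token: str = "<mask>", seed: int = 0) -> Tuple[List[str], List[str]]:
    """Masked node prediction (sketch): randomly replace individual
    node tokens (never the brackets) and collect them as targets."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if tok not in ("(", ")") and rng.random() < mask_rate:
            corrupted.append(mask_token)
            targets.append(tok)
        else:
            corrupted.append(tok)
    return corrupted, targets

def mask_subtree(tokens: List[str], start: int,
                 mask_token: str = "<mask>") -> Tuple[List[str], List[str]]:
    """Masked subtree prediction (sketch): tokens[start] must be '(';
    mask the entire balanced-bracket subtree rooted there."""
    depth = 0
    for i in range(start, len(tokens)):
        depth += tokens[i] == "("
        depth -= tokens[i] == ")"
        if depth == 0:
            end = i + 1
            break
    target = tokens[start:end]
    return tokens[:start] + [mask_token] + tokens[end:], target

toks = "( assignment ( identifier x ) = ( integer 1 ) )".split()
corrupted, target = mask_subtree(toks, 2)
print(" ".join(corrupted))  # -> ( assignment <mask> = ( integer 1 ) )
print(" ".join(target))     # -> ( identifier x )
```

Predicting a masked subtree forces the model to regenerate a grammatically well-formed span, while masked node prediction targets individual grammar symbols and tokens, which matches the paper's aim of teaching both structure and semantics.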

Experimental Setup

Models and Datasets

Two pre-trained models are adapted: CodeT5 (an encoder-decoder model) and CodeGen (a decoder-only model). The studies leverage the CodeSearchNet and The Stack datasets, which cover a comprehensive range of programming languages and annotated code. Continual pre-training on the structured objectives incurs minimal computational overhead relative to the initial pre-training.

Results

Code Translation and Generation Performance

The structured approach demonstrated superior performance over base models across multiple code tasks, including translation, generation, and summarization. In particular, when using only limited labeled data, structured models significantly enhanced CodeBLEU scores and increased the proportion of functionally correct translations, as depicted in the code translation performance results.

Figure 2: Code translation performance. Left two: Java ↔ C# (CodeXGLUE); right two: Java ↔ Python (TransCoder). For full results on all evaluation metrics, see the Appendix.

Code Summarization

The paper shows that structured models significantly outperform their text-based counterparts, achieving higher BLEU scores with fewer training examples and demonstrating the efficiency of structured adaptation for summarization tasks.

Figure 3: Code summarization performance on the CodeSearchNet benchmark.

Implications and Future Directions

The integration of CSTs into code LLMs offers a compelling means to leverage existing structured information, enabling more accurate and efficient code processing. Future advancements might explore further structural enhancements and decoding strategies, such as exploiting additional representations like control flow graphs or data flow models, potentially increasing adaptability across more diverse programming landscapes.

The methodology also suggests avenues in the creation and application of LLMs tailored to domain-specific languages, where data availability and linguistic structure pose unique challenges.

Conclusion

The paper underscores how structured representations can mitigate the limitations of plain-text processing in code LLMs, enabling more robust model adaptation and laying a foundation for further research into leveraging program structure in model design. The demonstrated improvements are promising for both theoretical exploration and practical application in software development contexts.
