- The paper demonstrates that integrating AST structure into Transformer architectures significantly boosts next-token prediction accuracy in code autocomplete systems.
- It introduces and compares several models (SeqTrans, PathTrans, and TravTrans), with TravTrans achieving a 14.1% mean reciprocal rank (MRR) improvement over the prior state-of-the-art Deep3 model.
- The study highlights the practical value of syntactic awareness in Transformers for enhancing IDE performance and motivates further research into structural integration.
The paper "Code Prediction by Feeding Trees to Transformers" contributes to machine learning systems for code prediction, specifically the autocomplete feature in integrated development environments (IDEs). Its primary goal is to improve next-token prediction accuracy by using Transformer-based architectures that are informed by the syntactic structure of the code.
Key Approaches and Models
The authors propose employing the Transformer neural architecture for code prediction, motivated by its established efficacy at modeling long-range dependencies in NLP sequence tasks. They focus on adapting Transformers to exploit the structural information contained in the abstract syntax tree (AST) of the code. The paper introduces several models to explore this hypothesis:
- SeqTrans: This model uses a vanilla Transformer applied directly to serialized source token sequences. It serves as a baseline to demonstrate the effectiveness of Transformers in comparison to recurrent neural network (RNN)-based models.
- PathTrans: This model introduces root paths, i.e., paths from each leaf node to the root of the AST, encoded with an LSTM and integrated into a Transformer. This aims to leverage the syntactic groupings inherent in the code structure.
- TravTrans: By feeding a pre-order traversal of AST nodes to the model, this variant processes the AST as a sequence while implicitly retaining tree relationships in the linearized order. TravTrans significantly outperforms both PathTrans and SeqTrans, demonstrating that preserving AST traversal order in Transformer inputs enhances performance.
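As a rough illustration of the two input representations, the sketch below uses Python's built-in `ast` module to extract a TravTrans-style pre-order node sequence and PathTrans-style leaf-to-root paths from a toy snippet. This is a simplification of the paper's pipeline, which works over py150's serialized ASTs and also handles leaf values, vocabularies, and windowing.

```python
import ast

def preorder(node):
    """Yield node type names in pre-order, a TravTrans-style serialization
    (simplified: the real models also emit leaf values, not just node types)."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

def root_paths(node, prefix=()):
    """Yield (leaf, leaf-to-root path) pairs, a PathTrans-style context
    (a sketch, not the authors' exact featurization)."""
    children = list(ast.iter_child_nodes(node))
    here = prefix + (type(node).__name__,)
    if not children:
        yield type(node).__name__, tuple(reversed(here))
    for child in children:
        yield from root_paths(child, here)

tree = ast.parse("x = foo(1)")
print(list(preorder(tree)))      # pre-order node-type sequence
print(list(root_paths(tree)))    # one leaf-to-root path per leaf
```

In PathTrans each such path would be encoded by an LSTM into a vector that replaces the plain token embedding; in TravTrans the pre-order sequence itself is the Transformer input.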
Additionally, a variant TravTrans+ incorporates detailed structural information between nodes through a path-based matrix added to the attention layers, hinting at further gains when even more AST structure is incorporated.
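The idea of exposing pairwise tree relations to the attention layers can be sketched as follows. This is an illustrative stand-in, not the authors' exact formulation: the paper derives its matrix from the connecting paths between nodes, whereas this sketch simply computes hop-count distances from parent pointers and subtracts a distance-scaled penalty (the hypothetical `decay` scalar) from the attention logits.

```python
import numpy as np

def tree_distance_matrix(parents):
    """Pairwise path lengths between tree nodes, given each node's parent
    index (root has parent -1). A simple stand-in for a path-based matrix."""
    n = len(parents)
    def ancestors(i):                      # chain from node i up to the root
        chain = []
        while i != -1:
            chain.append(i)
            i = parents[i]
        return chain
    anc = [ancestors(i) for i in range(n)]
    depth = {i: len(anc[i]) - 1 for i in range(n)}
    dist = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            # lowest common ancestor = deepest shared ancestor
            lca = max(set(anc[i]) & set(anc[j]), key=lambda k: depth[k])
            dist[i, j] = (depth[i] - depth[lca]) + (depth[j] - depth[lca])
    return dist

def biased_attention(q, k, v, dist, decay=0.1):
    """Scaled dot-product attention with a tree-distance penalty added to
    the logits (illustrative; the paper uses learned path information)."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) - decay * dist
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The effect is that attention between syntactically close nodes is boosted relative to distant ones, which is the intuition behind the TravTrans+ gains.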
Evaluation and Results
The paper benchmarks these models against previous state-of-the-art techniques, including RNN-based (SeqRNN) and decision tree-based (Deep3) models, using the py150 Python dataset and an internal dataset. The results indicate substantial improvements:
- TravTrans achieves a mean reciprocal rank (MRR) improvement of 14.1% over Deep3.
- Overall, TravTrans shows better MRR scores than Code2Seq and PointerMixture while demonstrating marked improvements in handling different token types, such as attribute access, numeric constants, and names.
- The comparative evaluation shows that Transformer models informed by AST structure consistently outperform sequence-based RNN models.
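Mean reciprocal rank, the metric behind the numbers above, averages the reciprocal of the correct token's rank across prediction queries. A minimal sketch (using the common convention of zero credit when the target is absent from the ranked list):

```python
def mean_reciprocal_rank(ranked_predictions, targets):
    """MRR: average of 1/rank of the correct token, rank counted from 1;
    a query contributes 0 if the target is not in its ranked list."""
    total = 0.0
    for preds, target in zip(ranked_predictions, targets):
        if target in preds:
            total += 1.0 / (preds.index(target) + 1)
    return total / len(targets)

# Toy example: three autocomplete queries with top-3 ranked candidates.
preds = [["self", "x", "foo"], ["len", "list", "dict"], ["y", "z", "w"]]
targets = ["self", "list", "q"]
print(mean_reciprocal_rank(preds, targets))  # (1 + 1/2 + 0) / 3 = 0.5
```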
Implications and Future Directions
The paper's findings have implications both theoretically and practically. Theoretically, it underscores the potential for syntactic awareness in Transformer models, encouraging further exploration into integrating ASTs or other structural components in neural architectures. Practically, it suggests pathways for enhancing IDE autocomplete systems, which could lead to more efficient and effective coding practices.
Future research could improve the handling of out-of-vocabulary tokens, a persistent challenge when predicting rare or unseen code tokens, and extend the evaluation to other programming languages to test how far the findings generalize. Moreover, evaluating how the Transformer models interact with other contextual information relevant to code, such as semantic analysis data, could yield additional improvements.
The open-source release of the code and data-preparation pipeline is also a valuable contribution to the community, supporting reproducibility and further research building on the methods explored in this paper.