
AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models

Published 23 Jun 2022 in cs.CL, cs.AI, cs.LG, cs.PL, and cs.SE | (2206.11719v2)

Abstract: The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the linguistic properties of hidden vector spaces, previous works have shown that these pre-trained models encode simple linguistic properties in their hidden representations. However, none of the previous work assessed whether these models encode the whole grammatical structure of a programming language. In this paper, we prove the existence of a syntactic subspace, lying in the hidden representations of pre-trained language models, which contains the syntactic information of the programming language. We show that this subspace can be extracted from the models' representations and define a novel probing method, the AST-Probe, that enables recovering the whole abstract syntax tree (AST) of an input code snippet. In our experiments, we show that this syntactic subspace exists in five state-of-the-art pre-trained language models. In addition, we highlight that the middle layers of the models are the ones that encode most of the AST information. Finally, we estimate the optimal size of this syntactic subspace and show that its dimension is substantially lower than those of the models' representation spaces. This suggests that pre-trained language models use a small part of their representation spaces to encode syntactic information of the programming languages.


Summary

  • The paper introduces AST-Probe to determine if pre-trained models encode full syntactic structures by projecting token embeddings onto a syntactic subspace.
  • The methodology employs an orthogonal projection to isolate AST-related features, achieving high precision in recovering syntax trees across several programming languages.
  • Experimental results reveal that models like GraphCodeBERT and CodeBERT capture syntactic features primarily in middle layers, with optimal subspace dimensions between 64 and 128.

AST-Probe: Recovering Abstract Syntax Trees from Hidden Representations of Pre-trained Language Models

The paper "AST-Probe: Recovering Abstract Syntax Trees from Hidden Representations of Pre-trained Language Models" proposes a novel method to determine whether pre-trained language models of code encode the entire syntactic structure of programs in their hidden representations. It probes these models to extract the abstract syntax tree (AST) associated with a code snippet. The authors introduce the AST-Probe, which identifies a syntactic subspace within the models' representations and uses it to recover full ASTs from embedded code.

Introduction

The application of NLP techniques to source code analysis has led to significant improvements in automating tasks such as code completion, search, and summarization. Pre-trained language models like BERT, GPT, CodeBERT, and others have facilitated these advancements by learning to represent source code meaningfully. Despite these successes, there is still a lack of understanding of which syntactic properties these models actually capture.

AST-Probe is introduced to bridge this gap by testing whether the latent spaces of these models contain a syntactic subspace that encapsulates the full grammatical structure of programming languages. If such a subspace exists, the models capture not only local linguistic cues but also the complex syntax that makes up an AST.

The AST-Probe Approach

The AST-Probe methodology projects token embeddings derived from the language model onto a lower-dimensional space hypothesized to contain the syntactic structure. From these projected embeddings, a reconstructed AST is generated. The methodology revolves around the following components:

Syntactic Subspace Identification

  1. Projection Mechanism: Define an orthogonal projection from the model's representation space onto a hypothesized syntactic subspace S. The vectors in this subspace are expected to retain AST information.
  2. Vector Transformation: Token embeddings undergo this projection, which isolates their syntactic features.

    Figure 1: Visualization of the projection. The dotted blue lines represent the projection P_S.
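The projection step can be sketched in a few lines, assuming the probe has already learned an orthonormal basis B for the subspace (the names `project_to_subspace`, `H`, and `B` are illustrative, not the paper's):

```python
import numpy as np

def project_to_subspace(H, B):
    """Orthogonally project token embeddings onto a learned syntactic subspace.

    H: (n_tokens, d_model) hidden representations from the language model.
    B: (d_model, d_sub) matrix whose columns form an orthonormal basis of the
       hypothesized syntactic subspace S (learned during probe training).
    Returns the projected vectors in subspace coordinates, shape (n_tokens, d_sub).
    """
    return H @ B  # coordinates w.r.t. the basis; H @ B @ B.T gives P_S(H) in R^d_model

# Toy example: a 768-dimensional model space, a 128-dimensional subspace.
rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.normal(size=(768, 128)))  # QR gives orthonormal columns
H = rng.normal(size=(10, 768))                    # 10 token embeddings
S = project_to_subspace(H, B)                     # shape (10, 128)
```

In the actual probe, B is a trainable parameter optimized so that distances in the subspace reflect the gold AST; here it is random only to keep the sketch self-contained.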

AST Recovery

From the syntactic subspace vectors, AST-Probe seeks to reconstruct the AST using geometric properties and learned vector relationships.

Figure 2: Overview of the AST-Probe. The syntactic vectors are obtained using the projection P_S.
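A common way to turn pairwise syntactic distances between adjacent tokens into a tree is to split recursively at the largest distance. The following is a minimal, unlabeled sketch of that decoding idea (the paper's full probe also predicts constituent labels, which are omitted here):

```python
def distances_to_tree(tokens, dists):
    """Rebuild a binary tree from syntactic distances between adjacent tokens.

    tokens: list of n leaf tokens.
    dists:  list of n-1 distances; dists[i] scores the split between
            tokens[i] and tokens[i+1]. A larger distance means a higher split.
    Returns a nested-tuple tree.
    """
    if len(tokens) == 1:
        return tokens[0]
    i = max(range(len(dists)), key=lambda k: dists[k])   # highest split point
    left = distances_to_tree(tokens[:i + 1], dists[:i])
    right = distances_to_tree(tokens[i + 1:], dists[i + 1:])
    return (left, right)

# The largest distance (3.0) sits after "if", so the tree splits there first.
tree = distances_to_tree(["if", "x", ":", "y"], [3.0, 1.0, 2.0])
# tree == ("if", (("x", ":"), "y"))
```

In the AST-Probe, these distances are computed from the projected syntactic vectors, so the quality of the recovered tree directly measures how much AST information the subspace retains.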

Experimental Setup

The evaluation was conducted on five state-of-the-art pre-trained language models: CodeBERT, GraphCodeBERT, CodeT5, CodeBERTa, and RoBERTa, on code written in Python, JavaScript, and Go. The key metrics were precision, recall, and F1-score in recovering ASTs:

  • Baselines vs. Pre-trained Models: A significant gap was observed between the baselines (uncontextualized or randomly initialized representations) and the pre-trained models, verifying the effectiveness of the AST-Probe.
  • Model Comparison: GraphCodeBERT and CodeBERT demonstrated superior performance in capturing AST structures, corroborating their practical efficacy in downstream tasks.
  • Layer Analysis: The middle layers of these models retained the most syntactic information, consistent with trends observed for syntactic learning in NLP models.
  • Syntactic Subspace Dimension: The optimal dimension of the syntactic subspace was consistently found between 64 and 128, indicating that syntactic features are compactly encoded.

    Figure 3: Results of the probe for each model across layers. The x-axis represents the layer number and the y-axis the F1-score. Layer 0 of CodeBERT_rand corresponds to CodeBERT-0.
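Precision, recall, and F1 over recovered trees are typically computed by comparing the token spans of the predicted tree's internal nodes against those of the gold tree. A small sketch of that unlabeled scoring, using the nested-tuple tree representation (this is a standard span-based metric, not necessarily the paper's exact scoring code):

```python
def tree_spans(tree, start=0):
    """Collect the (start, end) token spans of all internal nodes of a nested-tuple tree."""
    if not isinstance(tree, tuple):      # a leaf covers one token
        return set(), 1
    spans, length = set(), 0
    for child in tree:
        child_spans, child_len = tree_spans(child, start + length)
        spans |= child_spans
        length += child_len
    spans.add((start, start + length))   # span covered by this internal node
    return spans, length

def span_f1(pred_tree, gold_tree):
    """Unlabeled precision, recall, and F1 between two trees over the same tokens."""
    pred, _ = tree_spans(pred_tree)
    gold, _ = tree_spans(gold_tree)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Two different binarizations of the same three tokens share only the root span.
p, r, f = span_f1(("a", ("b", "c")), (("a", "b"), "c"))   # all three equal 0.5
```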

Discussion and Implications

The study presents a comprehensive approach to probing and visualizing the syntactic comprehension of programming languages within pre-trained models. The findings suggest that state-of-the-art models encode meaningful AST-related syntactic information, and that they do so compactly.

These insights have implications beyond pure academic interest, as understanding the inner workings of these models can inform better design and fine-tuning strategies for niche applications in automated code analysis and generation. Furthermore, the potential correlation of syntactic understanding with task performance merits exploration.

Conclusion

AST-Probe provides a principled framework to quantitatively analyze the syntactic understanding encoded in pre-trained language models' hidden layers. Future research could expand the diversity of models and languages probed, assess correlations between syntactic encoding and performance on functional tasks, and explore how fine-tuning affects the retention of syntactic information.

The innovation demonstrated by AST-Probe suggests promising directions for further study into the interpretability and optimization of language models in software engineering contexts.
