AST-Masking: Structural Pretraining for Code
- AST-masking is a technique that masks entire syntactic units based on an abstract syntax tree to better align pretraining with code structure.
- It leverages probabilistic masking using Bernoulli draws and a two-step process to meet a specified token mask budget during training.
- Empirical benchmarks show that AST-span masking consistently outperforms both random masking and AST token-level masking on code generation and completion benchmarks such as HumanEval and MBPP.
Abstract Syntax Tree (AST)-masking is a family of pretraining and evaluation techniques in which masking—typically for the purpose of corruption, denoising, or probing in LLMs—is applied according to the hierarchical structure induced by the abstract syntax tree of program code, rather than at the token or character level. AST-masking aligns noising and evaluation procedures with programming language syntax, thereby enabling more faithful modeling, denoising, and assessment of code-specific linguistic and structural capabilities.
1. Formal Definitions and AST-Masking Objectives
In code modeling, let $x_0$ denote a full input sequence, which can be decomposed as $x_0 = (x^{\text{prompt}}, x^{\text{cot}}, x^{\text{code}})$: $x^{\text{prompt}}$ is a natural-language prompt, $x^{\text{cot}}$ is a chain of thought (if present), and $x^{\text{code}}$ is the code region of length $L$ (with sequence length $n$ overall). The code is parsed into an AST $T$, where each node $v \in T$ spans a contiguous token subsequence of length $\ell_v$.
The primary objective of AST-masking is to corrupt sequences by masking entire syntactic units—subtrees or node-induced spans—rather than isolated tokens. At diffusion timestep $t$ with a total mask budget $\epsilon_t \in (0, 1]$, the model aims to mask, in expectation, $\epsilon_t L$ tokens by randomly selecting AST-derived spans. For each candidate span $(s, e)$ with length $\ell = e - s$, a binary indicator $m_{(s,e)} \sim \text{Bernoulli}(p_\ell)$ is sampled, where

$$p_\ell = 1 - (1 - \epsilon_t)^{\ell}.$$

This matches the expected number of masked tokens to $\epsilon_t L$ when measured globally across all spans. The set of masked spans at step $t$ is $\mathcal{M}_t = \{(s, e) : m_{(s,e)} = 1\}$, and tokens covered by $\mathcal{M}_t$ are replaced with a special mask symbol.
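The span-selection probability $p_\ell = 1 - (1 - \epsilon_t)^{\ell}$ can be read as the chance that a span of length $\ell$ would contain at least one masked token under independent per-token masking at rate $\epsilon_t$. A quick Monte Carlo check (an illustrative sketch, not from the cited papers) confirms the closed form:

```python
import random

def p_span(epsilon_t, length):
    # Closed-form probability that a span of `length` tokens contains
    # at least one masked token under i.i.d. token masking at rate epsilon_t.
    return 1 - (1 - epsilon_t) ** length

def p_span_mc(epsilon_t, length, trials=200_000, seed=0):
    # Monte Carlo estimate of the same quantity.
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < epsilon_t for _ in range(length))
        for _ in range(trials)
    )
    return hits / trials

closed = p_span(0.15, 4)       # 1 - 0.85**4 ≈ 0.478
estimate = p_span_mc(0.15, 4)  # should agree with `closed` to ~2 decimals
print(closed, estimate)
```

Because $p_\ell \geq \epsilon_t$ for every $\ell \geq 1$, span-level draws tend to overshoot the token budget, which is why the algorithm in Section 2 stops selecting spans once the budget is reached.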
Alternative AST-masking protocols (e.g., SyntaxEval) operationalize the objective slightly differently: let $\mathcal{T}$ be the set of AST node types. For a code sequence $x$ and a node type $\tau \in \mathcal{T}$, the masking function returns all token positions dominated by nodes of type $\tau$:

$$M_\tau(x) = \{\, i : x_i \text{ is dominated by a node of type } \tau \,\}.$$

The corrupted input replaces tokens at positions in $M_\tau(x)$ with a mask symbol; the task is to predict the masked tokens or subsequences from this input.
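SyntaxEval itself operates over tree-sitter grammars; the idea can be illustrated with Python's built-in `ast` module (a sketch of node-type masking, not the paper's implementation), replacing every source span dominated by a chosen node type:

```python
import ast

def mask_node_type(src, node_type, mask_char="_"):
    # Return `src` with every character span dominated by a node of
    # `node_type` replaced by mask characters (single-line sources only,
    # to keep the column-offset arithmetic simple).
    tree = ast.parse(src)
    chars = list(src)
    for node in ast.walk(tree):
        if isinstance(node, node_type):
            for i in range(node.col_offset, node.end_col_offset):
                chars[i] = mask_char
    return "".join(chars)

print(mask_node_type("x = foo(1, 2) + y", ast.Call))
# → x = _________ + y   (the call expression `foo(1, 2)` is masked)
```

A tree-sitter-based pipeline would work the same way at the token level, using each node's byte range instead of `col_offset`/`end_col_offset`.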
2. AST-Guided Masking Algorithms
The forward corruption process in discrete-diffusion LLMs with AST-masking is given by

$$x_t^i = \begin{cases} \text{[MASK]} & \text{if } m_i = 1, \\ x_0^i & \text{otherwise}, \end{cases}$$

where the per-token mask bit $m_i$ is determined by the span-level Bernoulli draws above. In practice, a two-step masking process is usually implemented: first, mask AST spans using Bernoulli draws with $p_\ell = 1 - (1 - \epsilon_t)^\ell$; then, if the total number of masked tokens falls short of the budget $N = \lfloor \epsilon_t L \rfloor$, randomly mask additional tokens to meet the budget exactly.
AST-guided span masking pseudocode (Zeng et al., 2 Aug 2025):
```python
import random

MASK = "<mask>"

def ASTSpanMasking(x0, AST_spans, epsilon_t):
    """Mask whole AST spans, then top up with random tokens to meet the budget."""
    N = int(epsilon_t * len(x0))          # target number of masked tokens
    m = [0] * len(x0)                     # mask bit vector
    c = 0                                 # tokens masked so far
    random.shuffle(AST_spans)
    for s, e in AST_spans:
        l = e - s
        p = 1 - (1 - epsilon_t) ** l      # span-level Bernoulli probability
        if random.random() < p and all(bit == 0 for bit in m[s:e]):
            m[s:e] = [1] * l
            c += l
            if c >= N:
                break
    if c < N:                             # step 2: top up to the exact budget
        unmasked = [i for i, bit in enumerate(m) if bit == 0]
        for i in random.sample(unmasked, N - c):
            m[i] = 1
    return [MASK if m[i] == 1 else x0[i] for i in range(len(x0))]
```
SyntaxEval employs a similar AST-driven pipeline for evaluation, masking all tokens dominated by a specific AST node type per snippet, and then comparing model behavior under random masking of the same size (Velasco et al., 2024).
3. Loss Functions and Training Dynamics
For diffusion-based LLMs with AST-masking, the primary training loss is a cross-entropy accumulated at masked positions:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\, x_0,\, x_t} \left[ \frac{1}{\epsilon_t} \sum_{i=1}^{n} \mathbf{1}\!\left[ x_t^i = \text{[MASK]} \right] \log p_\theta\!\left( x_0^i \mid x_t \right) \right]$$

(Eq. 2 in (Zeng et al., 2 Aug 2025)). This loss formulation optimizes reconstruction of the underlying sequence from masked (corrupted) variants in which syntactic structure, rather than arbitrary token positions, is disrupted.
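This objective can be sketched in pure Python over toy per-position logits (an illustrative implementation of masked cross-entropy with the $1/\epsilon_t$ scaling, not the paper's training code):

```python
import math

def masked_diffusion_loss(logits, targets, mask_bits, epsilon_t):
    # Cross-entropy accumulated only at masked positions, scaled by
    # 1/epsilon_t as in the masked-diffusion objective.
    total = 0.0
    for logit_row, target, masked in zip(logits, targets, mask_bits):
        if not masked:
            continue  # unmasked positions contribute nothing
        z = max(logit_row)
        log_norm = z + math.log(sum(math.exp(v - z) for v in logit_row))
        total += log_norm - logit_row[target]   # -log p(target)
    return total / epsilon_t

# Toy example: vocabulary of 3, sequence of 4 tokens, two of them masked.
logits = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 2.0], [2.0, 0.1, 0.1]]
targets = [0, 1, 2, 0]
mask_bits = [1, 0, 1, 0]
print(masked_diffusion_loss(logits, targets, mask_bits, epsilon_t=0.5))  # ≈ 1.05
```

The $1/\epsilon_t$ factor compensates for the fact that fewer positions contribute to the sum at low noise levels.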
In evaluation- and probing-centric settings (e.g., SyntaxEval), the model's ability to recover masked nodes—measured by similarity of predicted and reference AST traversal sequences—serves as the primary quantitative metric. The causal effect of masking AST structure versus random masking is estimated using average treatment effect (ATE), computed by propensity-score weighting.
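The ATE estimate can be sketched with a generic inverse-propensity-weighting estimator (the outcomes, treatment assignments, and propensity scores below are toy numbers, not values from the paper):

```python
def ipw_ate(outcomes, treated, propensity):
    # Inverse-propensity-weighted ATE: mean of T*Y/e - (1-T)*Y/(1-e).
    n = len(outcomes)
    return sum(
        (t * y / e) - ((1 - t) * y / (1 - e))
        for y, t, e in zip(outcomes, treated, propensity)
    ) / n

# Toy data: outcome = a similarity metric, treatment = AST-masking (1)
# vs random masking (0); propensity scores assumed precomputed.
y = [0.62, 0.55, 0.80, 0.78, 0.59, 0.81]
t = [1, 1, 1, 0, 0, 0]
e = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(ipw_ate(y, t, e))  # ≈ -0.07: AST-masking lowers similarity
```

A negative ATE under this convention indicates that masking syntax-aligned spans hurts reconstruction more than random masking of the same size.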
4. Empirical Impact and Benchmark Results
AST-masking has been shown to substantially impact model performance on code generation and completion benchmarks. In (Zeng et al., 2 Aug 2025), two leaderboards demonstrate that AST-span masking ("AST Span + ε (Ours)") consistently outperforms both random masking and AST token-level masking across HumanEval and MBPP:
| Model | HumanEval @512 | HumanEval @1024 | MBPP @512 |
|---|---|---|---|
| LLaDA-Instruct | 28.66 | 32.32 | 25.89 |
| + Random Masking | 31.71 | 33.54 | 31.13 |
| + AST Token Masking | 31.71 | 28.66 | 24.51 |
| + AST Span + ε (Ours) | 32.93 | 36.59 | 33.07 |
On HumanEval @1024 tokens, AST-span masking achieves 36.59% pass@1, compared to 33.54% (random) and 28.66% (AST token). For MBPP @512, AST-span masking reaches 33.07% vs. 31.13% (random) and 24.51% (AST token). This demonstrates that structural corruption—preserving syntactic units during the noising process—improves the model's ability to reconstruct grammatically valid code and generalizes better to unseen structures (Zeng et al., 2 Aug 2025).
Conversely, in the evaluation of bidirectional masked LLMs (MLMs) with SyntaxEval, models perform worse when required to reconstruct syntactic (AST) node types rather than random tokens. Mean Jaccard similarity, normalized Levenshtein, and Sørensen–Dice between predicted and true AST traversal sequences are all lower under AST-masked than under randomly masked scenarios. The largest negative ATE observed is for for_statement nodes (Jaccard), indicating a systematic syntactic weakness (Velasco et al., 2024).
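The traversal-level similarity metrics used as outcomes can be sketched as follows (a generic implementation over node-type sequences; treating Jaccard and Dice as set-based, which is one conventional choice, not necessarily the paper's exact formulation):

```python
def jaccard(a, b):
    # Set-based Jaccard similarity over the node types appearing in each traversal.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def sorensen_dice(a, b):
    # Set-based Sørensen–Dice coefficient.
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 1.0

def normalized_levenshtein(a, b):
    # 1 - edit_distance / max_len, via the standard dynamic-programming table.
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

ref  = ["module", "for_statement", "identifier", "block", "call"]
pred = ["module", "if_statement", "identifier", "block", "call"]
print(jaccard(ref, pred), sorensen_dice(ref, pred), normalized_levenshtein(ref, pred))
# → 0.666…  0.8  0.8 (one substituted node type out of five)
```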
5. Methodological Variants: Probing and Causal Analysis
Beyond its role in generative modeling, AST-masking enables rigorous diagnostic and causal evaluation frameworks such as SyntaxEval. This framework operationalizes masking as a "treatment" (masking all tokens for node type ) and compares model behavior to a "control" (random-mask of same size). Outcomes are quantified via structural similarity metrics (e.g., in-order AST traversal Jaccard) and ATE estimates, with confounder adjustment via propensity-scoring to isolate the effect of masking syntax rather than surface features.
SyntaxEval reveals that MLMs pretrained or finetuned with independent token-level corruption exhibit only partial syntactic sensitivity: high recovery performance on random tokens does not translate into robust restoration of masked AST structures. Discrepancies are most prominent for node types with high structural complexity (e.g., control-flow constructs). These findings contrast with probing studies that suggest syntax is readily encoded within internal model representations; SyntaxEval demonstrates that such representations may not support robust reconstruction when evaluated on masked, structure-aligned targets (Velasco et al., 2024).
6. Limitations and Future Directions
The primary empirical studies of AST-masking schemes are currently limited to Python code and span two settings: generative diffusion-based LLMs (Zeng et al., 2 Aug 2025) and bidirectional encoder-only MLMs (Velasco et al., 2024). Decoder-only architectures and statically-typed languages have not yet been fully assessed. Only a subset of node types (typically 11 out of 196 in Python) have been systematically analyzed; higher-order constructs such as comprehensions or exception handlers remain unexplored.
Future research directions include:
- Expanding AST-masking and evaluation to semantic-level code properties (e.g., control-flow, data-flow, code smells).
- Redesigning pretraining objectives to bias models toward semantic and syntactic reconstruction using subtree-level corruption.
- Extending studies to multilingual codebases and large-scale decoder-only models to analyze whether syntactic sensitivity emerges at scale.
- Incorporating human evaluation of syntactic correctness to validate metric-based assessments.
AST-masking thus serves as a foundational tool both for advancing model architectures for code and for probing genuine syntactic competence in code LLMs. Its integration into generative and analysis pipelines is likely to play a central role in future research on code-oriented machine learning.