
AST-Masking: Structural Pretraining for Code

Updated 31 January 2026
  • AST-masking is a technique that masks entire syntactic units based on an abstract syntax tree to better align pretraining with code structure.
  • It leverages probabilistic masking using Bernoulli draws and a two-step process to meet a specified token mask budget during training.
  • Empirical benchmarks show that AST-masking consistently outperforms random and token-level masking in tasks like code generation and completion.

Abstract Syntax Tree (AST)-masking is a family of pretraining and evaluation techniques in which masking—typically for the purpose of corruption, denoising, or probing in LLMs—is applied according to the hierarchical structure induced by the abstract syntax tree of program code, rather than at the token or character level. AST-masking aligns noising and evaluation procedures with programming language syntax, thereby enabling more faithful modeling, denoising, and assessment of code-specific linguistic and structural capabilities.

1. Formal Definitions and AST-Masking Objectives

In code modeling, let $x_0 \in \mathcal{V}^L$ denote a full input sequence, decomposed into three regions: a natural-language prompt $p$, a chain of thought $r$ (if present), and a code region $c$ of length $L^c$ (with overall sequence length $L$). The code is parsed into an AST $T(x_0) = (V, E)$, where each node $v \in V$ spans a contiguous subsequence $[s_v, e_v) \subseteq \{1, \ldots, L\}$ of length $\ell_v = e_v - s_v$.

The primary objective of AST-masking is to corrupt sequences by masking entire syntactic units (subtrees or node-induced spans) rather than isolated tokens. At diffusion timestep $t$ with a total mask budget $\epsilon_t \in [0,1]$, the model aims to mask, in expectation, $\epsilon_t \cdot L$ tokens by randomly selecting AST-derived spans. For each candidate span $v$ with $\ell_v \geq 2$, a binary indicator $z_v \sim \mathrm{Bernoulli}(p_v)$ is sampled, where

$$p_v = 1 - (1-\epsilon_t)^{\ell_v}\,.$$

This matches the expected number of masked tokens to $\epsilon_t \cdot L$ when measured globally across all spans. The set of masked spans at step $t$ is $S_t = \{ v \in V \mid z_v = 1 \}$, and tokens in $\bigcup_{v: z_v=1} [s_v, e_v)$ are replaced with a special $\langle\text{MASK}\rangle$ symbol.
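
As a quick numeric illustration of this selection rule (the values below are illustrative, not taken from the papers), the span-level probability can be computed directly; longer spans are selected as whole units far more often than single tokens:

```python
def span_mask_prob(epsilon_t, span_len):
    # p_v = 1 - (1 - epsilon_t)^{l_v}: probability of masking span v as a unit
    return 1.0 - (1.0 - epsilon_t) ** span_len

# A single-token span reduces to the plain per-token budget:
p1 = span_mask_prob(0.15, 1)   # 0.15
# A 4-token span is selected much more often:
p4 = span_mask_prob(0.15, 4)   # ~0.478
```

Intuitively, $p_v$ is the probability that at least one of the span's $\ell_v$ tokens would have been masked under independent per-token Bernoulli($\epsilon_t$) draws.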

Alternative AST-masking protocols (e.g., SyntaxEval) operationalize the objective slightly differently: let $C$ be the set of AST node types. For a code sequence $s = (s_1, \dots, s_{|s|})$, the masking function $m_C(s; c)$ returns all token positions dominated by nodes of type $c$:

$$m_C(s; c) = \{\, j \mid s_j \text{ is under an AST node of type } c \,\}$$

The corrupted input $\tilde{s}$ replaces tokens at positions in $M = m_C(s; c)$ with a mask symbol; the task is then to predict the masked tokens or subsequences from this input.
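
A minimal sketch of this type-conditioned masking function, using Python's builtin `ast` module as a stand-in for the parser (the papers use their own tokenization; here masking operates on character positions, and `mask_positions` is a hypothetical name):

```python
import ast

def mask_positions(source, node_type):
    """Character positions dominated by AST nodes of the given type (illustrative m_C analogue)."""
    # Cumulative offset of each line start, so (lineno, col) maps to a flat position.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    positions = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, node_type) and hasattr(node, "lineno"):
            start = line_starts[node.lineno - 1] + node.col_offset
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            positions.update(range(start, end))
    return positions

src = "x = 1\nfor i in range(3):\n    x += i\n"
M = mask_positions(src, ast.For)   # every character inside the for-statement
```

Replacing the returned positions with a mask symbol yields the corrupted input for the prediction task.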

2. AST-Guided Masking Algorithms

The forward corruption process in discrete-diffusion LLMs with AST-masking is given by

$$q(x_t \mid x_0, t) = \mathrm{ASTMask}(x_0, S),$$

where $S$ is determined as above. In practice, a two-step masking process is usually implemented: first mask AST spans using Bernoulli draws with probability $p_v$; then, if the total number of masked tokens $c < \lfloor \epsilon_t L \rfloor$, randomly mask additional tokens to meet the exact mask budget.

AST-guided span masking pseudocode (Zeng et al., 2 Aug 2025):

import random

MASK = "<MASK>"  # placeholder mask symbol

def ASTSpanMasking(x0, AST_spans, epsilon_t):
    N = int(epsilon_t * len(x0))            # token mask budget
    m = [0] * len(x0)                       # mask bit vector
    c = 0                                   # tokens masked so far
    random.shuffle(AST_spans)
    for s, e in AST_spans:
        l = e - s
        p = 1 - (1 - epsilon_t) ** l        # span-level Bernoulli probability p_v
        if random.random() < p and all(bit == 0 for bit in m[s:e]):
            m[s:e] = [1] * l                # mask the whole span
            c += l
            if c >= N:
                break
    if c < N:                               # top-up step: mask random leftover tokens
        unmasked = [i for i, bit in enumerate(m) if bit == 0]
        for i in random.sample(unmasked, N - c):
            m[i] = 1
    x_t = [MASK if m[i] == 1 else x0[i] for i in range(len(x0))]
    return x_t

At inference, denoising proceeds from the maximally masked $x_T$ to the original $x_0$ by iteratively predicting cleaner sequences using the learned model $p_\theta(x_{t-1} \mid x_t, t)$.
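
The reverse process can be sketched as a simple loop; `predict_step` below is a toy stand-in for sampling from $p_\theta(x_{t-1} \mid x_t, t)$, not the actual trained model:

```python
def iterative_denoise(x_T, predict_step, T):
    # Reverse process sketch: each call stands in for p_theta(x_{t-1} | x_t, t).
    x_t = list(x_T)
    for t in range(T, 0, -1):
        x_t = predict_step(x_t, t)
    return x_t

# Toy "model" that reveals the leftmost masked token each step.
reference = ["def", "f", "(", ")", ":"]

def toy_model(x_t, t):
    out = list(x_t)
    for i, tok in enumerate(out):
        if tok == "<MASK>":
            out[i] = reference[i]
            break
    return out

x_0 = iterative_denoise(["<MASK>"] * 5, toy_model, T=5)
```

A real denoiser would fill many positions per step according to model confidence, rather than one left-to-right.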

SyntaxEval employs a similar AST-driven pipeline for evaluation, masking all tokens dominated by a specific AST node type per snippet, and then comparing model behavior under random masking of the same size (Velasco et al., 2024).

3. Loss Functions and Training Dynamics

For diffusion-based LLMs with AST-masking, the primary training loss is a cross-entropy reconstruction objective:

$$\mathcal{L}_\text{diff}(\theta) = \mathbb{E}_{x_0 \sim \mathcal{D}}\, \mathbb{E}_{t \sim U(1, T)}\, \mathbb{E}_{x_t \sim q(\cdot \mid x_0, t)} \bigl[\, \mathrm{CE}\bigl(p_\theta(x_{t-1} \mid x_t, t),\, x_{t-1}\bigr) \bigr]$$

(Eq. 2 in (Zeng et al., 2 Aug 2025)). This loss optimizes reconstruction of the underlying sequence from corrupted variants in which syntactic structure, rather than arbitrary token positions, has been disrupted.
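
The cross-entropy term can be sketched in pure Python (toy per-position probability tables; a real implementation would operate on logits with a framework loss function):

```python
import math

def cross_entropy(probs, targets, positions):
    # Mean negative log-likelihood of the target token at each scored position.
    return -sum(math.log(probs[i][targets[i]]) for i in positions) / len(positions)

# Two positions, binary vocabulary {0, 1} (illustrative numbers):
probs = [[0.9, 0.1], [0.5, 0.5]]
targets = [0, 1]
loss = cross_entropy(probs, targets, positions=[0, 1])   # -(ln 0.9 + ln 0.5)/2 ~ 0.399
```

Averaging over the diffusion timestep $t$ and corrupted samples $x_t$ then gives the full training objective.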

In evaluation- and probing-centric settings (e.g., SyntaxEval), the model's ability to recover masked nodes—measured by similarity of predicted and reference AST traversal sequences—serves as the primary quantitative metric. The causal effect of masking AST structure versus random masking is estimated using average treatment effect (ATE), computed by propensity-score weighting.
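
The ATE estimate can be sketched with the textbook inverse-propensity-weighted estimator (this is the generic form under assumed known propensities, not necessarily the paper's exact estimator):

```python
def ipw_ate(outcomes, treated, propensity):
    # Horvitz-Thompson style estimate: E[Y(1)] - E[Y(0)] via inverse-propensity weighting.
    n = len(outcomes)
    treated_mean = sum(y * z / e for y, z, e in zip(outcomes, treated, propensity)) / n
    control_mean = sum(y * (1 - z) / (1 - e) for y, z, e in zip(outcomes, treated, propensity)) / n
    return treated_mean - control_mean

# Toy data: AST-type masking (treatment) lowers the similarity outcome vs. random masking (control).
outcomes   = [0.5, 0.4, 0.8, 0.7]   # e.g. Jaccard similarity per snippet
treated    = [1, 1, 0, 0]           # 1 = AST-masked, 0 = random-masked
propensity = [0.5, 0.5, 0.5, 0.5]   # assignment probability in a balanced design
ate = ipw_ate(outcomes, treated, propensity)   # 0.45 - 0.75 = -0.30
```

A negative ATE, as in the toy numbers above, indicates the model recovers AST-masked content less well than size-matched random masks.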

4. Empirical Impact and Benchmark Results

AST-masking has been shown to substantially impact model performance on code generation and completion benchmarks. In (Zeng et al., 2 Aug 2025), two leaderboards demonstrate that AST-span masking ("AST Span + ε (Ours)") consistently outperforms both random masking and AST token-level masking across HumanEval and MBPP:

Model                  | HumanEval @512 | HumanEval @1024 | MBPP @512
LLaDA-Instruct         | 28.66          | 32.32           | 25.89
+ Random Masking       | 31.71          | 33.54           | 31.13
+ AST Token Masking    | 31.71          | 28.66           | 24.51
+ AST Span + ε (Ours)  | 32.93          | 36.59           | 33.07

On HumanEval @1024 tokens, AST-span masking achieves 36.59% pass@1, compared to 33.54% (random) and 28.66% (AST token). For MBPP @512, AST-span masking reaches 33.07% vs. 31.13% (random) and 24.51% (AST token). This demonstrates that structural corruption—preserving syntactic units during the noising process—improves the model's ability to reconstruct grammatically valid code and generalizes better to unseen structures (Zeng et al., 2 Aug 2025).

Conversely, in the evaluation of bidirectional Masked LLMs (MLMs) with SyntaxEval, models perform worse when required to reconstruct syntactic (AST) node types rather than random tokens. Mean Jaccard similarity, normalized Levenshtein, and Sørensen–Dice between predicted and true AST traversal sequences are all lower under AST-masked vs. randomly masked scenarios. The largest negative ATE observed is $-0.27$ for for_statement nodes (Jaccard), indicating a systematic syntactic weakness (Velasco et al., 2024).

5. Methodological Variants: Probing and Causal Analysis

Beyond its role in generative modeling, AST-masking enables rigorous diagnostic and causal evaluation frameworks such as SyntaxEval. This framework operationalizes masking as a "treatment" (masking all tokens for node type $c$) and compares model behavior to a "control" (a random mask of the same size). Outcomes are quantified via structural similarity metrics (e.g., Jaccard similarity over in-order AST traversals) and ATE estimates, with confounder adjustment via propensity scoring to isolate the effect of masking syntax rather than surface features.

SyntaxEval reveals that MLMs pretrained or finetuned with independent token-level corruption exhibit only partial syntactic sensitivity: high recovery performance on random tokens does not translate into robust restoration of masked AST structures. Discrepancies are most prominent for node types with high structural complexity (e.g., control-flow constructs). These findings contrast with probing studies that suggest syntax is readily encoded within internal model representations; SyntaxEval demonstrates that such representations may not support robust reconstruction when evaluated on masked, structure-aligned targets (Velasco et al., 2024).

6. Limitations and Future Directions

The primary empirical studies of AST-masking schemes are currently limited to Python code and span two settings: generative diffusion-based LLMs (Zeng et al., 2 Aug 2025) and bidirectional encoder-only MLMs (Velasco et al., 2024). Decoder-only architectures and statically typed languages have not yet been fully assessed. Only a subset of node types (typically 11 out of 196 in Python) has been systematically analyzed; higher-order constructs such as comprehensions or exception handlers remain unexplored.

Future research directions include:

  • Expanding AST-masking and evaluation to semantic-level code properties (e.g., control-flow, data-flow, code smells).
  • Redesigning pretraining objectives to bias models toward semantic and syntactic reconstruction using subtree-level corruption.
  • Extending studies to multilingual codebases and large-scale decoder-only models to analyze whether syntactic sensitivity emerges at scale.
  • Incorporating human evaluation of syntactic correctness to validate metric-based assessments.

AST-masking thus serves as a foundational tool both for advancing model architectures for code and for probing genuine syntactic competence in code LLMs. Its integration into generative and analysis pipelines will likely play a central role in future research on code-oriented machine learning.
