Fill-in-the-Middle Tokens

Updated 19 February 2026

Fill-in-the-Middle (FIM) is a training and prompting strategy that uses specialized control tokens to generate an intermediate text span from given prefix and suffix contexts.
It employs markers like <PRE>, <SUF>, and <MID> to partition the input and guide the infilling process in autoregressive models.
Empirical studies show that FIM improves code completion accuracy and aligns with complex developer workflows, and that FIM-aware serving optimizations can reduce latency.

Fill-in-the-Middle (FIM) tokens constitute a paradigm and tokenization strategy that enables autoregressive LLMs—particularly LLMs for code—to generate an arbitrary “middle” span of text given both preceding (prefix) and subsequent (suffix) context. FIM tokenization introduces dedicated control tokens that mark the segments of the input and output, decoupling standard left-to-right text continuation from bidirectional infilling. FIM has become the de facto standard for modern code completion LLMs due to its simplicity, architectural compatibility, and empirical gains in infilling accuracy and developer workflow alignment (Sun et al., 29 Sep 2025, Ding et al., 2024, Sagtani et al., 2024, Gong et al., 2024, Bavarian et al., 2022).

1. Formal Definition and Tokenization Schemes

At its core, FIM takes a contiguous sequence of tokens $x = (x_1, ..., x_n)$, randomly selects a middle subspan $(x_{i}, ..., x_{j-1})$, and partitions the sequence into:

Prefix $P = (x_1, ..., x_{i-1})$
Middle $M = (x_{i}, ..., x_{j-1})$
Suffix $S = (x_{j}, ..., x_n)$

Special sentinel tokens such as <PRE>, <SUF>, and <MID> (notation varies: e.g., <|fim_prefix|>, etc.) are concatenated to organize the prompt:

Prefix–Suffix–Middle (PSM): <PRE> P <SUF> S <MID>
Suffix–Prefix–Middle (SPM): <SUF> S <PRE> P <MID>

The model is then autoregressively trained to generate the middle segment $M$ given the concatenated (prefix, suffix) context. The canonical FIM objective is: $\max_\theta \mathbb{E}_{(P, M, S) \sim D}\; [\; \log P_\theta(M \mid P, S) \;]$ where $D$ denotes the pretraining dataset (Sun et al., 29 Sep 2025, Bavarian et al., 2022, Guo et al., 2024).
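
The partition and the two prompt layouts above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular model's preprocessing code: the function names `split_fim`, `format_psm`, and `format_spm` are hypothetical, the sentinel strings are placeholders for whatever tokens a given tokenizer defines, and the uniform choice of $i$ and $j$ is an assumption (it may produce an empty middle).

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"  # placeholder sentinel names

def split_fim(tokens, rng):
    """Pick 1 <= i <= j <= n + 1 and return (P, M, S) as defined above."""
    n = len(tokens)
    i = rng.randint(1, n + 1)  # randint is inclusive on both ends
    j = rng.randint(i, n + 1)
    # Convert the 1-indexed definition to 0-indexed slices:
    # P = x_1..x_{i-1}, M = x_i..x_{j-1}, S = x_j..x_n
    return tokens[:i - 1], tokens[i - 1:j - 1], tokens[j - 1:]

def format_psm(p, m, s):
    """Prefix-Suffix-Middle: the middle is the continuation after <MID>."""
    return [PRE] + p + [SUF] + s + [MID] + m

def format_spm(p, m, s):
    """Suffix-Prefix-Middle: same target, suffix placed first."""
    return [SUF] + s + [PRE] + p + [MID] + m

if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "def add ( a , b ) : return a + b".split()
    p, m, s = split_fim(tokens, rng)
    print(format_psm(p, m, s))
    print(format_spm(p, m, s))
```

At inference time the same layouts are used without the trailing middle: the prompt ends at <MID> and the model generates $M$.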

Variants such as FIM-SE and AST-FIM introduce further constraints (e.g., line-level, AST-node masking) or additional tokens for character- or structure-aware infilling (Ren et al., 2024, Gong et al., 30 May 2025).

2. Training Paradigms, Objective Functions, and Practical Encoding

FIM pretraining is primarily implemented as a data transformation rather than an architectural change. Models interleave FIM-formatted examples with standard next-token autoregressive (L2R) samples, typically with a FIM rate $p \in [0.5, 0.9]$ (Bavarian et al., 2022, Guo et al., 2024). Each example retains the identical cross-entropy loss structure, with the target restricted to the middle span after sentinel tokens are injected: $\mathcal{L}_{\text{FIM}} = - \sum_{t=1}^{|M|} \log P_\theta(m_t \mid P, S, m_{<t})$, where $m_t$ is the $t$-th token of $M$.

Recent research extends the objective. Horizon-Length Prediction (HLP) augments the standard next-token loss with a regression signal for the remaining infill token budget at each step, explicitly enabling boundary planning (Ding et al., 2024). IFIM further generalizes the input to a quadruplet (prefix, instruction, suffix, middle), with a new <INS> token marking a natural-language developer comment to be leveraged alongside the code context (Sun et al., 29 Sep 2025).
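
The data-transformation view, with a FIM rate and a loss restricted to the middle span, can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: `make_example` and `masked_nll` are hypothetical names, the PSM layout and uniform cut points are assumed, and per-token log-probabilities are passed in as plain floats instead of coming from a model.

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"  # placeholder sentinel names

def make_example(tokens, fim_rate, rng):
    """Return (sequence, loss_mask).

    With probability `fim_rate` the sample is rewritten in PSM order and only
    the middle tokens are marked as loss targets; otherwise it stays a plain
    left-to-right sample where every position is a target.
    """
    tokens = list(tokens)
    if rng.random() >= fim_rate:
        return tokens, [1] * len(tokens)
    # Two distinct cut points, so the middle span is never empty.
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    p, m, s = tokens[:i], tokens[i:j], tokens[j:]
    seq = [PRE] + p + [SUF] + s + [MID] + m
    mask = [0] * (len(seq) - len(m)) + [1] * len(m)
    return seq, mask

def masked_nll(token_log_probs, loss_mask):
    """Negative log-likelihood summed over positions where the mask is 1."""
    return -sum(lp for lp, keep in zip(token_log_probs, loss_mask) if keep)

if __name__ == "__main__":
    rng = random.Random(0)
    seq, mask = make_example("a b c d e f".split(), fim_rate=0.5, rng=rng)
    print(seq)
    print(mask)
```

Because the transformation only reorders tokens and builds a mask, it can sit in a data loader in front of an unmodified decoder-only model.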

Empirical findings consistently show that FIM pretraining does not degrade standard L2R performance up to high FIM rates (even 90%), and that syntax-aware or curriculum-augmented span selection further boosts infilling proficiency, especially for challenging code constructs (Sagtani et al., 2024, Gong et al., 30 May 2025).

3. Evaluation Benchmarks and Syntax-Awareness

FIM-based models are evaluated on code infilling tasks that reflect practical use-cases: single- or multi-line insertion, random-span masking, control flow completion, or API call recovery. Notable public benchmarks include:

HumanEval-infilling: Python code tasks with masked spans, measured by pass@1 (reference solution match/unit test success) (Sun et al., 29 Sep 2025).
SAFIM: Syntax-Aware FIM, segmenting completions into algorithmic block, control-flow, and API call splits with 17,720 multi-language examples (Gong et al., 2024).
Real-FIM-Eval: Derived from real-world git commits, measuring perplexity on real insertion and edit tasks across 12 languages (Gong et al., 30 May 2025).

Syntax-aware post-processing methods, such as AST-guided truncation or incremental parsing with context-sensitive left/right quotients, meaningfully reduce compile errors and ensure syntactic validity of completions. Earley-style incremental parsing with quotient grammars supports syntactically constrained decoding, boosting valid completion rates from ~65% to nearly 90% on challenging corpora (Melcer et al., 2024). AST-based FIM (AST-FIM) masking aligns infill spans with code block or function constructs, rather than arbitrary token ranges, yielding up to +7.4 pp improvements on SAFIM benchmarks (Gong et al., 30 May 2025).
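
As a rough illustration of syntax-aware truncation, the sketch below uses Python's standard `ast` module to cut a completion back to the longest line-wise prefix for which prefix + completion + suffix still parses. It is a simplified stand-in for the cited methods, not a reproduction of them: the function name `truncate_to_valid` is hypothetical, truncation is line-granular, and only Python source is handled.

```python
import ast

def truncate_to_valid(prefix, completion, suffix):
    """Return the longest line-wise prefix of `completion` such that
    prefix + completion + suffix parses as Python, or None if none does."""
    lines = completion.splitlines(keepends=True)
    for k in range(len(lines), 0, -1):
        candidate = "".join(lines[:k])
        try:
            ast.parse(prefix + candidate + suffix)
        except SyntaxError:  # IndentationError is a subclass of SyntaxError
            continue
        return candidate
    return None

if __name__ == "__main__":
    prefix = "def f(x):\n"
    suffix = "print(f(1))\n"
    # The model over-generates and its last line breaks the syntax.
    completion = "    y = x + 1\n    return y\n    )))\n"
    print(truncate_to_valid(prefix, completion, suffix))
```

A parse check of this kind only guarantees syntactic validity; it says nothing about whether the truncated completion passes tests.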

4. Practical Model Engineering, Inference, and Efficiency

FIM tokenization is compatible with standard decoder-only transformers. No changes to attention masks or transformer block structure are necessary; sentinel tokens are simply embedded alongside the regular vocabulary (Bavarian et al., 2022, Guo et al., 2024). For long-context infilling (e.g., 16k token windows in DeepSeek-Coder), rotary position embeddings are extended, and attention optimizations (e.g., FlashAttention v2) and grouped-query attention are applied (Guo et al., 2024).

KV-cache reuse in interactive serving is an efficiency bottleneck due to prompt structure: changes to prefix/suffix typically invalidate each other's cached keys/values. The EFIM prompt transformation freezes the shared prefix and suffix portions, appends new tokens at the end, and incorporates fragment tokenization at training time to ensure accurate subtoken generation. This yields a 52% average reduction in latency and doubles throughput in multi-user settings without infilling degradation (Guo et al., 28 May 2025).
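
To see why prompt layout affects cache reuse, the sketch below assumes a serving stack that can reuse cached keys/values only for the longest common token prefix of two consecutive prompts. This is a common prefix-caching assumption and is not a description of EFIM itself; `shared_prefix_len` and the toy token lists are hypothetical.

```python
def shared_prefix_len(old, new):
    """Number of leading tokens two prompts have in common."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n

if __name__ == "__main__":
    # The user types one more token ("c") at the cursor, i.e. at the end of
    # the prefix; the suffix ("x", "y") is unchanged.
    psm_old = ["<PRE>", "a", "b", "<SUF>", "x", "y", "<MID>"]
    psm_new = ["<PRE>", "a", "b", "c", "<SUF>", "x", "y", "<MID>"]
    spm_old = ["<SUF>", "x", "y", "<PRE>", "a", "b", "<MID>"]
    spm_new = ["<SUF>", "x", "y", "<PRE>", "a", "b", "c", "<MID>"]

    print(shared_prefix_len(psm_old, psm_new))  # 3: suffix entries must be recomputed
    print(shared_prefix_len(spm_old, spm_new))  # 6: only the new token and <MID> are recomputed
```

In this toy case the PSM layout loses the cached suffix as soon as the prefix grows, while a suffix edit would have the mirror-image effect on the SPM layout, which is the mutual invalidation described above.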

At the byte level, tokenization bias (i.e., zero-probability holes induced by prompts ending mid-subword/unit) degrades SPM-style FIM infilling under standard token-level autoregressive LMs. Recent methods recover exact byte-level sampling via the Byte-Token Representation Lemma, marginalizing over all possible token extensions of a prefix, and eliminating bias without model retraining. This correction yields an 18% pass@1 gain on open-coded FIM benchmarks in SPM mode (Phan et al., 2024).
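
The problem of a prompt ending mid-token can be illustrated with a toy example. The sketch below is not the method of Phan et al.; it only shows, under assumed conditions, how a prompt cut inside a word tokenizes differently from the full word, and which vocabulary entries would have to be considered to cover the dangling fragment. The vocabulary, the greedy longest-match tokenizer, and both function names are hypothetical.

```python
VOCAB = {"return", "re", "turn", "r", "e", "t", "u", "n"}  # toy vocabulary

def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a toy vocabulary."""
    tokens, pos = [], 0
    max_len = max(map(len, vocab))
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                tokens.append(piece)
                pos += size
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return tokens

def tokens_covering(fragment, vocab):
    """Vocabulary entries that strictly extend a trailing text fragment."""
    return sorted(t for t in vocab if t.startswith(fragment) and len(t) > len(fragment))

if __name__ == "__main__":
    print(greedy_tokenize("return", VOCAB))  # ['return']
    print(greedy_tokenize("ret", VOCAB))     # ['re', 't']
    print(tokens_covering("ret", VOCAB))     # ['return']
```

A model trained on the canonical tokenization has rarely or never seen 're', 't' followed by the remainder of the word, which is the kind of hole that byte-level marginalization over extending tokens is meant to remove.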

5. Extensions: Instruction Awareness, Reasoning, and Robust Infilling

Base FIM models perform suboptimally when the developer's intent is underspecified by code context alone. Pure FIM-pretrained LLMs tend to ignore embedded natural-language comments, while unstructured instruction tuning destroys FIM alignment. The Instruction-aware FIM (IFIM) paradigm introduces a fourth instruction segment with a dedicated delimiter, synthesizing training data via LLM-generated comments. IFIM fine-tuning (without further model modification) delivers large pass@1 gains on instruction-aware infilling tasks, e.g., +9.0pp on HumanEval-infilling for DeepSeek-Coder (84.6%→93.6%) (Sun et al., 29 Sep 2025), and, critically, does not diminish vanilla FIM code completion in instruction-absent settings.

FIM tokenization has also been extended outside code, e.g., for mathematical reasoning. MathFimer applies FIM to chain-of-thought solutions—randomly hiding intermediate reasoning steps and directing models to interpolate them between visible context. This methodology improves reasoning accuracy across datasets (e.g., +7.43pp on GSM8K, +4.16pp on MATH) and is composable: multi-round expansion of solution chains yields compounding improvements (Yan et al., 17 Feb 2025).

Character-level infilling with sub-token boundary management (FIM-SE) or correct factorization over mask sets in MARIA (Masked and Autoregressive Infilling Architecture) further expand FIM's utility. FIM-SE constrains the boundary alignment to avoid broken tokens at insert points, realizing up to +11.5pp gains on single-line infilling without architectural cost (Ren et al., 2024). MARIA fuses AR and MLM hidden states, with a learned linear head, outperforming discrete diffusion baselines across all mask rates (Israel et al., 9 Feb 2025).

6. Quantitative Effects, Ablation Findings, and Best Practices

Empirical studies have converged on several best practices for FIM tokenization:

Training with a 50–90% FIM example rate yields no L2R degradation, enables robust context modeling, and boosts infilling up to 0.9 pass@100 on single-line code infill (Bavarian et al., 2022, Guo et al., 2024).
Mixing PSM and SPM formats gives maximal flexibility; PSM guards against tokenization bias, whereas SPM provides seamless insertions when byte-level sampling is properly handled (Phan et al., 2024).
Syntax- or curriculum-aware masking via ASTs or hard span mining closes the gap between pretraining and real code editing, improving performance most for small-parameter models (Gong et al., 30 May 2025, Sagtani et al., 2024).
Explicit instruction tokens must be delimited by a standalone marker; inline-comment instructions degrade performance (Sun et al., 29 Sep 2025).
For character-level infilling, always eliminate mid-token boundary prediction by aligning prompts to line/word boundaries (Ren et al., 2024).
Tiny architectural changes (e.g., adding regression heads for horizon prediction or concatenating MLM-AR embeddings) bring significant boundary-awareness and mask-invariant infilling (Ding et al., 2024, Israel et al., 9 Feb 2025).
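
The boundary-alignment practice in the list above can be sketched as a span selector that only cuts at line boundaries, so neither the prefix nor the middle ever ends inside a token. This is a minimal sketch of the general idea, not the FIM-SE procedure; the function name `split_on_lines` and the uniform choice of cut points are assumptions.

```python
import random

def split_on_lines(source, rng):
    """Split `source` into (prefix, middle, suffix) at line boundaries only.

    Two distinct cut points are drawn, so the middle holds at least one line.
    Requires a non-empty `source`.
    """
    lines = source.splitlines(keepends=True)
    i, j = sorted(rng.sample(range(len(lines) + 1), 2))
    return "".join(lines[:i]), "".join(lines[i:j]), "".join(lines[j:])

if __name__ == "__main__":
    code = "def area(w, h):\n    a = w * h\n    return a\n"
    prefix, middle, suffix = split_on_lines(code, random.Random(0))
    print(repr(prefix), repr(middle), repr(suffix), sep="\n")
```

The same selector can feed either the PSM or the SPM layout; only the choice of cut points changes.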

7. Impact, Open Problems, and Research Frontiers

FIM tokenization underpins the rapid progress in code-generating LLMs, supporting real-world IDE workflows, automated patching, and multi-line and reasoning-heavy tasks. It enables bidirectional context utilization with minimal model overhead. Its influence extends into curriculum design for preference optimization (DPO) (Ren et al., 27 Aug 2025), grammar-constrained decoding (Melcer et al., 2024), and efficiency-optimized serving infrastructure (Guo et al., 28 May 2025).

Open challenges include scaling FIM to complex semantic edits, integrating deeper type/data/control-flow signals, developing adaptive span selection for diverse codebases, and parameter-efficient adaptation to new programming languages or mathematical domains. The increasing sophistication of FIM—through grammar, tokenization, curriculum, and auxiliary loss refinements—continues to drive gains well beyond the basic data transformation, making FIM tokens a foundational concept in the contemporary LLM research toolkit.