Dynamic Chunked Attention Masking

Updated 22 January 2026
  • Dynamic chunked attention masking is a sparsity method that partitions sequences into fixed or adaptive blocks, reducing the quadratic complexity of self-attention.
  • It leverages content-aware and position-driven schemes to dynamically generate masks per input, batch, or layer for efficient long-context modeling.
  • This approach benefits speech recognition and language modeling by lowering memory usage and boosting inference speed while maintaining accuracy.

Dynamic chunked attention masking is a class of sparsity-inducing strategies for self-attention that impose block- or chunk-wise constraints on the attention matrix, with chunk structure determined either statically or adaptively per input, per batch, or even per layer. The goal is to reduce the quadratic computational and memory complexity of standard dense attention while enabling efficient, accurate modeling of long input sequences. Recent advances in this area leverage content-driven or position-aware dynamic chunking schemes, enable batchwise masking on heterogeneous-length sequences, and provide neural or semi-heuristic mechanisms to dynamically select attention regions. Dynamic chunked attention masking now underpins scalable training and inference across speech recognition, long-context language modeling, and time-series sequence modeling tasks.

1. Formalism and Rationale for Chunked Attention Masking

Chunked attention masking restricts each token or frame to attend only to positions within defined "chunks"—contiguous or semantically coherent subsequences of the input—rather than the entire sequence. For a sequence of length L, the input is partitioned into n = ⌈L/c⌉ non-overlapping or overlapping core chunks of size c, typically augmented with configurable left and/or right context extensions. The chunked attention mask M is a binary or log-binary matrix, elementwise specifying which key positions are visible to each query index. For dynamic schemes, chunk boundaries, context sizes, and masking patterns can vary per batch, per sequence, or per attention head.
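As a concrete sketch, the fixed-size variant of such a mask can be built as a binary matrix; the function name and numpy formulation below are illustrative, not taken from any of the cited papers:

```python
import numpy as np

def chunked_mask(L, c, left=0, right=0):
    """Binary chunked attention mask: query q (in chunk floor(q / c)) may
    attend to keys inside its own chunk, plus `left` extra keys before the
    chunk and `right` extra keys after it."""
    q = np.arange(L)[:, None]     # query positions, shape (L, 1)
    k = np.arange(L)[None, :]     # key positions, shape (1, L)
    start = (q // c) * c          # first key index of each query's chunk
    lo = start - left             # left edge of the visible window
    hi = start + c + right        # exclusive right edge of the window
    return ((k >= lo) & (k < hi)).astype(np.int8)

mask = chunked_mask(L=8, c=4, left=2, right=0)
```

With c = 4 and a left context of 2, queries in the second chunk (positions 4–7) see keys 2–7, while queries in the first chunk see only keys 0–3.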

This masking reduces the complexity of self-attention from O(L²·d_k) to O(L·c·d_k) or less, depending on context sizes and mask sparsity. The design aims to control the local/global receptive field, balance latency and accuracy, and mitigate padding and resource inefficiencies—challenges that become acute in tasks like long-form automatic speech recognition (ASR) or long-context LLM inference (Le et al., 20 Feb 2025, Xiong et al., 28 Oct 2025, Sharma et al., 2024, Shi et al., 4 Aug 2025, Ju et al., 2021, Swietojanski et al., 2022, Le et al., 21 Feb 2025).
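A back-of-envelope comparison of the two costs makes the savings concrete (this ignores projection and softmax costs; the function names are illustrative):

```python
# Attention cost scales with (number of queries) x (visible keys per query) x (head dim).
def dense_flops(L, d_k):
    return L * L * d_k

def chunked_flops(L, c, d_k, left=0, right=0):
    return L * (left + c + right) * d_k

# Example: L = 8192 tokens, chunk size c = 128, head dim d_k = 64.
ratio = dense_flops(8192, 64) / chunked_flops(8192, 128, 64)
```

For these (hypothetical) settings the speedup ratio is L / c = 64, and adding left/right context shrinks it proportionally.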

2. Dynamic Chunk Boundary and Mask Generation Approaches

Chunk Partitioning Strategies

  • Fixed-size chunks: The sequence is divided into pre-specified, non-overlapping blocks of length c. Each query attends to a local window set by chunk boundaries and context extensions (Ju et al., 2021, Le et al., 20 Feb 2025).
  • Variable (adaptive) chunks: Chunk boundaries are dynamically predicted based on content or contextual signals, e.g., via an MLP boundary detector operating over local key windows (Xiong et al., 28 Oct 2025). Each chunk C_j then collects the indices between predicted boundary positions.

Example (dynamic chunk detection, as in DHSA (Xiong et al., 28 Oct 2025)):

p_i = σ(MLP(h_i)),  where  h_i = [ left context, right context, |k_left − k_right|, k_left ⊙ k_right, cos(k_left, k_right) ]

Non-maximum suppression over the scores p_i yields the chunk endpoints.
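A toy sketch of this detection pipeline is below. The real DHSA detector is a trained MLP over the features listed above; this stand-in instead scores boundaries directly from left/right key-window dissimilarity, and all names, window sizes, and thresholds are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def boundary_scores(keys, w=2):
    """Toy stand-in for the MLP boundary detector: score position i by the
    dissimilarity between mean-pooled left and right key windows of width w."""
    L = len(keys)
    p = np.zeros(L)
    for i in range(1, L - 1):
        k_left = keys[max(0, i - w):i].mean(axis=0)
        k_right = keys[i:min(L, i + w)].mean(axis=0)
        cos = k_left @ k_right / (np.linalg.norm(k_left) * np.linalg.norm(k_right) + 1e-8)
        p[i] = sigmoid(-cos)  # low similarity between windows -> likely boundary
    return p

def nms_boundaries(p, threshold=0.5, min_gap=2):
    """1-D non-maximum suppression: keep peaks above threshold that are at
    least min_gap apart; the surviving indices become chunk endpoints."""
    picked = []
    for i in np.argsort(-p):
        if p[i] >= threshold and all(abs(i - j) >= min_gap for j in picked):
            picked.append(int(i))
    return sorted(picked)

# Two clearly distinct segments -> one boundary at the change point.
keys = np.vstack([np.tile([1.0, 0.0], (4, 1)), np.tile([0.0, 1.0], (4, 1))])
boundaries = nms_boundaries(boundary_scores(keys), threshold=0.4)
```

On this toy input the only surviving peak is position 4, where the key content switches.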

Dynamic Mask Construction

  • Chunked Attention Mask: For chunk i (core positions starting at i·c), with left context l_att and right context r:

M^(i)_{j,t} = 1  if t ∈ [ l_att − j,  l_att − j + (c + r) ],  and 0 otherwise

where j is the query offset within the chunk and t indexes the context + chunk + future positions (Le et al., 20 Feb 2025).

  • Batchwise Masking: For heterogeneous-length batch processing, all chunk masks are computed in a single stride (gather) operation, and nonoverlapping regions and invalid overlaps are masked out—eliminating the need for padding to the longest utterance.
  • Content-aware Masks: Trainable mask layers (e.g., Dynamic Mask Attention) infer, per head, a binary vector of critical positions using scores derived from projected value tensors. Only the top-w tokens (or top-K chunks) per head are made visible, learning both inter- and intra-chunk sparsity adaptively (Shi et al., 4 Aug 2025).
  • Hierarchical Masking: Block- or chunk-level summed mask scores are used to mask entire blocks, further reducing calculations—an approach effective for very long sequences (Shi et al., 4 Aug 2025, Sharma et al., 2024).
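The content-aware top-w selection step can be sketched in a few lines of numpy. Here the per-head scores are given directly; in Dynamic Mask Attention they are derived from projected value tensors, and the function name is hypothetical:

```python
import numpy as np

def topw_mask(scores, w):
    """Per-head content mask: keep only the w highest-scoring key positions
    visible for each head. scores has shape (heads, L)."""
    H, L = scores.shape
    mask = np.zeros((H, L), dtype=np.int8)
    idx = np.argsort(-scores, axis=1)[:, :w]   # indices of the top-w scores per head
    np.put_along_axis(mask, idx, 1, axis=1)    # set those positions visible
    return mask

# Two heads over four key positions; each head keeps its own top-2 keys.
scores = np.array([[0.1, 0.9, 0.3, 0.7],
                   [0.8, 0.2, 0.6, 0.4]])
mask = topw_mask(scores, w=2)
```

Each head ends up with its own sparsity pattern, which is what allows inter- and intra-chunk sparsity to be learned independently per head.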

3. Integration into Transformer Architectures and Streaming Models

Dynamic chunked attention masking mechanisms are implemented as pre-softmax operations on the attention logits in multi-head self-attention layers. Invalid entries are set to a large negative constant (often −∞), so the softmax yields zero weight for masked key positions.
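This pre-softmax masking step can be written directly; a large finite negative constant stands in for −∞, and the names are illustrative:

```python
import numpy as np

def masked_softmax(logits, mask):
    """Standard pre-softmax masking: positions where mask == 0 are pushed to a
    large negative constant so they receive (near-)zero attention weight."""
    neg = -1e9
    masked = np.where(mask.astype(bool), logits, neg)
    # Numerically stable softmax along the key axis.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0]])
mask = np.array([[1, 1, 0]])        # third key position is masked out
attn = masked_softmax(logits, mask)
```

The masked position contributes zero weight, and the remaining weights renormalize to sum to one, exactly as if the masked key were absent.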

In encoders like the Conformer, masking is used in both the attention and convolution modules, ensuring that each chunk only interacts with its defined receptive field (Le et al., 20 Feb 2025). In multi-stage or curriculum designs (e.g., ChunkFormer (Ju et al., 2021)), chunk sizes can grow across layers (progressive expansion), thus enabling the model to learn increasingly global representations.

For streaming and online ASR, dynamic chunked attention masking is combined with time-shifted context or "dynamic right context" masking (DRC): a variable number of future frames can be exposed per chunk, and the masking is adapted on-the-fly according to latency and lookahead constraints (Le et al., 21 Feb 2025). Sampling mask templates during training (variable masking) enables deployment-time configurability between low-latency streaming and high-accuracy offline modes using a single model checkpoint (Swietojanski et al., 2022).
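The variable-masking idea—sampling a mask template per training batch so one checkpoint serves several deployment modes—can be sketched as follows. The template set here is purely hypothetical, not taken from the cited papers:

```python
import random

# Hypothetical (chunk_size, right_context_frames) templates spanning
# low-latency streaming (small chunk, no lookahead) through
# high-accuracy offline decoding (large chunk, more lookahead).
TEMPLATES = [(16, 0), (32, 4), (64, 8), (128, 16)]

def sample_template(rng=random):
    """Draw one masking configuration for the current training batch."""
    return rng.choice(TEMPLATES)

chunk_size, right_context = sample_template()
```

At inference time the same model is simply run with whichever template matches the target latency budget, with no retraining.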

4. Resource and Computational Efficiency

Dynamic chunked attention masking delivers substantial reductions in time and memory complexity compared to dense attention, especially for long and variable-length input sequences.

  • Memory and compute: In ChunkFormer (Le et al., 20 Feb 2025), masked batch processing reduces memory and time usage by more than 3× compared to naive batching, with a similar reduction in FLOPs (e.g., from 73.4 GB to 19.6 GB for mixed-length batches).
  • Long-context scaling: ChunkFormer handles up to 16 hours of audio on an 80 GB GPU (15× the baseline Conformer), with accuracy preserved on short tasks (Le et al., 20 Feb 2025). In LLMs, dynamic masking via FlashAttention with block-level skipping yields up to 9× runtime speedups on real-world, partially-filled mask workloads (Sharma et al., 2024).

The overall cost of mask construction is dominated by local windowing and content scoring operations; per-head top-K selection and blockwise mask aggregation can be GPU-accelerated (Shi et al., 4 Aug 2025, Sharma et al., 2024). Content-adaptive chunk selection with chunk-level upsampling adds minimal additional compute to the overall O(n·w·d_h) cost of sparse attention (Xiong et al., 28 Oct 2025, Shi et al., 4 Aug 2025).

5. Impact on Model Accuracy and Long-Form Sequence Performance

Dynamic chunked masking not only enables efficient scaling but also preserves or improves accuracy on long-form, non-stationary, or irregular sequence data.

  • Speech recognition: In (Le et al., 20 Feb 2025), ChunkFormer reduced long-form ASR WER by 7.7% absolute (Earnings-21 dataset), outperforming FastConformer and vanilla Conformer on long utterances.
  • Language modeling and sequence recall: DHSA (Xiong et al., 28 Oct 2025) exactly matched dense attention in accuracy in needle-in-a-haystack and recall tasks, and exhibited 6–18% relative accuracy gains over static block-sparse baselines on resource-constrained LLMs.
  • Streaming ASR: DRC masking with TSCA (Le et al., 21 Feb 2025) yielded a 13.9% relative WER reduction over fixed chunk masks on Librispeech with no increase in perceived latency.
  • Semi-supervised sequence learning: In spatiotemporal and time-series tasks, dynamic attention-based masking improved accuracy by 1–2 points over random masking and static alternatives, by forcing robustness to occlusion of the most important input regions (Forstenhäusler et al., 14 Apr 2025).

Empirical results consistently show that content- or context-driven mask adaptivity confers significant gains over static chunk/block methods, especially in the presence of long-range dependencies, variable-length inputs, and irregular or non-stationary data (Xiong et al., 28 Oct 2025, Le et al., 21 Feb 2025, Forstenhäusler et al., 14 Apr 2025).

6. Design Trade-offs, Versatility, and Implementation Considerations

Design variables include the chunk size c, the left/right contexts l_att and r, mask construction granularity (token vs. chunk), and mask selection strategy (fixed, variable, hybrid, or content-driven). Larger chunks reduce overhead but may hurt learning on short sequences; more context increases accuracy but at higher memory/compute cost (Le et al., 20 Feb 2025). For long-context LLMs, the chunk size (B) and the number of dynamically selected chunks (K) balance global context and hardware efficiency (Xiong et al., 28 Oct 2025, Shi et al., 4 Aug 2025).

Dynamic chunked masking can be implemented efficiently by:

  • Precomputing per-block mask indicators and leveraging contiguous mask patterns (for batch inference) to avoid unnecessary memory access (Sharma et al., 2024).
  • Adapting mask boundaries per batch/sequence to align with semantic breaks or speech boundaries for increased context fidelity (Xiong et al., 28 Oct 2025).
  • Sampling mask template sets during training, enabling a single model instance to adapt at runtime to different latency/accuracy configurations without retraining ("mask configurability") (Swietojanski et al., 2022).

Implementation is straightforward in modern attention frameworks: masks are simply added to the attention logits prior to softmax, and sparse compute kernels can exploit block or chunk structure to maximize throughput (Le et al., 20 Feb 2025, Sharma et al., 2024, Shi et al., 4 Aug 2025).

7. Extensions and Research Directions

Dynamic chunked attention masking is extending beyond standard ASR and LLM usage into hierarchical sequence modeling, contrastively regularized learning, and structurally adaptive attention mechanisms.

As large-scale sequence modeling continues to expand in length and complexity, dynamic chunked attention masking will remain central in delivering the efficiency, scalability, and adaptability necessary for practical deployment.
