Bidirectional Masking & Temporal/Segmental Encoding

Updated 17 February 2026
  • Bidirectional masking and temporal/segmental encoding are techniques that integrate local details with higher-order dependencies using adaptive attention masks across segments.
  • They have been applied to diverse modalities—improving speech recognition error rates, enhancing language segmentation F1 scores, and boosting dialogue and video generation efficiency.
  • The flexible design of these masking schemes reconciles causal and bidirectional attention strengths, leading to more efficient and scalable architectures for structured data.

Bidirectional masking and temporal/segmental encoding define a set of architectural, methodological, and masking innovations for modeling structured data—speech, language, or video—where representations must capture both local detail and higher-order, mid-length dependencies. Bidirectional masking mechanisms permit information flow from both forward and backward temporal directions (or bidirectional among segments), in contrast to purely causal/left-to-right or fully global schemes, while temporal/segmental encoding imposes explicit or implicit boundaries corresponding to linguistic, semantic, or visual segments. These concepts are realized differently across modalities, either at the level of network pre-training, attention matrix construction, or hybrid masking-control in transformer-like computational graphs.

1. Bidirectional Masking: Principles and Instantiations

Bidirectional masking allows a model to encode or condition on information present both before and after a given position, as opposed to strictly left-to-right (causal) attention. The following variants exemplify the spectrum:

  • Span-masking for language segmentation: The Masked Segmental Language Model (MSLM) uses a custom attention mask $M$ to preclude each token from attending to the next $K$ tokens, enforcing that information about a possible span is derived from its outside (past and future) context, producing a masked context window for segmental modeling (Downey et al., 2021). Here, $M_{i,j} = -\infty$ if $0 < j-i \leq K$, and $M_{i,j} = 0$ otherwise.
  • Temporal/frequency masking for speech: In bidirectional speech encoders, masking is applied simultaneously along time (temporal) and frequency axes; segments of the spectrogram are set to zero, both for contiguous time-frames and frequency bins. The reconstruction loss is then defined on just the (masked) portions. This necessitates that bidirectional LSTM encoders propagate information across missing regions, leveraging both prior and succeeding context (Wang et al., 2020).
  • Block/symmetric masking in Transformers: In multi-scene video generation, Mask$^2$DiT introduces a symmetric binary block mask $M_{\rm sym}$, ensuring bidirectional self-attention within each segment (scene+prompt), but blocking attention across segment boundaries unless explicitly allowed for global structures (e.g., all video tokens may attend mutually) (Qi et al., 25 Mar 2025). Segment-level causal masks $M_{\rm cond}$ further control information flow at the scene level.
  • Hybrid bidirectional/causal alternation: In dialogue LLMs, Intermittent Semi-working Masking (ISM) alternates bidirectional masking for user queries (allowing global context within query turns) with purely autoregressive (causal) masking for corresponding model answers (Lu et al., 2024).
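The MSLM-style span mask above can be sketched directly from its definition (a minimal illustration; the function name and the additive $-\infty$ convention are assumptions, not taken from the MSLM codebase):

```python
import numpy as np

def span_mask(seq_len: int, K: int) -> np.ndarray:
    """Additive attention mask: token i cannot attend to the next K tokens,
    so a candidate span of width up to K is modeled from its outside context."""
    M = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        for j in range(seq_len):
            if 0 < j - i <= K:  # M[i, j] = -inf iff 0 < j - i <= K
                M[i, j] = -np.inf
    return M
```

Adding $M$ to the raw attention logits before the softmax drives the masked positions to zero probability while leaving every other position, past and future, visible.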

2. Temporality and Segmental Encoding Mechanisms

Temporal and segmental encodings refer to architectural or masking-level mechanisms that enforce or exploit the presence of segment-like structure and time-localized patterns:

  • Segment masks and embeddings: In Segment-Based Attention Masking (MAS), prompts are split into blocks (segments), and the masking logic lets tokens attend bidirectionally within their own block and causally to preceding blocks, but not to tokens in subsequent blocks (Katz et al., 2024). Segmental encoding is implicit in mask structure; additional explicit segment embeddings can be introduced but are not required for functionality.
  • Sinusoidal and gated position encodings: MSLM uses sinusoidal positional encodings combined with a learned gating term to augment embeddings, providing both token-level and segment boundary information. The span-based masking ensures that at each location, the transformer can model possible segment boundaries without explicit segment ID input (Downey et al., 2021).
  • Temporal–frequency integration in speech: Contiguous temporal masking prevents speech encoders from relying on strictly local context, thereby forcing the internal states to encode information that bridges across speech segments—such as phonemes or sub-phonetic elements (Wang et al., 2020).
  • Segmental conditioning in video generation: Mask$^2$DiT applies segment-level conditional masks to restrict attention so that, during autoregressive scene extension, only the latest segment attends to itself, while all earlier segments are "locked" and serve as static context. This block-diagonal causal structure (in segment space) replaces traditional tokenwise lower-triangular masking (Qi et al., 25 Mar 2025).
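The symmetric block structure described above can be sketched with a segment-id comparison (a simplified illustration; the `segment_ids` vector and `is_global` flag are assumptions standing in for Mask$^2$DiT's scene/prompt layout, not its actual API):

```python
import numpy as np

def symmetric_block_mask(segment_ids, is_global) -> np.ndarray:
    """Bidirectional attention within a segment; cross-segment attention is
    blocked unless both tokens are flagged "global" (e.g. video tokens that
    must remain mutually visible for temporal consistency)."""
    ids = np.asarray(segment_ids)
    glob = np.asarray(is_global, dtype=bool)
    same_segment = ids[:, None] == ids[None, :]
    both_global = glob[:, None] & glob[None, :]
    allowed = same_segment | both_global
    return np.where(allowed, 0.0, -np.inf)
```

Because the allowed-set relation is symmetric, the resulting mask equals its own transpose, which is exactly the bidirectional (non-causal) property within and across permitted regions.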

3. Masking Strategies and Attention Matrix Construction

Mask design is crucial for controlling model access patterns:

| Method | Mask Structure | Scope of Bidirectionality |
|---|---|---|
| MSLM (Downey et al., 2021) | Span mask $M$ (window of width $K$) | Bidirectional outside the $K$-span |
| Bidirectional speech (Wang et al., 2020) | Time/frequency binary mask | Bidirectional in BiLSTM |
| MAS (Katz et al., 2024) | Block-causal mask | Bidirectional within blocks |
| ISM (Lu et al., 2024) | Alternating mask | Bidirectional (query), causal (answer) |
| Mask$^2$DiT (Qi et al., 25 Mar 2025) | Symmetric block + conditional mask | Bidirectional within segments, causal across |

In practice, segment-based or blockwise masking often induces a non-trivial block structure, in contrast to full or strictly lower-triangular masks. Applying segmental masks in self-attention requires efficient implementation: block lookup, sparse indexing, or groupwise attention to avoid quadratic memory scaling (Qi et al., 25 Mar 2025).

In dialogue systems (ISM), mask values are set so that for a position $j$ in a query segment, tokens can attend to entire prior queries and the prompt ($f(j) = m_{q_k}$). If $j$ is in an answer segment, only causal access is permitted ($f(j) = j$). Implementation entails dynamically constructing $L \times L$ masks tailored to boundary indices at both training and inference (Lu et al., 2024).
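A toy construction of such an alternating mask (the per-token role labels and the helper name are illustrative assumptions; ISM's actual implementation works from turn boundary indices):

```python
import numpy as np

def ism_mask(roles) -> np.ndarray:
    """roles: one of 'q' (query) or 'a' (answer) per token.
    Query tokens attend bidirectionally up to the end of their own query
    turn (plus all prior context); answer tokens remain strictly causal."""
    L = len(roles)
    M = np.full((L, L), -np.inf)
    for i in range(L):
        if roles[i] == "q":
            # extend visibility to the end of the current query segment
            k = i
            while k + 1 < L and roles[k + 1] == "q":
                k += 1
            limit = k
        else:
            limit = i  # causal access only for answer tokens
        M[i, : limit + 1] = 0.0
    return M
```

Because answer rows stay lower-triangular, autoregressive decoding and KV-cache reuse are unaffected; only the query rows gain forward visibility within their own turn.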

4. Empirical Effects and Theoretical Properties

Bidirectional masking and explicit segmental encoding consistently yield improvements in both unsupervised representation learning and downstream supervised tasks—especially in low-resource, multi-turn, or long-context settings. Salient empirical findings include:

  • Speech recognition: Masked-reconstruction pre-training in BiLSTM speech encoders achieves substantial phone error rate (PER), character error rate (CER), and word error rate (WER) reductions over supervised-only baselines (e.g., PER drops from 18.52% to 17.18% on WSJ with time/frequency segment masking; CER drops from 15.23% to 13.29% on si84) (Wang et al., 2020).
  • Language segmentation: MSLM attains a 12-point F1 improvement over unidirectional SLMs on Chinese (PKU) in lightly supervised settings; ablating bidirectionality degrades bits-per-character, demonstrating the benefit of two-sided context for robust segmental modeling (Downey et al., 2021).
  • LLM instruction following and dialog: In multi-turn contexts, ISM raises GPT-4 win-rates and metric scores for both causal and prefix LLMs, with notable improvements for longer dialogues and reduced time-to-first-token latency due to improved KV-cache reuse (linear vs quadratic scaling with context size) (Lu et al., 2024). MAS increases accuracy by 1–7 points across diverse commonsense benchmarks, attributed to enhanced context integration in block-bidirectional prefill (Katz et al., 2024).
  • Video generation with Mask$^2$DiT: Blockwise dual masking achieves both segment-level semantic alignment and visually consistent multi-scene generation, outperforming previous DiT architectures that lack explicit cross-segment temporal conditioning (Qi et al., 25 Mar 2025).

This suggests that bidirectional masking offers both superior latent representation and efficient context utilization compared to purely causal or globally bidirectional masking strategies, particularly where segment structure is inherent to the data.

5. Relation to Traditional Architectures and Masking Schemes

Standard transformer architectures employ either fully bidirectional (BERT-style, $M=\mathbf{1}$ everywhere) or strictly causal (GPT-style, lower-triangular) attention masks. Bidirectional/segmental approaches generalize these:

  • Block/segment masking enables models to mix causal and bidirectional computation in accordance with data structure, such as controlling information flow at segment/chunk boundaries (MAS, Mask$^2$DiT, ISM).
  • Span masking and dynamic attention support per-token or per-span masking associated with hypothesized or observed segmental boundaries, critical for segmentation and unsupervised learning tasks (MSLM, speech pre-training).
  • Temporal/position encoding is handled either by fixed sinusoids, learned positional embeddings, or through the masking mechanism itself, where offset and segment-aware constraints enable models to distinguish between intra- and inter-segment dependencies (Downey et al., 2021, Qi et al., 25 Mar 2025).
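Both extremes, and the blockwise schemes in between, can be expressed with a single segment-id comparison (a sketch; the `segment_ids` input is an assumption for illustration, not an API from any of the cited papers):

```python
import numpy as np

def block_causal_mask(segment_ids) -> np.ndarray:
    """0 where attention is allowed, -inf where blocked: token i attends
    to token j iff j lies in the same segment or an earlier one."""
    ids = np.asarray(segment_ids)
    allowed = ids[None, :] <= ids[:, None]
    return np.where(allowed, 0.0, -np.inf)

# segment_ids = range(L) recovers the GPT-style lower-triangular mask;
# a single shared id recovers the fully bidirectional BERT-style mask.
```

Choosing segment boundaries thus interpolates continuously between the two classical masking regimes, which is the sense in which the segmental schemes above generalize them.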

A plausible implication is that these flexible masking schemes reconcile the strengths of both causal and bidirectional models, providing a principled means to encode local and global structure, while optimizing for both representation quality and computational efficiency.

6. Applications and Extensions Across Modalities

Bidirectional masking and temporal/segmental encoding have major implications across modalities:

  • Speech: Enables robust unit discovery and feature pre-training for low-resource automatic speech recognition, facilitating transfer learning and domain adaptation (Wang et al., 2020).
  • Unsupervised segmentation: Permits direct tokenization of unsegmented language and speech, yielding high-precision outputs in languages lacking whitespace-delimited word boundaries (Downey et al., 2021).
  • Dialogue modeling: Supports efficient, context-aware LLMs capable of multi-turn conversational coherence with reduced computational cost (Lu et al., 2024, Katz et al., 2024).
  • Video generation: Establishes scalable architectures for conditioning or extending long videos as multi-segment compositions, facilitating alignment between temporal segments and external controls (e.g., text prompts) (Qi et al., 25 Mar 2025).

These approaches are extensible to document-level modeling, long-context summarization, structured event prediction, and any application where segmental or hierarchical structure must be respected by the model’s attention or context windows. Continued research is likely to focus on further optimizing mask design and leveraging segment-aware computation for even larger and more complex data streams.
