MRoPE-Interleave: Axis-Interleaved Positional Encoding

Updated 2 February 2026

The paper introduces MRoPE-Interleave, an axis-interleaved rotary positional encoding that balances frequency assignments across temporal, horizontal, and vertical axes.
It applies a cyclic frequency interleaving mechanism that improves positional coherence and gives every axis access to the full frequency range, boosting performance in vision-language, GUI grounding, and compressive imaging tasks.
Empirical evaluations report gains in robustness and accuracy, making MRoPE-I a pragmatic, drop-in enhancement for multimodal transformer models.

MRoPE-Interleave (MRoPE-I) is an axis-interleaved extension of rotary positional encoding designed to provide balanced, expressive, and coherent position representations in models that process multimodal data, specifically text, images, video, and coordinated sensor streams. By interleaving frequency assignments for different positional axes (e.g., temporal, horizontal, vertical), MRoPE-I addresses the bandwidth and expressivity bottlenecks seen in earlier chunked or single-axis approaches. Its applications span vision-language models (VLMs) that reason across long, interleaved contexts, GUI grounding tasks requiring fine spatial precision, and compressive radio interferometric imaging, where high-dimensional data must be reduced efficiently without loss of reconstruction fidelity. Reported results show gains in performance and robustness across a range of benchmarks and architectures (Bai et al., 26 Nov 2025, Huang et al., 27 Oct 2025, Wang et al., 3 Oct 2025, Leblanc et al., 25 Apr 2025).

1. Motivation and Design Principles

Standard Rotary Positional Embedding (RoPE) defines a single positional axis and rotates each pair of query/key channels by an angle proportional to the token's sequence index, using a fixed schedule of frequencies. In multimodal contexts, where tokens carry spatial (height, width), temporal (frame), and sequential (text position) information, 1D RoPE cannot capture the necessary interactions. The original multimodal RoPE (MRoPE) split the embedding dimension into contiguous chunks, each dedicated to one positional axis (e.g., $D_t$, $D_h$, $D_w$), but this led to imbalanced frequency representation: low and high frequencies were confined to separate axes, degrading long-video performance and spatial reasoning (Bai et al., 26 Nov 2025, Wang et al., 3 Oct 2025).

MRoPE-Interleave solves this by cycling the frequency-axis assignments so that every axis shares the full spectrum of rotary frequencies. The three core design desiderata are:

Positional coherence: Each self-attention head encodes axis information consistently and avoids conflicts caused by naive mixing.
Full frequency utilization: All axes utilize the complete frequency range, ensuring broad and fine positional discriminability.
Preservation of textual priors: For pure text tokens, MRoPE-I reduces to vanilla RoPE, maintaining high transferability from pretrained LLMs (Huang et al., 27 Oct 2025).

2. Mathematical Formulation and Implementation

2.1. Axis-Frequency Interleaving

Let $d$ be the (even) dimension of each attention head, on which the rotation acts, yielding $d/2$ frequency pairs. For base frequencies $\omega = [\omega_1, ..., \omega_{d/2}]$ and axis assignment vector $a \in \{t, h, w\}^{d/2}$, the embedding for token position $p = (p_t, p_h, p_w)$ becomes:

$E(p)x = \bigoplus_{j=1}^{d/2} \begin{pmatrix} \cos(\omega_j p_{a_j}) & -\sin(\omega_j p_{a_j}) \\ \sin(\omega_j p_{a_j}) & \cos(\omega_j p_{a_j}) \end{pmatrix} x_{[2j-1:2j]}$

where $a_j$ cycles among the axes: for 3D $(t, h, w)$, $a = [t, h, w, t, h, w, \dots]$, and for 2D $(h, w)$, $a = [h, w, h, w, \dots]$. Frequencies typically follow the standard RoPE schedule $\omega_j = b^{-2(j-1)/d}$ for $j = 1, \dots, d/2$ with base $b = 10000$ (equivalently $b^{-2j/d}$ when $j$ is counted from 0) (Huang et al., 27 Oct 2025, Wang et al., 3 Oct 2025).
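As a minimal sketch of the definitions above, the following builds the frequency schedule, the cyclic axis assignment, and the per-pair rotation angles $\omega_j p_{a_j}$ for a single token. The function names and the dictionary layout for the position are illustrative assumptions, not taken from the cited papers.

```python
def rope_frequencies(d, base=10000.0):
    """omega_j = base^(-2(j-1)/d) for j = 1..d/2, stored 0-indexed."""
    assert d % 2 == 0
    return [base ** (-2.0 * j / d) for j in range(d // 2)]


def interleaved_axes(d, axes=("t", "h", "w")):
    """Cycle the axis labels over the d/2 frequency pairs."""
    return [axes[j % len(axes)] for j in range(d // 2)]


def rotation_angles(pos, d, axes=("t", "h", "w"), base=10000.0):
    """Angle omega_j * p_{a_j} for every frequency pair of one token.

    pos is a dict such as {"t": 2, "h": 5, "w": 7} (hypothetical layout).
    """
    omega = rope_frequencies(d, base)
    assign = interleaved_axes(d, axes)
    return [omega[j] * pos[assign[j]] for j in range(d // 2)]


angles = rotation_angles({"t": 2, "h": 5, "w": 7}, d=12)
print(interleaved_axes(12))  # ['t', 'h', 'w', 't', 'h', 'w']
```

Passing `axes=("h", "w")` gives the 2D pattern $[h, w, h, w, \dots]$.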

2.2. Transformer Attention Integration

MRoPE-I is applied in the same way as canonical RoPE: it rotates Q and K before the scaled dot-product attention. The only modification is that each frequency pair uses the position index of its assigned axis. The attention scores and value projections remain unchanged:

$\text{Attn}(Q_{\text{rot}}, K_{\text{rot}}, V) = \text{softmax}\left( \frac{Q_{\text{rot}} K_{\text{rot}}^\top}{\sqrt{d}} \right) V$
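Because only Q and K are rotated, the attention logit between two tokens depends on their per-axis position offsets rather than on absolute positions, as in standard RoPE. The sketch below checks this numerically for one query/key pair; the helper names `rotate` and `score` and the example vectors are made up for illustration.

```python
import math


def rotate(x, pos, base=10000.0):
    """Interleaved rotation of one vector x at position pos = (t, h, w)."""
    d = len(x)
    out = [0.0] * d
    for j in range(d // 2):
        # Pair j uses axis j % 3 (t, h, w cycling) and frequency base^(-2j/d).
        theta = pos[j % 3] * base ** (-2.0 * j / d)
        c, s = math.cos(theta), math.sin(theta)
        out[2 * j] = x[2 * j] * c - x[2 * j + 1] * s
        out[2 * j + 1] = x[2 * j] * s + x[2 * j + 1] * c
    return out


def score(q, k, q_pos, k_pos):
    """Scaled dot-product logit between a rotated query and a rotated key."""
    q_rot, k_rot = rotate(q, q_pos), rotate(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot)) / math.sqrt(len(q))


q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.1, -0.4, 0.2, 0.6, -0.8, 0.05]
k = [-0.6, 0.4, 1.3, -0.2, 0.8, 0.1, -0.9, 0.5, 0.7, -0.3, 0.25, 1.0]

s1 = score(q, k, (4, 2, 3), (1, 5, 3))
# Shift both tokens by the same amount on every axis: offsets are unchanged.
s2 = score(q, k, (14, 12, 13), (11, 15, 13))
print(abs(s1 - s2) < 1e-9)
```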

Implementation Sketch

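import numpy as np

def interleaved_mrope(x, pos_axis, freq_table):
    """Rotate x of shape (batch, seq_len, d) with axis-interleaved RoPE.

    pos_axis: array of shape (batch, seq_len, 3) holding (t, h, w) per token.
    freq_table: dict mapping 't', 'h', 'w' to that axis's frequencies, so that
        freq_table[axis][j // 3] is the frequency of channel pair j.
    """
    batch, seq_len, d = x.shape
    assert d % 2 == 0
    d2 = d // 2
    x_rot = np.zeros_like(x)
    for idx in range(batch):
        for i in range(seq_len):
            t, h, w = pos_axis[idx, i]
            for j in range(d2):
                axis = 'thw'[j % 3]  # cycle t, h, w over the frequency pairs
                p = {'t': t, 'h': h, 'w': w}[axis]
                f = freq_table[axis][j // 3]
                theta = p * f
                cos_t, sin_t = np.cos(theta), np.sin(theta)
                # Standard 2D rotation of channel pair (2j, 2j+1)
                x0, x1 = x[idx, i, 2*j], x[idx, i, 2*j+1]
                x_rot[idx, i, 2*j] = x0*cos_t - x1*sin_t
                x_rot[idx, i, 2*j+1] = x0*sin_t + x1*cos_t
    return x_rot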

(Bai et al., 26 Nov 2025, Huang et al., 27 Oct 2025)
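The sketch indexes `freq_table[axis][j // 3]`, which implies a per-axis table whose entries line up with the global schedule $\omega_j$. One way to build such a table, assuming the standard base-10000 schedule (the helper names are hypothetical), is:

```python
def rope_frequencies(d, base=10000.0):
    """Global schedule: one frequency per channel pair, 0-indexed."""
    return [base ** (-2.0 * j / d) for j in range(d // 2)]


def build_freq_table(d, base=10000.0, axes=("t", "h", "w")):
    """Split the global schedule so that axis k owns pairs k, k+3, k+6, ..."""
    omega = rope_frequencies(d, base)
    # omega[k::n] picks every n-th frequency starting at offset k.
    return {axis: omega[k::len(axes)] for k, axis in enumerate(axes)}


d = 16
omega = rope_frequencies(d)
freq_table = build_freq_table(d)

# Each lookup used in the sketch recovers the global frequency of pair j.
for j in range(d // 2):
    axis = "thw"[j % 3]
    assert freq_table[axis][j // 3] == omega[j]
```

With this layout every axis receives a subsample of the whole schedule rather than one contiguous band.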

3. Comparative Analysis with Prior Schemes

Scheme	Axis Assignment	Frequency Utilization
RoPE	Single axis (t)	Full for t only
MRoPE	Chunked (t/h/w)	Contiguous, per axis
MHRoPE	Per-head axis	Head axis only
MRoPE-I	Interleaved (t/h/w)	Uniform for all axes

MRoPE-I improves over the original MRoPE by spreading both low and high frequencies across axes, preventing overconcentration on any single axis and supporting superior long-context modeling (Bai et al., 26 Nov 2025, Huang et al., 27 Oct 2025). Multi-Head RoPE (MHRoPE) instead dedicates each head to a single axis, but this leaves dead channels during cross-axis attention; interleaving in MRoPE-I eliminates this inefficiency.
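To make the frequency-utilization column concrete, the sketch below compares the band of frequencies each axis receives under a contiguous split and under interleaving. The equal-sized chunks and $d = 128$ are assumptions chosen for illustration, not the split used by any particular model.

```python
def rope_frequencies(d, base=10000.0):
    return [base ** (-2.0 * j / d) for j in range(d // 2)]


def chunked_axes(n_pairs, axes="thw"):
    """Contiguous, roughly equal chunks (illustrative split only)."""
    size = n_pairs // len(axes)
    return [axes[min(j // size, len(axes) - 1)] for j in range(n_pairs)]


def interleaved_axes(n_pairs, axes="thw"):
    return [axes[j % len(axes)] for j in range(n_pairs)]


def frequency_span(assign, omega, axis):
    """Ratio of the highest to the lowest frequency assigned to one axis."""
    own = [w for a, w in zip(assign, omega) if a == axis]
    return max(own) / min(own)


d = 128
omega = rope_frequencies(d)
chunked = chunked_axes(d // 2)
interleaved = interleaved_axes(d // 2)

for axis in "thw":
    print(axis,
          round(frequency_span(chunked, omega, axis), 1),
          round(frequency_span(interleaved, omega, axis), 1))
```

Under these assumptions each chunked axis covers a max-to-min frequency ratio of roughly 20, while each interleaved axis covers a ratio in the thousands, close to the full range of the schedule.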

4. Empirical Performance and Evaluation

MRoPE-Interleave demonstrates quantifiable improvements in multiple domains:

Vision-language models: Switching from chunked MRoPE to interleaved MRoPE in Qwen3-VL boosts long-context video retrieval (e.g., Needle-in-a-Haystack: 100% accuracy up to 256K tokens; 99.5% at 1M tokens), increases performance on MVBench and VideoMME, and lets an 8B interleaved model match 72B chunked counterparts (Bai et al., 26 Nov 2025).
GUI Grounding: On ScreenSpot benchmarks, interleaved MRoPE alone increases average coordinate accuracy by 0.3–0.5 percentage points over standard MRoPE (e.g., 83.0% vs. 82.5%), with the largest gains when paired with explicit coordinate tokens (Wang et al., 3 Oct 2025).
General Multimodal Benchmarks: In systematic ablations, balanced axis allocation (e.g., t:h:w=24:20:20) yields higher aggregate scores in image, video, and grounding tasks compared to unbalanced splits (Huang et al., 27 Oct 2025).
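A 24:20:20 split over 64 frequency pairs cannot be a pure three-way cycle, which would give 22:21:21. The text here does not spell out the layout, so the following is one plausible sketch: cycle $t, h, w$ until the $h$ and $w$ quotas are filled, and leave the remaining pairs on $t$. The function name and this leftover rule are assumptions.

```python
def sectioned_interleave(sections=(24, 20, 20)):
    """Axis label per frequency pair for a (t, h, w) quota.

    Hypothetical layout: start with every pair on t, then hand every third
    pair to h (offset 1) and to w (offset 2) until their quotas are used up.
    """
    n_t, n_h, n_w = sections
    total = n_t + n_h + n_w
    assert 3 * max(n_h, n_w) <= total, "h/w quotas must fit inside the cycle"
    assign = ["t"] * total
    for axis, offset, count in (("h", 1, n_h), ("w", 2, n_w)):
        for j in range(offset, 3 * count, 3):
            assign[j] = axis
    return assign


assign = sectioned_interleave()
print(assign[:6])    # ['t', 'h', 'w', 't', 'h', 'w']
print(assign[-4:])   # ['t', 't', 't', 't']
```

In this layout the four leftover pairs, which carry the lowest frequencies, all fall on the temporal axis.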

5. Theoretical Properties and Implementation Guidelines

Symmetry Property: By interleaving, each spatial and temporal axis receives frequency support at all resolution scales, ensuring positional bandwidth symmetry (Wang et al., 3 Oct 2025).
Compatibility: MRoPE-I is a parameter-free, drop-in replacement for any RoPE/standard MRoPE layer, preserving language pretraining behavior for pure text tokens.
Hyperparameters: Default base $b=10^4$ and dimension $d$ match pretrained settings; interleave pattern cycles every 2 (h/w) or 3 (t/h/w) axes per frequency channel.
Integration: Apply MRoPE-I at all transformer layers, replacing vanilla RoPE calls. Text-only tokens, whose $t$, $h$, and $w$ indices are all equal, automatically receive standard RoPE rotations (Huang et al., 27 Oct 2025).
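The compatibility claim for text tokens can be checked directly: when all three axis indices equal the sequence index, every frequency pair is rotated by the same angle that 1D RoPE would use. A small sketch, with illustrative function names and an arbitrary test vector:

```python
import math


def rope_1d(x, p, base=10000.0):
    """Vanilla RoPE: every pair j is rotated by p * base^(-2j/d)."""
    d = len(x)
    out = [0.0] * d
    for j in range(d // 2):
        theta = p * base ** (-2.0 * j / d)
        c, s = math.cos(theta), math.sin(theta)
        out[2 * j] = x[2 * j] * c - x[2 * j + 1] * s
        out[2 * j + 1] = x[2 * j] * s + x[2 * j + 1] * c
    return out


def mrope_interleaved(x, pos, base=10000.0):
    """Interleaved MRoPE: pair j uses the index of axis j % 3 of pos = (t, h, w)."""
    d = len(x)
    out = [0.0] * d
    for j in range(d // 2):
        theta = pos[j % 3] * base ** (-2.0 * j / d)
        c, s = math.cos(theta), math.sin(theta)
        out[2 * j] = x[2 * j] * c - x[2 * j + 1] * s
        out[2 * j + 1] = x[2 * j] * s + x[2 * j + 1] * c
    return out


x = [0.5, -0.1, 0.9, 0.3, -0.7, 0.2, 0.4, -0.6]
p = 17  # a text token at sequence index 17 gets (t, h, w) = (17, 17, 17)
text_like = mrope_interleaved(x, (p, p, p))
vanilla = rope_1d(x, p)
print(all(abs(a - b) < 1e-12 for a, b in zip(text_like, vanilla)))
```

For an image or video token with distinct axis indices the two outputs differ, which is where the multimodal position information enters.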

6. Application in Compressive Radio Interferometric Imaging

MRoPE-Interleave generalizes to compressive data acquisition. In radio interferometry:

Random rank-one projections (ROPs) reduce $Q\times Q$ covariances to $P$ projections per batch; temporal interleaving via $M$ random modulations further compresses data over $B$ batches.
This pipeline achieves data reduction: spatial ratio $P/V$, temporal ratio $M/B$, and overall $(PM)/(VB)$, enabling storage of only 1% of classical visibilities at comparable image recovery fidelity when $PM \approx N$ (Leblanc et al., 25 Apr 2025).
Noise whitening is guaranteed, with the Gramian of the measurement operator unitary after axis-frequency interleaving.
Reconstruction quality (SNR/logSNR) from MRoPE-I matches classical schemes provided $PM \gtrsim N$; computational cost is reduced by over an order of magnitude (Leblanc et al., 25 Apr 2025).
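The three reduction ratios above combine multiplicatively. As a worked illustration with made-up values for $P$, $V$, $M$, and $B$ (not figures from the cited paper), a tenfold spatial reduction and a tenfold temporal reduction together give the 1% storage figure:

```python
def compression_ratios(P, V, M, B):
    """Spatial, temporal, and overall ratios P/V, M/B, and (PM)/(VB)."""
    spatial = P / V
    temporal = M / B
    overall = (P * M) / (V * B)
    return spatial, temporal, overall


# Hypothetical sizes chosen only to make the arithmetic easy to follow.
spatial, temporal, overall = compression_ratios(P=100, V=1000, M=10, B=100)
print(spatial, temporal, overall)  # 0.1 0.1 0.01
```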

7. Limitations and Future Directions

MRoPE-I requires manual allocation of axis-frequency assignments, with empirically robust splits (e.g., 24:20:20) but no guarantee of optimality for all task structures (Huang et al., 27 Oct 2025). The pattern is static; adaptive or learned axis assignments and dynamic frequency reallocation represent open areas for research. MRoPE-I performs well in 2D/3D modalities, but extensions to arbitrarily structured data (e.g., multi-sensor fusion, higher-order tensor contexts) may require further innovations (Huang et al., 27 Oct 2025).

8. Summary

MRoPE-Interleave systematically rebalances positional encoding bandwidth across multiple axes by cycling frequency assignments, directly enhancing the cross-modal and long-context expressivity of attention-based models. It achieves consistent empirical gains in VLMs, GUI automation, and data-compressed scientific imaging, with minimal implementation complexity and no loss of backward compatibility. Its principled approach to frequency-axis coupling robustly solves limitations inherent in predecessor schemes and paves the way for scalable, multimodal sequence modeling (Bai et al., 26 Nov 2025, Huang et al., 27 Oct 2025, Wang et al., 3 Oct 2025, Leblanc et al., 25 Apr 2025).


References (4)

Qwen3-VL Technical Report (2025)

Revisiting Multimodal Positional Encoding in Vision-Language Models (2025)

Improving GUI Grounding with Explicit Position-to-Coordinate Mapping (2025)

MROP: Modulated Rank-One Projections for compressive radio interferometric imaging (2025)
