Do multiple alternations in hybrid attention–GDN stacks increase expressivity?

Establish whether fixed-depth hybrid language models that alternate between self-attention layers and Gated DeltaNet (with negative eigenvalues) layers gain strictly more expressive power when they include more than one alternation between the two layer types (e.g., attention→GDN→attention) than architectures with only a single alternation (e.g., attention→GDN or GDN→attention), under the paper's log-precision arithmetic setting and standard complexity-theoretic assumptions.
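To make the notion of "number of alternations" concrete, here is a minimal sketch; the layer labels and helper function are illustrative and not from the paper, which only distinguishes architectures by how many times the layer type switches along the stack:

```python
# Illustrative only: layer labels "attn"/"gdn" and this helper are
# assumptions for exposition, not the paper's formalism.

def count_alternations(layers):
    """Count adjacent positions in the stack where the layer type switches."""
    return sum(1 for a, b in zip(layers, layers[1:]) if a != b)

single = ["gdn", "gdn", "attn", "attn"]  # GDN -> attention: one alternation
multi = ["attn", "gdn", "attn"]          # attention -> GDN -> attention: two

print(count_alternations(single))  # 1
print(count_alternations(multi))   # 2
```

Under this counting, Theorem 1's construction needs only a stack with one alternation; the open question is whether stacks with two or more alternations can express strictly more.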

Background

The paper proves that a single alternation between Gated DeltaNet (with negative eigenvalues) and attention layers suffices to solve the state-based recall task, which neither pure transformers nor pure linear RNNs can solve under standard assumptions.

After establishing this separation (Theorem 1), the authors note that while one alternation suffices for the exhibited task, it remains unknown whether allowing more alternations would further increase expressivity beyond the single-alternation case.

References

It is an open question whether multiple alternations between layer types buy more expressivity than having just one alternation.

Olmo Hybrid: From Theory to Practice and Back  (2604.03444 - Merrill et al., 3 Apr 2026) in Section 3.3 (Hybrid Models are More Than the Sum of Their Parts)