Do multiple alternations in hybrid attention–GDN stacks increase expressivity?
Establish whether fixed-depth hybrid language models that alternate between self-attention layers and Gated DeltaNet layers (with negative eigenvalues) gain strictly more expressive power when they include more than one alternation between the two layer types (e.g., attention→GDN→attention) than when they include only a single alternation (e.g., attention→GDN or GDN→attention), under the paper's log-precision arithmetic setting and standard complexity-theoretic assumptions.
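To make the notion of an alternation concrete, here is a minimal sketch (not from the paper) that counts layer-type switches in a fixed-depth stack; the labels "attn" and "gdn" and the helper count_alternations are illustrative assumptions, not identifiers from the source.

```python
from typing import Sequence

def count_alternations(layer_types: Sequence[str]) -> int:
    """Count switches between consecutive layer types in a fixed-depth stack.

    Adjacent same-type layers (e.g., two attention layers in a row) add no
    alternation; only a change of layer type does.
    """
    return sum(1 for a, b in zip(layer_types, layer_types[1:]) if a != b)

# One alternation: all attention layers, then all GDN layers.
single = ["attn", "attn", "gdn", "gdn"]
# Two alternations: attention -> GDN -> attention.
multiple = ["attn", "gdn", "attn", "attn"]

assert count_alternations(single) == 1
assert count_alternations(multiple) == 2
```

Under this counting, the question asks whether any stack with two or more alternations can separate languages (or circuit classes) that no single-alternation stack of the same depth can, given log-precision arithmetic.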
References
It is an open question whether multiple alternations between layer types buy more expressivity than having just one alternation.
— Olmo Hybrid: From Theory to Practice and Back
(2604.03444 - Merrill et al., 3 Apr 2026) in Section 3.3 (Hybrid Models are More Than the Sum of Their Parts)