
ModernALBERT: Compact Recursive Transformer

Updated 21 December 2025
  • ModernALBERT is a compact recursive transformer that enhances ALBERT by integrating Mixture-of-LoRA for token-conditional weight modulation.
  • Its innovative MoL mechanism employs expert routing and low-rank adapters to selectively augment shared FFN layers, ensuring efficient conditional computation.
  • Empirical evaluations demonstrate that ModernALBERT outperforms larger and fully parameterized models on GLUE, SQuAD-v2, and BEIR benchmarks despite its sub-120M parameter footprint.

ModernALBERT is a compact recursive transformer architecture that integrates a Mixture-of-LoRAs (Low-Rank Adaptation) mechanism with modern transformer enhancements, aiming to address the expressivity limitations imposed by the aggressive parameter sharing characteristic of the ALBERT model class. The design achieves state-of-the-art performance among language models under 120M parameters, utilizing token-conditional expert modulation within a highly parameter-efficient framework (Nouriborji et al., 14 Dec 2025).

1. Architectural Design and Model Variants

ModernALBERT employs recursive parameter sharing, following ALBERT: a single self-attention block and a single feed-forward network (FFN) are applied $N$ times to achieve depth, reducing the count of unique parameters by a factor of $N$ relative to a fully parameterized model. Four public variants are defined by recursion depth $N$, hidden dimension $d$, and FFN width $4d$:

| Variant | Recursion Depth $N$ | Hidden Dim $d$ | FFN Width $4d$ |
|---------|--------------------:|---------------:|---------------:|
| Tiny    | 7                   | 1,152          | 4,608          |
| Medium  | 12                  | 2,624          | 10,496         |
| Base    | 24                  | 2,624          | 10,496         |
| Large   | 48                  | 2,624          | 10,496         |

In all cases, the Mixture-of-LoRAs (MoL) module replaces the shared FFN at a select subset of recursion positions, introducing parameter-efficient conditional computation. For example, ModernALBERT-Tiny inserts MoL at recursions 6 and 7, Medium/Base at recursions 3 and 4, and Large at recursions 3, 4, 5, and 6. This sparse insertion retains the benefits of recursive sharing while restoring expressivity at key depths.
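The recursion schedule can be sketched as follows (a minimal illustration with hypothetical names; `shared_block` and `mol_block` stand in for the actual shared transformer layer and its MoL-augmented variant):

```python
# Sketch of ALBERT-style recursive depth with MoL at select positions.
# Names and structure are illustrative, not the paper's implementation.

def recursive_forward(h, shared_block, mol_block, depth, mol_positions):
    """Apply one shared block `depth` times; at the recursion indices in
    `mol_positions` (1-indexed), use the MoL-augmented variant instead."""
    for step in range(1, depth + 1):
        h = mol_block(h) if step in mol_positions else shared_block(h)
    return h

# ModernALBERT-Tiny: depth 7, MoL inserted at recursions 6 and 7.
trace = []
recursive_forward(
    h=None,
    shared_block=lambda h: trace.append("shared"),
    mol_block=lambda h: trace.append("mol"),
    depth=7,
    mol_positions={6, 7},
)
# trace records five shared applications followed by two MoL applications
```

The same loop with `mol_positions={3, 4}` or `{3, 4, 5, 6}` reproduces the Medium/Base and Large schedules.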

2. Mixture of LoRAs (MoL) Mechanism

Each MoL layer retains the shared FFN projections $W_{\text{down}} \in \mathbb{R}^{4d \times d}$ and $W_{\text{up}} \in \mathbb{R}^{d \times 4d}$, but augments them with $E$ LoRA experts, each introducing a rank-$r$ correction delta. Specifically, each expert $i$ comprises two low-rank adapters:

  • For the “down” projection: $A_{i,1} \in \mathbb{R}^{r \times d}$, $B_{i,1} \in \mathbb{R}^{4d \times r}$
  • For the “up” projection: $A_{i,2} \in \mathbb{R}^{r \times 4d}$, $B_{i,2} \in \mathbb{R}^{d \times r}$

Given a token embedding $h \in \mathbb{R}^d$, a router MLP produces routing scores $p(h) \in \mathbb{R}^E$, which are sparsified by top-$k$ selection ($k=1$ for Tiny, $k=2$ otherwise) and renormalized to yield gating weights $\alpha_i(h)$. The per-token weight update is:

$$W_{\text{down}}'(h) = W_{\text{down}} + \sum_{i=1}^{E} \alpha_i(h)\, B_{i,1} A_{i,1}$$

$$W_{\text{up}}'(h) = W_{\text{up}} + \sum_{i=1}^{E} \alpha_i(h)\, B_{i,2} A_{i,2}$$

The MoL-augmented FFN then computes: $$\text{FFN}_{\text{MoL}}(h) = W_{\text{up}}'(h)\,\text{GeGLU}\!\left(W_{\text{down}}'(h)\, h\right)$$

This structure enables token-conditional, expert-dependent modulation of shared parameters without full parameter untangling.
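In code, the per-token MoL computation might look like the following NumPy sketch (single token, hypothetical shapes; the separate gate projection `W_gate` used for GeGLU is an assumption layered on top of the equations above, and the router is reduced to a single linear map):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mol_ffn(h, W_down, W_up, W_gate, A1, B1, A2, B2, router_W, k=2):
    """Token-conditional MoL FFN for a single token h in R^d.
    A1[i]: (r, d), B1[i]: (4d, r); A2[i]: (r, 4d), B2[i]: (d, r)."""
    E = len(A1)
    p = softmax(router_W @ h)              # routing scores over E experts
    top = np.argsort(p)[-k:]               # top-k expert indices
    alpha = np.zeros(E)
    alpha[top] = p[top] / p[top].sum()     # renormalized gating weights

    # Token-conditional weight deltas W_down'(h), W_up'(h)
    Wd = W_down + sum(alpha[i] * B1[i] @ A1[i] for i in top)
    Wu = W_up + sum(alpha[i] * B2[i] @ A2[i] for i in top)

    # Gated activation (GeGLU), then project back down
    u = gelu(Wd @ h) * (W_gate @ h)
    return Wu @ u
```

Only the selected top-$k$ expert deltas are materialized per token, which is what keeps the conditional computation cheap.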

3. Integrated Modern Transformer Practices

ModernALBERT incorporates several contemporary neural architectural advancements:

  • Rotary Position Embeddings (RoPE): Applied to all self-attention heads, replacing queries $Q$ and keys $K$ with rotationally embedded variants $Q_{\text{rot}} = (Q \circ \cos\theta) + (\text{rotate}(Q) \circ \sin\theta)$, and analogously for $K_{\text{rot}}$, strengthening positional representation.
  • Gated GELU (GeGLU): Used in all FFNs, computing $\text{GeGLU}(x) = (xW_1) \circ \text{GELU}(xW_2)$, which provides a stronger nonlinearity than plain GELU or ReLU activations.
  • FlashAttention: A fused kernel for exact, I/O-efficient attention that streams blocks of $Q$, $K$, $V$ to avoid materializing the full $N \times N$ attention matrix, reducing both latency and memory consumption.

These enhancements are combined with the MoL approach in all ModernALBERT variants.
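For illustration, the RoPE rotation can be sketched as follows (a minimal NumPy version assuming the common half-split pairing of dimensions; ModernALBERT's exact layout may differ):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim).
    Pairs dimension j with j + dim/2 and rotates each pair by pos * theta_j."""
    seq, dim = x.shape
    half = dim // 2
    # theta_j = base^(-2j/dim): one frequency per dimension pair
    freqs = base ** (-np.arange(half) * 2.0 / dim)
    angles = positions[:, None] * freqs[None, :]       # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # (x ∘ cosθ) + (rotate(x) ∘ sinθ), with rotate(x) = (-x2, x1)
    return np.concatenate([x1 * cos - x2 * sin,
                           x2 * cos + x1 * sin], axis=-1)
```

A useful sanity check: the rotation preserves vector norms, and the dot product between a rotated query and key depends only on their relative offset, which is the property that makes RoPE attractive for attention.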

4. Initialization and Training Strategy

Shared parameters are initialized via distillation. A pretrained ModernBERT (the “teacher”) provides the initial weights, with parameters mapped in a stepwise fashion (Bae et al., 2024), followed by further distillation on soft logits (KL-divergence loss) alongside the masked language modeling (MLM) objective. Optimization uses AdamW with warmup followed by linear decay, at peak learning rates of $5 \times 10^{-4}$ or $5 \times 10^{-5}$ depending on the schedule. All experiments use LoRA rank $r=8$, $E=8$ experts ($E=4$ for Tiny), and routing sparsity $k=1$ (Tiny) or $k=2$ (others).
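The combined objective can be sketched as follows (a simplified NumPy version; the mixing weight `lam` and temperature `T` are illustrative assumptions, not values reported in the paper):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels, lam=0.5, T=1.0):
    """KL(teacher || student) on soft logits plus the MLM cross-entropy.
    `labels` holds vocabulary ids at masked positions, -100 elsewhere."""
    s = log_softmax(student_logits / T)
    t = log_softmax(teacher_logits / T)
    kl = (np.exp(t) * (t - s)).sum(axis=-1).mean()     # soft-logit KL term
    mask = labels != -100
    mlm = -s[mask, labels[mask]].mean()                # masked LM term
    return lam * kl + (1 - lam) * mlm
```

When student and teacher logits coincide, the KL term vanishes and only the MLM cross-entropy remains.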

Training uses a pre-training budget of 30B tokens (significantly less than ModernBERT’s 1.7T), with a two-phase corpus: RedPajama-1T (20–30k steps), then RefinedWeb (70–80k steps). The global batch size is 384, with a context length of 1024 tokens.

5. Inference-Time Expert Merging

For efficient deployment, the per-token routing overhead is eliminated via expert merging, collapsing the $E$ experts into a single static adapter. Two strategies are presented:

  • Uniform Averaging: Compute $\bar{A} = \frac{1}{E}\sum_{i=1}^{E} A_i$ and $\bar{B} = \frac{1}{E}\sum_{i=1}^{E} B_i$ for both projections, then set $W' = W + \bar{B}\bar{A}$.
  • EMA-Weighted Merging: Track per-batch average router probabilities $\bar{r}_i = \frac{1}{T}\sum_{t=1}^{T} p_i(h_t)$, maintain an exponential moving average $w \leftarrow \alpha w + (1-\alpha)\bar{r}$ (normalized so that $\sum_i w_i = 1$), and merge the experts by this weighted sum.

Uniform averaging retains at least $99\%$ of the original accuracy, allowing practical inference without substantial accuracy loss.
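Both strategies reduce to precomputing a single static delta offline; a minimal sketch (the factor-averaging form follows the uniform rule above, and reusing the same form with EMA-tracked weights is an assumption about the weighted variant):

```python
import numpy as np

def merge_experts(W, A_list, B_list, weights=None):
    """Collapse E LoRA experts into one static weight matrix.
    weights=None gives uniform averaging; otherwise pass normalized
    EMA-tracked router weights for the weighted merge."""
    E = len(A_list)
    if weights is None:
        weights = np.full(E, 1.0 / E)              # uniform averaging
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # enforce sum to 1
    A_bar = sum(w * A for w, A in zip(weights, A_list))
    B_bar = sum(w * B for w, B in zip(weights, B_list))
    return W + B_bar @ A_bar                       # W' = W + B̄Ā
```

After merging, the router and the per-expert adapters can be discarded, so inference runs with a single dense FFN per recursion position.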

6. Empirical Evaluation

ModernALBERT achieves state-of-the-art results among compact models (<120M parameters) across GLUE, SQuAD-v2, and BEIR. Key results include:

  • GLUE (Unweighted Avg):
    • Tiny (50M): 84.95
    • Medium (55M): 86.21
    • Base (75M): 87.70
    • Large (120M): 88.72
    • Baselines: BERT-base (110M) 84.84; ModernBERT-base (149M) 88.45.
  • SQuAD-v2 (F1 / Exact Match):
    • Tiny: 90.0 / 82.9
    • Medium: 90.4 / 82.9
    • Base: 92.8 / 86.1
    • Large: 92.9 / 85.9
    • Baselines: BERT-base 88.6 / 80.6; RoBERTa-base 91.7 / 84.7; ALBERT-xxlarge 92.5 / 84.5; ModernBERT-base 92.6 / 85.2.
  • BEIR (Average over 6 Tasks):
    • ModernALBERT-base (75M): 46.66
    • ModernBERT-base (149M): 41.6
    • BERT-base (110M): 38.9
    • On ArguAna retrieval: ModernALBERT 48.82 vs. ModernBERT 35.7.

These metrics demonstrate that ModernALBERT matches or surpasses fully parameterized baselines and larger models in its class on a range of language understanding and retrieval tasks.

7. Significance and Implications

ModernALBERT synthesizes ALBERT’s recursive parameter efficiency with conditionally modulated expressivity via MoL, leveraging modern architectural best practices for competitive downstream performance. The approach shows that token-conditional weight-space modulation can compensate for the collapse in expressivity otherwise induced by recursive parameter sharing, with the additional benefit of rapid convergence (30B tokens) and seamless inference-stage compression via expert merging. This suggests efficient recursive transformers need not be restricted by fixed layers’ lack of expressivity and can achieve strong results in both general language understanding and retrieval without the scale requirements or deployment cost of conventional full-parameter models (Nouriborji et al., 14 Dec 2025).
