
ModernALBERT: Compact Recursive Transformer

Updated 21 December 2025
  • ModernALBERT is a compact recursive transformer that enhances ALBERT by integrating Mixture-of-LoRA for token-conditional weight modulation.
  • Its innovative MoL mechanism employs expert routing and low-rank adapters to selectively augment shared FFN layers, ensuring efficient conditional computation.
  • Empirical evaluations demonstrate that ModernALBERT outperforms larger and fully parameterized models on GLUE, SQuAD-v2, and BEIR benchmarks despite its sub-120M parameter footprint.

ModernALBERT is a compact recursive transformer architecture that integrates a Mixture-of-LoRAs (Low-Rank Adaptation) mechanism with modern transformer enhancements, aiming to address the expressivity limitations imposed by the aggressive parameter sharing characteristic of the ALBERT model class. The design achieves state-of-the-art performance among language models under 120M parameters, utilizing token-conditional expert modulation within a highly parameter-efficient framework (Nouriborji et al., 14 Dec 2025).

1. Architectural Design and Model Variants

ModernALBERT employs recursive parameter sharing, following ALBERT: a single self-attention block and a single feed-forward network (FFN) are applied $N$ times to achieve depth, reducing the count of unique parameters by a factor of $N$ relative to a fully parameterized model. Four public variants are defined by recursion depth $N$, hidden dimension $d$, and FFN width $4d$:

| Variant | Recursion Depth $N$ | Hidden Dim $d$ | FFN Width $4d$ |
|---------|--------------------:|---------------:|---------------:|
| Tiny    | 7                   | 1,152          | 4,608          |
| Medium  | 12                  | 2,624          | 10,496         |
| Base    | 24                  | 2,624          | 10,496         |
| Large   | 48                  | 2,624          | 10,496         |

In all cases, the Mixture-of-LoRAs (MoL) module replaces the shared FFN at a select subset of recursion positions, introducing parameter-efficient conditional computation. For example, ModernALBERT-Tiny inserts MoL at recursions 6 and 7, Medium/Base at recursions 3 and 4, and Large at recursions 3, 4, 5, and 6. This sparse insertion retains the benefits of recursive sharing while restoring expressivity at key depths.
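The recursion schedule can be sketched as follows (a minimal illustration with hypothetical names; `shared_block` and `mol_block` stand in for the actual shared transformer layer and its MoL-augmented variant):

```python
# Sketch of ALBERT-style recursive depth with MoL at select positions.
# Names and structure are illustrative, not the paper's implementation.

def recursive_forward(h, shared_block, mol_block, depth, mol_positions):
    """Apply one shared block `depth` times; at the recursion indices in
    `mol_positions` (1-indexed), use the MoL-augmented variant instead."""
    for step in range(1, depth + 1):
        h = mol_block(h) if step in mol_positions else shared_block(h)
    return h

# ModernALBERT-Tiny: depth 7, MoL inserted at recursions 6 and 7.
trace = []
recursive_forward(
    h=None,
    shared_block=lambda h: trace.append("shared"),
    mol_block=lambda h: trace.append("mol"),
    depth=7,
    mol_positions={6, 7},
)
# trace records five shared applications followed by two MoL applications
```

The same loop with `mol_positions={3, 4}` or `{3, 4, 5, 6}` reproduces the Medium/Base and Large schedules.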

2. Mixture of LoRAs (MoL) Mechanism

Each MoL layer retains the shared FFN projections $W_{\text{down}} \in \mathbb{R}^{4d \times d}$ and $W_{\text{up}} \in \mathbb{R}^{d \times 4d}$, but augments them with $E$ LoRA experts, each introducing a rank-$r$ correction delta. Specifically, each expert $i$ comprises two low-rank adapters:

  • For the “down” projection: $A_{i,1} \in \mathbb{R}^{r \times d}$, $B_{i,1} \in \mathbb{R}^{4d \times r}$
  • For the “up” projection: $A_{i,2} \in \mathbb{R}^{r \times 4d}$, $B_{i,2} \in \mathbb{R}^{d \times r}$

Given a token embedding $h \in \mathbb{R}^d$, a router MLP produces routing scores $p(h) \in \mathbb{R}^E$, which are sparsified by top-$k$ selection ($k=1$ for Tiny, $k=2$ otherwise) and renormalized to yield gating weights $\alpha_i(h)$. The per-token weight update is:

$$W_{\text{down}}'(h) = W_{\text{down}} + \sum_{i=1}^{E} \alpha_i(h)\, B_{i,1} A_{i,1}$$

$$W_{\text{up}}'(h) = W_{\text{up}} + \sum_{i=1}^{E} \alpha_i(h)\, B_{i,2} A_{i,2}$$

The MoL-augmented FFN then computes: $$\text{FFN}_{\text{MoL}}(h) = W_{\text{up}}'(h)\,\text{GeGLU}\!\left(W_{\text{down}}'(h)\, h\right)$$

This structure enables token-conditional, expert-dependent modulation of shared parameters without full parameter untangling.
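In code, the per-token MoL computation might look like the following NumPy sketch (single token, hypothetical shapes; the separate gate projection `W_gate` used for GeGLU is an assumption layered on top of the equations above, and the router is reduced to a single linear map):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mol_ffn(h, W_down, W_up, W_gate, A1, B1, A2, B2, router_W, k=2):
    """Token-conditional MoL FFN for a single token h in R^d.
    A1[i]: (r, d), B1[i]: (4d, r); A2[i]: (r, 4d), B2[i]: (d, r)."""
    E = len(A1)
    p = softmax(router_W @ h)              # routing scores over E experts
    top = np.argsort(p)[-k:]               # top-k expert indices
    alpha = np.zeros(E)
    alpha[top] = p[top] / p[top].sum()     # renormalized gating weights

    # Token-conditional weight deltas W_down'(h), W_up'(h)
    Wd = W_down + sum(alpha[i] * B1[i] @ A1[i] for i in top)
    Wu = W_up + sum(alpha[i] * B2[i] @ A2[i] for i in top)

    # Gated activation (GeGLU), then project back down
    u = gelu(Wd @ h) * (W_gate @ h)
    return Wu @ u
```

Only the selected top-$k$ expert deltas are materialized per token, which is what keeps the conditional computation cheap.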

3. Integrated Modern Transformer Practices

ModernALBERT incorporates several contemporary neural architectural advancements:

  • Rotary Position Embeddings (RoPE): Applied to all self-attention heads, replacing queries $Q$ and keys $K$ with rotationally embedded variants $Q_{\text{rot}} = (Q \circ \cos\theta) + (\text{rotate}(Q) \circ \sin\theta)$, and analogously for $K_{\text{rot}}$, strengthening positional representation.
  • Gated GELU (GeGLU): Used in all FFNs, computing $\text{GeGLU}(x) = (xW_1) \circ \text{GELU}(xW_2)$, which provides a stronger nonlinearity than plain GELU or ReLU activations.
  • FlashAttention: A fused kernel for exact, I/O-efficient attention that streams blocks of $Q$, $K$, $V$ to avoid materializing the full $N \times N$ attention matrix, reducing both latency and memory consumption.

These enhancements are combined with the MoL approach in all ModernALBERT variants.
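For illustration, the RoPE rotation can be sketched as follows (a minimal NumPy version assuming the common half-split pairing of dimensions; ModernALBERT's exact layout may differ):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim).
    Pairs dimension j with j + dim/2 and rotates each pair by pos * theta_j."""
    seq, dim = x.shape
    half = dim // 2
    # theta_j = base^(-2j/dim): one frequency per dimension pair
    freqs = base ** (-np.arange(half) * 2.0 / dim)
    angles = positions[:, None] * freqs[None, :]       # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # (x ∘ cosθ) + (rotate(x) ∘ sinθ), with rotate(x) = (-x2, x1)
    return np.concatenate([x1 * cos - x2 * sin,
                           x2 * cos + x1 * sin], axis=-1)
```

A useful sanity check: the rotation preserves vector norms, and the dot product between a rotated query and key depends only on their relative offset, which is the property that makes RoPE attractive for attention.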

4. Initialization and Training Strategy

Shared parameters are initialized via distillation. A pretrained ModernBERT (the “teacher”) provides the initial weights, with parameters mapped in a stepwise fashion (Bae et al., 2024), followed by further distillation on soft logits (KL-divergence loss) alongside the masked language modeling (MLM) objective. Optimization uses AdamW with warmup followed by linear decay, at peak learning rates of $5 \times 10^{-4}$ or $5 \times 10^{-5}$ depending on the schedule. All experiments use LoRA rank $r=8$, $E=8$ experts ($E=4$ for Tiny), and routing sparsity $k=1$ (Tiny) or $k=2$ (others).
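The combined objective can be sketched as follows (a simplified NumPy version; the mixing weight `lam` and temperature `T` are illustrative assumptions, not values reported in the paper):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels, lam=0.5, T=1.0):
    """KL(teacher || student) on soft logits plus the MLM cross-entropy.
    `labels` holds vocabulary ids at masked positions, -100 elsewhere."""
    s = log_softmax(student_logits / T)
    t = log_softmax(teacher_logits / T)
    kl = (np.exp(t) * (t - s)).sum(axis=-1).mean()     # soft-logit KL term
    mask = labels != -100
    mlm = -s[mask, labels[mask]].mean()                # masked LM term
    return lam * kl + (1 - lam) * mlm
```

When student and teacher logits coincide, the KL term vanishes and only the MLM cross-entropy remains.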

Training uses a pre-training budget of 30B tokens (significantly less than ModernBERT’s 1.7T), with a two-phase corpus: RedPajama-1T (20–30k steps), then RefinedWeb (70–80k steps). The global batch size is 384, with a context length of 1024 tokens.

5. Inference-Time Expert Merging

For efficient deployment, the per-token routing overhead is eliminated via expert merging, collapsing the $E$ experts into a single static adapter. Two strategies are presented:

  • Uniform Averaging: Compute $\bar{A} = \frac{1}{E}\sum_{i=1}^{E} A_i$ and $\bar{B} = \frac{1}{E}\sum_{i=1}^{E} B_i$ for both projections, then set $W' = W + \bar{B}\bar{A}$.
  • EMA-Weighted Merging: Track per-batch average router probabilities $\bar{r}_i = \frac{1}{T}\sum_{t=1}^{T} p_i(h_t)$, maintain an exponential moving average $w \leftarrow \alpha w + (1-\alpha)\bar{r}$ (normalized so that $\sum_i w_i = 1$), and merge the experts by this weighted sum.

Uniform averaging retains at least $99\%$ of the original accuracy, allowing practical inference without substantial accuracy loss.
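Both strategies reduce to precomputing a single static delta offline; a minimal sketch (the factor-averaging form follows the uniform rule above, and reusing the same form with EMA-tracked weights is an assumption about the weighted variant):

```python
import numpy as np

def merge_experts(W, A_list, B_list, weights=None):
    """Collapse E LoRA experts into one static weight matrix.
    weights=None gives uniform averaging; otherwise pass normalized
    EMA-tracked router weights for the weighted merge."""
    E = len(A_list)
    if weights is None:
        weights = np.full(E, 1.0 / E)              # uniform averaging
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # enforce sum to 1
    A_bar = sum(w * A for w, A in zip(weights, A_list))
    B_bar = sum(w * B for w, B in zip(weights, B_list))
    return W + B_bar @ A_bar                       # W' = W + B̄Ā
```

After merging, the router and the per-expert adapters can be discarded, so inference runs with a single dense FFN per recursion position.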

6. Empirical Evaluation

ModernALBERT achieves state-of-the-art results among compact models (<120M parameters) across GLUE, SQuAD-v2, and BEIR. Key results include:

  • GLUE (Unweighted Avg):
    • Tiny (50M): 84.95
    • Medium (55M): 86.21
    • Base (75M): 87.70
    • Large (120M): 88.72
    • Baselines: BERT-base (110M) 84.84; ModernBERT-base (149M) 88.45.
  • SQuAD-v2 (F1 / Exact Match):
    • Tiny: 90.0 / 82.9
    • Medium: 90.4 / 82.9
    • Base: 92.8 / 86.1
    • Large: 92.9 / 85.9
    • Baselines: BERT-base 88.6 / 80.6; RoBERTa-base 91.7 / 84.7; ALBERT-xxlarge 92.5 / 84.5; ModernBERT-base 92.6 / 85.2.
  • BEIR (Average over 6 Tasks):
    • ModernALBERT-base (75M): 46.66
    • ModernBERT-base (149M): 41.6
    • BERT-base (110M): 38.9
    • On ArguAna retrieval: ModernALBERT 48.82 vs. ModernBERT 35.7.

These metrics demonstrate that ModernALBERT matches or surpasses fully parameterized baselines and larger models in its class on a range of language understanding and retrieval tasks.

7. Significance and Implications

ModernALBERT synthesizes ALBERT’s recursive parameter efficiency with conditionally modulated expressivity via MoL, leveraging modern architectural best practices for competitive downstream performance. The approach shows that token-conditional weight-space modulation can compensate for the collapse in expressivity otherwise induced by recursive parameter sharing, with the additional benefit of rapid convergence (30B tokens) and seamless inference-stage compression via expert merging. This suggests efficient recursive transformers need not be restricted by fixed layers’ lack of expressivity and can achieve strong results in both general language understanding and retrieval without the scale requirements or deployment cost of conventional full-parameter models (Nouriborji et al., 14 Dec 2025).
