Hourglass MLPs: High-Dimensional Residual Refinement

Updated 19 January 2026
  • Hourglass MLPs are neural network architectures that invert conventional residual block designs by employing wide high-dimensional skip connections and narrow bottleneck paths.
  • They leverage fixed random projections to efficiently lift input vectors, reducing trainable parameters while preserving geometric properties for robust performance.
  • Empirical results in generative, denoising, and image restoration tasks highlight their superior expressivity and parameter efficiency over conventional MLPs.

Hourglass MLPs are multi-layer perceptron architectures characterized by an inversion of the conventional block shape, employing a wide–narrow–wide structure. In these designs, residual (skip) connections operate in an expanded high-dimensional latent space, while the learnable computation proceeds through a sequence of narrow bottlenecks. This configuration facilitates highly expressive incremental refinement within a rich latent representation, while optimizing parameter economy and efficiency. Hourglass MLPs leverage fixed random projections into high-dimensional spaces, yielding further savings in trainable parameters and memory bandwidth. Empirical studies demonstrate consistent superiority of Hourglass architectures over conventional MLPs in generative, denoising, and image restoration tasks, with distinctly different scaling behaviors as parameter budgets increase (Chen et al., 2 Oct 2025).

1. Architectural Principles and Motivation

Conventional residual MLP blocks employ a narrow–wide–narrow schema:

  • Input/output dimension $d_x$ corresponds to token or pixel-vector size.
  • Hidden expansion $d_h > d_x$.
  • Block operation: $x_{i+1} = x_i + W_2\, \sigma(W_1\, \mathrm{norm}(x_i))$, where $W_1 \in \mathbb{R}^{d_h \times d_x}$, $W_2 \in \mathbb{R}^{d_x \times d_h}$.
  • The skip connection operates at dimension $d_x$, confining learnable residuals to the input/output space.

Hourglass MLP blocks reverse this configuration:

  • Use a high-dimensional latent space $d_z \gg d_x$ for the skip connection.
  • Employ a narrow bottleneck $d_h < d_z$ for the computation pathway.
  • Structured as:

    1. Input lift: $z_0 = W_{\text{in}} x_0$, $W_{\text{in}} \in \mathbb{R}^{d_z \times d_x}$.
    2. $L$ residual Hourglass blocks: $z_{i+1} = z_i + W_{i,2}\, \sigma(W_{i,1}\, \mathrm{norm}(z_i))$, $W_{i,1} \in \mathbb{R}^{d_h \times d_z}$, $W_{i,2} \in \mathbb{R}^{d_z \times d_h}$.
    3. Final projection: $\hat{y} = W_{\text{out}} z_L$, $W_{\text{out}} \in \mathbb{R}^{d_y \times d_z}$.
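
The three-stage structure above admits a compact sketch. The following NumPy forward pass is a minimal illustration, assuming ReLU for $\sigma$ and a simple per-feature layer normalization, since the concrete choices of activation and norm are not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm(z):
    # simple layer norm over the feature axis (the exact norm is an assumption)
    mu, sd = z.mean(-1, keepdims=True), z.std(-1, keepdims=True)
    return (z - mu) / (sd + 1e-6)

def hourglass_forward(x, W_in, blocks, W_out):
    """x: (batch, d_x); blocks: list of (W1, W2) with shapes (d_h, d_z), (d_z, d_h)."""
    z = x @ W_in.T                              # lift to latent space: (batch, d_z)
    for W1, W2 in blocks:
        h = np.maximum(0.0, norm(z) @ W1.T)     # narrow bottleneck, ReLU as sigma
        z = z + h @ W2.T                        # wide high-dimensional residual update
    return z @ W_out.T                          # project back: (batch, d_y)

# toy dimensions for illustration: d_z >> d_x, d_h < d_z
d_x, d_z, d_h, L = 16, 128, 8, 4
W_in = rng.normal(0.0, 1.0 / np.sqrt(d_x), (d_z, d_x))   # fixed random lift
blocks = [(rng.normal(0.0, 0.02, (d_h, d_z)),
           rng.normal(0.0, 0.02, (d_z, d_h))) for _ in range(L)]
W_out = rng.normal(0.0, 1.0 / np.sqrt(d_z), (d_x, d_z))

y = hourglass_forward(rng.normal(size=(2, d_x)), W_in, blocks, W_out)
```

Note that only the block matrices (and $W_{\text{out}}$) carry trainable parameters when the lift is fixed, which is the configuration analyzed in the sections below.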

This design enables residual pathways to live in richer, high-dimensional feature spaces, potentially allowing for more expressive incremental corrections. The bottleneck restricts the cost of each block, facilitating greater model depth under a fixed parameter budget.

2. Fixed Random Projection Strategies

Hourglass MLPs frequently employ a fixed random projection $W_{\text{in}}$ to lift input vectors into the expanded latent space. Theoretical foundations in reservoir computing, random-feature models, the Johnson–Lindenstrauss lemma, and compressive sensing indicate that such projections preserve essential geometric and discriminative properties with high probability, provided $d_z \gg d_x$.

Key benefits include:

  • Elimination of trainable parameters for $W_{\text{in}}$.
  • Reduced memory and bandwidth overhead, as random matrices can be generated on the fly.
  • Comparable empirical performance: in ImageNet-32 denoising with $(d_z, d_h, L) = (3546, 270, 5)$, models with fixed versus trainable $W_{\text{in}}$ yield nearly identical PSNR curves (difference $\ll 0.1$ dB).

Across evaluated tasks, Hourglass MLPs with fixed projections consistently align with the Pareto frontier of their fully trainable counterparts.
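
The geometry-preservation claim is easy to check numerically. This sketch assumes a Gaussian lift with entry variance $1/d_z$ (so squared norms are preserved in expectation) and compares pairwise squared distances before and after the fixed random lift:

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_z, n = 32, 2048, 40

X = rng.normal(size=(n, d_x))
# fixed Gaussian lift with entries N(0, 1/d_z): norm-preserving in expectation
W_in = rng.normal(0.0, 1.0 / np.sqrt(d_z), (d_z, d_x))
Z = X @ W_in.T

def pdist2(A):
    # all pairwise squared Euclidean distances
    sq = (A ** 2).sum(-1)
    return sq[:, None] + sq[None, :] - 2.0 * A @ A.T

orig, lifted = pdist2(X), pdist2(Z)
mask = ~np.eye(n, dtype=bool)
ratio = lifted[mask] / orig[mask]
print(ratio.min(), ratio.max())   # distance ratios concentrate near 1
```

With $d_z = 2048$ the relative deviation of each distance ratio is on the order of $\sqrt{2/d_z} \approx 3\%$, consistent with the high-probability preservation guarantees cited above.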

3. Parameter Budget and Computational Complexity

Let $d = d_x$, let $e$ be the expansion factor ($d_z = e d$), $b = d_h$ the bottleneck width, and $L$ the stack depth. The parameter count for an Hourglass MLP is:

  • Trainable: $P_{\text{hr}}(d, e, b, L) = d\, d_z$ (input lift) $+\, 2 L\, d_z b$ (per-block) $= e d^2 + 2 L e d b$.
  • With fixed $W_{\text{in}}$: $P_{\text{fix}} = 2 L e d b$.

Contrast with conventional MLPs (expansion $f$):

  • $P_{\text{conv}}(d, f, L) = 2 L d f$.

To match parameter budgets, Hourglass architectures select $e \gg 1$, $b \ll f$, and $L \gg 1$ such that $e d^2 + 2 L e d b \approx 2 L d f$.
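
The budget arithmetic above can be sketched directly; the concrete values below are illustrative choices, not configurations from the paper:

```python
def params_hourglass(d, e, b, L, fixed_lift=False):
    """Trainable parameters of an Hourglass MLP (input lift + L bottleneck blocks)."""
    lift = 0 if fixed_lift else e * d * d        # W_in: (e*d) x d, free if fixed
    return lift + 2 * L * (e * d) * b            # per block: W1 (b x e*d) + W2 (e*d x b)

def params_conventional(d, f, L):
    """Trainable parameters of a conventional residual MLP stack."""
    return 2 * L * d * f                         # per block: W1 (f x d) + W2 (d x f)

d = 1024
conv = params_conventional(d, f=4 * d, L=4)      # common 4x expansion, 4 blocks
hr = params_hourglass(d, e=4, b=512, L=8, fixed_lift=True)  # wider skip, deeper stack
print(conv, hr)   # both 33,554,432: matched budgets
```

With a fixed lift, doubling the depth while quartering the bottleneck (relative to $f$) lands exactly on the conventional budget, illustrating how the "deeper and narrower" trade is made.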

Forward FLOPs per block:

  • Hourglass: $2 d_z d_h = 2 e d b$.
  • Conventional: $2 d f$.

The bottleneck width $b \ll f$ and increased depth $L$ allow Hourglass MLPs to sustain cost parity while enhancing expressivity through deeper stacks operating in wider latent dimensions.
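
Per-block cost parity follows directly from these expressions; a small check, under the assumption that $e$ and $b$ are chosen so that $e b = f$:

```python
def flops_hourglass_block(d, e, b):
    # matrix-multiply cost proxy for one block: W1 and W2 each cost (e*d)*b
    return 2 * (e * d) * b

def flops_conventional_block(d, f):
    # matrix-multiply cost proxy: W1 and W2 each cost d*f
    return 2 * d * f

# e*b == f gives exact per-block parity (here f = 4096, e = 4, b = 1024)
hg = flops_hourglass_block(1024, e=4, b=1024)
cv = flops_conventional_block(1024, f=4096)
assert hg == cv
```

Under parity per block, the Hourglass stack spends its remaining budget on depth rather than width, which is the scaling direction the empirical results below favor.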

4. Empirical Performance and Scaling Behavior

Hourglass MLPs have been empirically evaluated on image-generation, denoising, and super-resolution tasks using MNIST and ImageNet-32 datasets:

| Task             | Dataset     | Hourglass Params | Conventional Params | Hourglass PSNR | Conventional PSNR |
|------------------|-------------|------------------|---------------------|----------------|-------------------|
| Denoising        | MNIST       | 66 M             | 75 M                | 22.31 dB       | 22.31 dB          |
| Super-resolution | ImageNet-32 | 69 M             | 87 M                | 24.00 dB       | 24.00 dB          |

Metrics employed include PSNR (dB), SSIM for reconstruction, and classification accuracy via prototype generation.

Hourglass MLPs consistently achieve superior performance–parameter Pareto frontiers in all evaluated settings. Optimization under increasing parameter budgets consistently drives Hourglass designs toward very large $d_z$ ($e d \approx 3$–4 K) and moderate $d_h$ ($\approx 100$–$300$), while increasing network depth $L$ (4–8) rather than bottleneck width. This "wider skip + narrower bottleneck + deeper stack" scaling is not Pareto-optimal for conventional MLPs.

5. Broader Implications and Application Extensions

These findings suggest reconsidering the dimensionality of skip connections in residual networks. Replacing conventional feed-forward layers in Transformers with hourglass-style FFNs ($d \rightarrow e d \rightarrow b \rightarrow e d \rightarrow d$), and adapting self-attention mechanisms to operate within the expanded latent space of width $e d$, offers potential parameter savings in large-scale LLMs.

In architectures such as U-Nets and MLP-Mixers, injecting a fixed random lift into high-dimensional latent space and operating through narrow-bottleneck Hourglass blocks allows flexible adaptation for tasks including classification, segmentation, and generation. Any residual network currently employing skips at a narrow feature size may achieve increased expressivity and parameter efficiency by relocating skip connections into expanded spaces and routing learned incremental changes through cost-effective bottlenecks.
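
As a hypothetical comparison (not reported in the source), the parameter cost of an hourglass-style FFN with the $d \rightarrow e d \rightarrow b \rightarrow e d \rightarrow d$ shape can be set against a standard 4x-expansion Transformer FFN, assuming the lift is a fixed random matrix while the projection back to $d$ remains trainable:

```python
def ffn_params_standard(d, f=None):
    # standard Transformer FFN: d -> f -> d with trainable up/down projections
    f = f if f is not None else 4 * d
    return 2 * d * f

def ffn_params_hourglass(d, e, b):
    # d -> e*d (fixed random lift, no trainable params) -> b -> e*d -> d
    bottleneck = 2 * (e * d) * b    # trainable W1, W2 around the bottleneck
    proj = d * (e * d)              # trainable projection back to width d
    return bottleneck + proj

d = 4096
print(ffn_params_standard(d))                  # 134,217,728
print(ffn_params_hourglass(d, e=4, b=1024))    # 100,663,296
```

The illustrative widths here ($e = 4$, $b = 1024$) are assumptions; the point is only that fixing the lift removes the largest single matrix from the trainable count.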

6. Practical Guidelines for Construction

Recommendations for Hourglass MLP configuration:

  1. Select $e$ such that $e d \approx 3$–$5$ K when $d \approx 1$ K, ensuring geometry preservation via random lifts.
  2. Set $d_h$ to a moderate range (50–300) to maintain per-block cost parity with conventional blocks.
  3. Use the maximal depth $L$ allowed by the parameter budget; empirically, $L = 4$–$8$ saturates performance gains.
  4. Employ a fixed random $W_{\text{in}}$ to save trainable parameters and memory bandwidth.
  5. Assess model selection along the performance–parameter frontier; Hourglass MLPs typically dominate across varied generative and classification benchmarks.
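
As an illustration only, a small helper that applies guidelines 1–4 to choose an expansion under a trainable-parameter budget; the default bottleneck and depth are assumed values within the recommended ranges, not settings from the paper:

```python
def hourglass_config(d, param_budget, b=256, L=8):
    """Pick the largest expansion e whose trainable parameter count, with a
    fixed W_in, fits the budget: P_fix = 2*L*e*d*b (Section 3)."""
    e = param_budget // (2 * L * d * b)
    if e < 1:
        raise ValueError("budget too small for the chosen b and L")
    return {"d_z": e * d, "d_h": b, "L": L,
            "trainable_params": 2 * L * e * d * b}

cfg = hourglass_config(d=1024, param_budget=20_000_000)
print(cfg)   # d_z = 4096, within the recommended e*d ~ 3-5 K range
```

A practical workflow would sweep $(b, L)$ near these defaults and compare candidates along the performance–parameter frontier, per guideline 5.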

The scaling and architectural principles identified in Hourglass MLPs suggest wide applicability and invite further investigation into expanded skip-dimensionality and bottleneck routing within modern neural architectures (Chen et al., 2 Oct 2025).
