ELT: Elastic Looped Transformers for Visual Generation
This presentation explores Elastic Looped Transformers, a breakthrough in parameter-efficient image and video generation. By recursively reusing a compact set of transformer blocks and training with Intra-Loop Self Distillation, ELT achieves state-of-the-art quality with 4× fewer parameters than conventional models while enabling dynamic compute allocation at inference time—a single model that adapts seamlessly from mobile devices to high-performance servers.
What if you could build an AI that generates cinema-quality images with just one-quarter the parameters of today's models, and then dial its computational cost up or down at will without retraining? The authors behind Elastic Looped Transformers have done exactly that, fundamentally rethinking how we architect generative models.
Conventional generative transformers are trapped in a parameter arms race. Each layer is unique, so doubling depth means doubling memory footprint. On distributed hardware, this creates a brutal bandwidth bottleneck as weights constantly move between chips, throttling the very speed these models are built for.
The authors break this constraint with a deceptively simple idea: what if depth didn't require new parameters?
ELT models reuse the same compact set of transformer blocks recursively within each denoising step. This decouples depth from parameter count entirely. An ELT with 111 million parameters achieves the same image quality as a 446 million parameter baseline, yet trains faster and runs with far less memory traffic because the weights never leave the accelerator.
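To make the weight-tying concrete, here is a minimal numpy sketch of the looping idea. The "block" is a toy residual MLP standing in for a full transformer block, and names like `shared_block` and `loop_depth` are illustrative, not the paper's API; the point is only that effective depth grows while the parameter count stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden width, far smaller than a real model

# One shared block: these are the ONLY weights, regardless of loop depth.
W1 = rng.normal(0, 0.02, (d, d))
W2 = rng.normal(0, 0.02, (d, d))

def shared_block(h):
    """Apply the single weight-tied block once (residual MLP stand-in)."""
    return h + np.maximum(h @ W1, 0.0) @ W2

def looped_forward(x, loop_depth):
    """Recurse the same block `loop_depth` times: depth without new parameters."""
    h = x
    for _ in range(loop_depth):
        h = shared_block(h)
    return h

x = rng.normal(size=(4, d))
shallow = looped_forward(x, loop_depth=2)  # low-compute setting
deep = looped_forward(x, loop_depth=8)     # high-compute setting
print(W1.size + W2.size)  # 8192 weights in both settings
```

A conventional 8-layer model would need four times the weights of the 2-layer one; here both depths share the same 8,192 parameters, which is why the weights can stay resident on a single accelerator.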
But recurrent architectures have a fatal flaw: intermediate states can become meaningless noise if only the final output is supervised. The authors solve this with Intra-Loop Self Distillation. The model at maximum loop depth acts as a teacher, supervising its own earlier iterations. This forces every loop to produce a meaningful prediction, unlocking a remarkable property: at inference, you can stop at any loop count and still get a coherent image. One model, infinite compute-quality tradeoffs, no retraining required.
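The training signal described above can be sketched as follows. This is a hedged illustration of Intra-Loop Self Distillation, not the authors' implementation: the choice of mean-squared error as the distillation loss and the helper names are assumptions; the essential structure is that the deepest loop's output acts as a teacher target for every earlier loop.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
W1 = rng.normal(0, 0.02, (d, d))
W2 = rng.normal(0, 0.02, (d, d))

def shared_block(h):
    """The single weight-tied block (toy residual MLP stand-in)."""
    return h + np.maximum(h @ W1, 0.0) @ W2

def intra_loop_distillation_losses(x, max_loops):
    """Run all loops once, keep every intermediate prediction, and score
    each earlier loop against the deepest loop's output (the teacher).
    In real training the teacher would be gradient-stopped; MSE is an
    assumed stand-in for the paper's actual loss."""
    preds = []
    h = x
    for _ in range(max_loops):
        h = shared_block(h)
        preds.append(h)
    teacher = preds[-1]
    return [float(np.mean((p - teacher) ** 2)) for p in preds[:-1]]

x = rng.normal(size=(4, d))
losses = intra_loop_distillation_losses(x, max_loops=6)
print(len(losses))  # one distillation term per earlier loop: 5
```

Because every loop is pushed toward the deepest loop's prediction, stopping early at inference still yields a usable output rather than an uninterpretable intermediate state.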
The results tell a striking story. This curve shows that through recursive looping, a model with just 111 million parameters reaches the same FID score as baselines exceeding 400 million parameters. More importantly, the slope reveals that within bounded parameter budgets, looping architectures extract far more quality per parameter than conventional depth scaling. It's not just efficiency, it's a fundamentally better scaling law.
These gains aren't limited to images. On video generation, a 76 million parameter ELT outperforms a 306 million parameter baseline, demonstrating that looped transformers resist overfitting in data-constrained regimes. And because the same model gracefully scales from low to high compute, you can deploy it on a mobile device that runs fewer loops, or a server that runs more, without changing a single weight.
Elastic Looped Transformers prove that architectural cleverness can outpace brute-force scaling. By decoupling depth from parameters and training every loop to be meaningful, the authors have built models that are smaller, faster, and infinitely more flexible than anything before them. To explore more cutting-edge research like this and create your own AI video presentations, visit EmergentMind.com.