
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Published 24 May 2025 in cs.CV, cs.AI, and cs.LG (arXiv:2505.18600v2)

Abstract: Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-LLM (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/ .

Summary


The paper "Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment" proposes a novel approach to Single Image Super-Resolution (SISR) that addresses the scalability limitations of current models when applied to magnifications beyond their training regime. The authors introduce the Chain-of-Zoom (CoZ) framework, a model-agnostic methodology that factorizes SISR into an autoregressive sequence of scale-states coupled with multi-scale-aware prompts. By repeatedly reusing a single backbone super-resolution model and decomposing the conditional probability into tractable sub-problems, CoZ allows existing SR models to reach extreme resolutions without additional training.
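The factorization can be sketched as follows (our notation, not necessarily the paper's: $z_0 = L$ is the low-resolution input, $z_N = H$ the high-resolution target, and $c_i$ the text prompt conditioning step $i$):

```latex
p\!\left(z_1, \dots, z_N \mid z_0\right)
  \;=\; \prod_{i=1}^{N} p\!\left(z_i \mid z_{i-1},\, c_i\right),
\qquad z_0 = L,\quad z_N = H,
```

so each factor is a single fixed-scale SR step that the backbone model can already solve, and chaining $N$ such steps compounds the magnification.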

Approach and Methodology

Chain-of-Zoom (CoZ) employs scale-level autoregression by introducing intermediate scale-states that act as bridges between the low-resolution input and the desired high-resolution output. The framework models the image generative process through these intermediate states, decomposing the complex distribution p(H | L) into more manageable per-step components. To compensate for the visual cues that diminish at high magnifications, CoZ additionally conditions each zoom step on multi-scale-aware text prompts produced by a vision-language model (VLM). The prompt extractor itself is fine-tuned via reinforcement learning with GRPO against a critic VLM, aligning the text guidance with human preferences and significantly enhancing the ability of SR models to maintain semantic coherence across extreme magnification levels.
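The control flow of the zoom chain can be sketched as below. This is a deliberately toy illustration: the real method operates on images with a diffusion SR backbone and a fine-tuned VLM, whereas here resolution is modeled as a bare integer and the "backbone" simply scales it by 4, purely to show how a fixed-scale model compounds across steps. All names (`chain_of_zoom`, `sr_step`, `make_prompt`) are illustrative stand-ins, not the paper's actual API.

```python
def chain_of_zoom(lr_size, sr_step, make_prompt, num_steps):
    """Reach an extreme magnification by chaining a fixed-scale SR model,
    feeding each step a prompt derived from the scale-state history."""
    states = [lr_size]
    for _ in range(num_steps):
        prompt = make_prompt(states)          # multi-scale-aware prompt
        states.append(sr_step(states[-1], prompt))
    return states

# Stand-ins: a "4x backbone" and a prompt extractor that sees every
# intermediate scale-state, as CoZ's VLM does.
sr_step = lambda size, prompt: size * 4
make_prompt = lambda states: f"zoom level {len(states)}"

states = chain_of_zoom(64, sr_step, make_prompt, num_steps=4)
print(states[-1] // states[0])  # 4 chained 4x steps -> 256x total
```

The key design point mirrored here is that the backbone is never retrained: only the number of chained steps and the per-step prompt change as the magnification grows.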

Experimental Results

The study demonstrates the efficacy of CoZ by employing a standard 4× diffusion SR model wrapped in this framework, successfully achieving magnifications beyond 256× with high perceptual quality. Quantitative assessments on diverse no-reference perceptual metrics like NIQE, MUSIQ, and CLIPIQA indicate marked improvements in visual fidelity and semantic alignment compared to conventional methods. The VLM-guided prompt extraction further aids in maintaining high-frequency detail without unwarranted hallucinations, especially at extreme magnification levels. Qualitative results corroborate these findings, illustrating superior performance across a range of scales.

Implications and Future Directions

The implications of this research are multifaceted. Practically, Chain-of-Zoom provides a resource-efficient solution to the problem of modeling extreme resolutions, obviating the need for training new models for every desired scale increase. This flexibility is particularly beneficial in contexts like medical imaging and satellite surveillance, where high detail and fidelity are crucial. Theoretically, CoZ opens avenues for exploring adaptive approaches in zoom strategies and customized guidance using text prompts, paving the way for more robust integrations of vision-language systems in generative models.

In terms of future directions, the researchers hint at the exploration of learned zoom policies and domain-specific reward functions, which could further optimize the performance and applicability of CoZ in diverse areas. Additionally, adaptive backbone selection strategies could be developed, enhancing model robustness across different imaging domains and input characteristics.

In conclusion, the Chain-of-Zoom framework represents a significant step forward in overcoming the traditional bottlenecks associated with extreme image magnification. By leveraging autoregression and multi-scale text guidance, it sets a promising precedent for the evolution of Single Image Super-Resolution techniques and their practical applications across various fields.
