
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Published 24 May 2025 in cs.CV, cs.AI, and cs.LG (arXiv:2505.18600v2)

Abstract: Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-LLM (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/ .

Summary


The paper "Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment" proposes a novel approach to Single Image Super-Resolution (SISR) that addresses the scalability limitations of current models when applied to magnifications beyond their training regime. The authors introduce the Chain-of-Zoom (CoZ) framework, a model-agnostic methodology that factorizes SISR into an autoregressive sequence of scale-states coupled with multi-scale-aware prompts. By repeatedly reusing a single backbone super-resolution model and decomposing the conditional probability into tractable sub-problems, CoZ allows existing SR models to reach extreme resolutions without additional training.
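The factorization can be sketched as follows (our notation, not necessarily the paper's: $z_0 = L$ is the low-resolution input, $z_N = H$ the high-resolution target, and $c_i$ the text prompt conditioning step $i$):

```latex
p\!\left(z_1, \dots, z_N \mid z_0\right)
  \;=\; \prod_{i=1}^{N} p\!\left(z_i \mid z_{i-1},\, c_i\right),
\qquad z_0 = L,\quad z_N = H,
```

so each factor is a single fixed-scale SR step that the backbone model can already solve, and chaining $N$ such steps compounds the magnification.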

Approach and Methodology

Chain-of-Zoom (CoZ) employs scale-level autoregression by introducing intermediate scale-states that act as bridges between the low-resolution input and the desired high-resolution output. The framework models the image generative process through these intermediate states, decomposing the complex distribution p(H | L) into more manageable per-step components. To compensate for the visual cues that diminish at high magnifications, CoZ additionally conditions each zoom step on multi-scale-aware text prompts produced by a vision-language model (VLM). The prompt extractor itself is fine-tuned via reinforcement learning with GRPO against a critic VLM, aligning the text guidance with human preferences and significantly enhancing the ability of SR models to maintain semantic coherence across extreme magnification levels.
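The control flow of the zoom chain can be sketched as below. This is a deliberately toy illustration: the real method operates on images with a diffusion SR backbone and a fine-tuned VLM, whereas here resolution is modeled as a bare integer and the "backbone" simply scales it by 4, purely to show how a fixed-scale model compounds across steps. All names (`chain_of_zoom`, `sr_step`, `make_prompt`) are illustrative stand-ins, not the paper's actual API.

```python
def chain_of_zoom(lr_size, sr_step, make_prompt, num_steps):
    """Reach an extreme magnification by chaining a fixed-scale SR model,
    feeding each step a prompt derived from the scale-state history."""
    states = [lr_size]
    for _ in range(num_steps):
        prompt = make_prompt(states)          # multi-scale-aware prompt
        states.append(sr_step(states[-1], prompt))
    return states

# Stand-ins: a "4x backbone" and a prompt extractor that sees every
# intermediate scale-state, as CoZ's VLM does.
sr_step = lambda size, prompt: size * 4
make_prompt = lambda states: f"zoom level {len(states)}"

states = chain_of_zoom(64, sr_step, make_prompt, num_steps=4)
print(states[-1] // states[0])  # 4 chained 4x steps -> 256x total
```

The key design point mirrored here is that the backbone is never retrained: only the number of chained steps and the per-step prompt change as the magnification grows.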

Experimental Results

The study demonstrates the efficacy of CoZ by employing a standard 4× diffusion SR model wrapped in this framework, successfully achieving magnifications beyond 256× with high perceptual quality. Quantitative assessments on diverse no-reference perceptual metrics like NIQE, MUSIQ, and CLIPIQA indicate marked improvements in visual fidelity and semantic alignment compared to conventional methods. The VLM-guided prompt extraction further aids in maintaining high-frequency detail without unwarranted hallucinations, especially at extreme magnification levels. Qualitative results corroborate these findings, illustrating superior performance across a range of scales.

Implications and Future Directions

The implications of this research are multifaceted. Practically, Chain-of-Zoom provides a resource-efficient solution to the problem of modeling extreme resolutions, obviating the need for training new models for every desired scale increase. This flexibility is particularly beneficial in contexts like medical imaging and satellite surveillance, where high detail and fidelity are crucial. Theoretically, CoZ opens avenues for exploring adaptive approaches in zoom strategies and customized guidance using text prompts, paving the way for more robust integrations of vision-language systems in generative models.

In terms of future directions, the researchers hint at the exploration of learned zoom policies and domain-specific reward functions, which could further optimize the performance and applicability of CoZ in diverse areas. Additionally, adaptive backbone selection strategies could be developed, enhancing model robustness across different imaging domains and input characteristics.

In conclusion, the Chain-of-Zoom framework represents a significant step forward in overcoming the traditional bottlenecks associated with extreme image magnification. By leveraging autoregression and multi-scale text guidance, it sets a promising precedent for the evolution of Single Image Super-Resolution techniques and their practical applications across various fields.
