
Arbitrary-Scale Image Super-Resolution

Updated 1 February 2026
  • Arbitrary-Scale Image Super-Resolution is a technique that reconstructs high-resolution images from low-resolution inputs using a continuous mapping from spatial coordinates.
  • It integrates implicit neural representations, multi-scale transformers, and frequency-domain modules to allow flexible and seamless upscaling across various applications.
  • Recent advances combine diffusion models, Gaussian splatting, and vector graphics to achieve state-of-the-art perceptual metrics and computational efficiency.

Arbitrary-Scale Image Super-Resolution (ASSR) refers to the task of reconstructing high-resolution (HR) images from low-resolution (LR) counterparts at any specified, potentially non-integer, upsampling factor using a single unified model. Unlike traditional super-resolution (SR) techniques, which are tied to fixed upsampling factors (e.g., ×2, ×4), ASSR architectures aim to provide continuous and flexible upscaling, supporting seamless zoom and adaptation to the scale requirements of diverse applications. Recent ASSR research encompasses advances in implicit neural representations (INRs), Gaussian field modeling, frequency-domain processing, diffusion models, vector graphics abstraction, and adaptive transformer architectures.

1. Foundational Principles of Arbitrary-Scale Super-Resolution

ASSR models are formulated to encapsulate the underlying image signal as a continuous mapping from spatial coordinates and latent representations to intensity values. Canonical approaches in INR-based ASSR learn a neural function

I_\text{SR}(x, y) = f_\theta\left(\text{feat}_\text{enc}(I_\text{LR}),\, (x, y),\, \text{scale}\right),

with $\text{feat}_\text{enc}$ denoting a CNN or transformer encoder extracting dense features from $I_\text{LR}$, $(x, y)$ denoting HR pixel coordinates (often normalized), and $f_\theta$ a multi-layer perceptron (MLP) decoder tasked with predicting the color at any arbitrary spatial coordinate. The continuity and adaptability of $f_\theta$ obviate the need for scale-specific networks or discrete upsampling heads (Wu et al., 2021, Nguyen et al., 2022).

Distinct from earlier fixed-scale SR architectures, this approach allows a single model to generalize across arbitrary, fractional, or integer scales. The coordinate-conditioned paradigm furnishes a continuous signal representation, supporting non-uniform, anisotropic zoom and resolving the fundamental fixed-scale limitation in conventional methods (Wu et al., 2021, Fang et al., 2022).
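As a minimal sketch of this coordinate-conditioned query, the following toy decoder concatenates a feature vector, a normalized coordinate, and a scale token, then maps them through a small MLP to RGB. The random weights and the exact input layout are illustrative assumptions, not any published model's implementation; the point is that the same function can be queried at any fractional scale.

```python
import numpy as np

def mlp(z, weights):
    """Tiny MLP decoder f_theta: ReLU hidden layers, linear RGB output."""
    h = z
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)          # ReLU
    W, b = weights[-1]
    return h @ W + b                             # (N, 3) RGB values

rng = np.random.default_rng(0)
feat_dim, hidden = 8, 16
# Toy random weights stand in for a trained decoder.
weights = [(rng.standard_normal((feat_dim + 3, hidden)) * 0.1, np.zeros(hidden)),
           (rng.standard_normal((hidden, 3)) * 0.1, np.zeros(3))]

def query(feat, coords, scale):
    """Predict RGB at arbitrary normalized HR coordinates.
    feat: (N, feat_dim) features sampled at the query points,
    coords: (N, 2) in [-1, 1], scale: any (possibly fractional) factor."""
    cell = np.full((len(coords), 1), 1.0 / scale)     # scale conditioning token
    z = np.concatenate([feat, coords, cell], axis=1)
    return mlp(z, weights)

# Query the same continuous function at two different non-integer scales.
coords = rng.uniform(-1, 1, size=(5, 2))
feat = rng.standard_normal((5, feat_dim))
print(query(feat, coords, 2.5).shape)   # (5, 3)
print(query(feat, coords, 3.7).shape)   # (5, 3)
```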

2. Neural Architectures and Implicit Representation Models

2.1 Implicit Neural Representations (INRs)

Most INR-based ASSR frameworks follow an encoder–decoder split. The encoder (e.g., Residual Dense Network, RDN; EDSR; SwinIR) processes $I_\text{LR}$ to produce a spatial feature grid. A typical decoder interpolates feature vectors at subpixel positions, concatenates them with the HR coordinate, and maps the result through a small MLP to an RGB value. Some models, such as the Dual Interactive Implicit Neural Network (DIINN), explicitly disentangle content and positional branches using two interacting MLPs for fine-grained spatial adaptation (Nguyen et al., 2022).
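The subpixel feature lookup that feeds such a decoder can be sketched as plain bilinear interpolation on the feature grid; this is a generic illustration of the sampling step, not a specific paper's code.

```python
import numpy as np

def bilinear_sample(grid, x, y):
    """Sample a (H, W, C) feature grid at a continuous (x, y) in pixel units."""
    H, W, _ = grid.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dy) * (1 - dx) * grid[y0, x0] +
            (1 - dy) * dx       * grid[y0, x1] +
            dy       * (1 - dx) * grid[y1, x0] +
            dy       * dx       * grid[y1, x1])

grid = np.arange(4, dtype=float).reshape(2, 2, 1)   # features [[0, 1], [2, 3]]
print(bilinear_sample(grid, 0.5, 0.5))  # midpoint -> mean of 4 neighbours: [1.5]
```

In practice the interpolated feature is what gets concatenated with the HR coordinate (and often a cell-size token) before the MLP.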

Recent empirical studies have demonstrated that architectural variations beyond the core encoder plus implicit-function design—such as hierarchical, locally-ensembled, or self-attention–augmented INRs—produce only marginal PSNR/SSIM improvements (often less than 0.05 dB), with model capacity, data diversity, and training recipe playing a more decisive role in final performance (Nasir et al., 25 Jan 2026).

2.2 Multi-Scale Transformers and State Space Approaches

To better capture global context and multi-frequency structures, transformer-based backbones and scalable state-space models have been introduced. The Multi-Scale Implicit Transformer (MSIT) incorporates multi-scale neural operators and self-attention blocks tailored for scale-adaptive context fusion, outperforming standard INRs by leveraging richer receptive fields and multi-scale latent code interactions (Zhu et al., 2024). Similarly, S³Mamba replaces the decoder MLP with a scalable, linear-complexity state-space model (SSSM) that modulates its memory and sampling parameters as a function of scale and coordinate, achieving precise continuous-time modeling and global feature fusion (Xia et al., 2024).

A parallel development is the Task-Aware Dynamic Transformer (TADT), which uses a dynamic routing controller to selectively activate self-attention branches in a multi-scale transformer backbone, achieving competitive accuracy with up to 20% fewer FLOPs by matching computational cost to the difficulty of the input image and requested scale (Xu et al., 2024).

2.3 Frequency-Domain and Hybrid Architectures

Frequency-domain models, such as FreeSR and the Frequency-Integrated Transformer (FIT), introduce modules that explicitly model the frequency degradation incurred during downsampling. FreeSR adaptively selects valid low-frequency bands using deep reinforcement learning and recovers high-frequency spectra with a scale-gated recovery module, whereas FIT injects FFT-based features into a transformer core to maximize frequency-spatial synergy and global context aggregation (Fang et al., 2022, Wang et al., 26 Apr 2025).
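The basic frequency-band selection these modules build on can be illustrated with an FFT mask that retains only a low-frequency window; the learned band selection and recovery networks in FreeSR/FIT are of course far richer, so treat this as a conceptual sketch only.

```python
import numpy as np

def lowpass_band(img, keep_frac):
    """Keep only the lowest keep_frac of frequencies per axis via an FFT mask."""
    F = np.fft.fftshift(np.fft.fft2(img))
    H, W = img.shape
    cy, cx = H // 2, W // 2
    ry = max(1, int(H * keep_frac / 2))
    rx = max(1, int(W * keep_frac / 2))
    mask = np.zeros_like(F)
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = 1.0   # centered low-pass window
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
smooth = lowpass_band(img, 0.25)
# Discarding high-frequency energy strictly reduces signal variance.
print(smooth.var() < img.var())  # True
```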

3. Beyond MLPs: Gaussian Splatting, Heat Fields, and Vector Decomposition

3.1 Two-Dimensional Gaussian Splatting

Explicit continuous modeling via 2D Gaussian splatting addresses the limitations of pointwise INR decoders with respect to representation power and computational efficiency. Both GaussianSR and GSASR encode each LR pixel or feature as a continuous Gaussian field with position, covariance, and color, permitting highly parallel rendering at arbitrary HR grids (Hu et al., 2024, Chen et al., 12 Jan 2025). This methodology naturally enables long-range dependencies, adaptive smoothing, and interpretable content-adaptive kernel assignments. Explicit rasterization is realized via highly parallel, differentiable CUDA kernels, resulting in substantial speed advantages over pointwise MLP query loops and memory efficiency at extreme scales.
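The core idea—a set of continuous Gaussian primitives that can be rasterized onto any target grid—can be sketched as follows. This naive per-Gaussian loop only illustrates the representation; GaussianSR/GSASR use learned parameters and fast differentiable CUDA rasterizers rather than anything like this.

```python
import numpy as np

def splat(gaussians, H, W):
    """Render 2D isotropic Gaussians (mu_x, mu_y in [0,1], sigma, rgb)
    onto an arbitrary H x W grid by accumulating each kernel per pixel."""
    ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W),
                         indexing="ij")
    out = np.zeros((H, W, 3))
    for mu_x, mu_y, sigma, rgb in gaussians:
        w = np.exp(-((xs - mu_x) ** 2 + (ys - mu_y) ** 2) / (2 * sigma ** 2))
        out += w[..., None] * np.asarray(rgb)
    return out

gaussians = [(0.3, 0.3, 0.10, (1.0, 0.0, 0.0)),   # red blob
             (0.7, 0.7, 0.15, (0.0, 0.0, 1.0))]   # blue blob
# The same continuous scene rendered at two unrelated resolutions.
lo, hi = splat(gaussians, 16, 16), splat(gaussians, 37, 53)
print(lo.shape, hi.shape)  # (16, 16, 3) (37, 53, 3)
```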

3.2 Neural Heat Fields and Aliasing-Free Rendering

Thera introduces neural heat fields—MLPs parameterized to solve the isotropic heat equation—ensuring that any change in rendered resolution matches the exact physical point spread function of the image acquisition process. By analytically varying the heat-equation time parameter according to the desired output scale, Thera provides provably correct anti-aliasing at any output size, with no extra runtime cost and strong generalization to unseen scales (Becker et al., 2023).
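The underlying identity Thera exploits—that evolving an image under the isotropic heat equation for time $t$ equals Gaussian blurring with $\sigma = \sqrt{2t}$—can be verified numerically. The code below is only this textbook fact in numpy form (with an assumed edge-padding choice), not Thera's neural parameterization; the paper's contribution is building $t$ into the field itself so the blur is exact at any scale.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def heat_blur(img, t):
    """Evolve img under the isotropic heat equation for time t,
    i.e. separable Gaussian blur with sigma = sqrt(2 t)."""
    sigma = np.sqrt(2 * t)
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma) + 1)
    pad = len(k) // 2
    conv = lambda v: np.convolve(np.pad(v, pad, mode="edge"), k, "valid")
    tmp = np.apply_along_axis(conv, 1, img)      # blur rows
    return np.apply_along_axis(conv, 0, tmp)     # then columns

rng = np.random.default_rng(0)
img = rng.random((24, 24))
# A larger downscaling factor calls for a larger t: stronger,
# alias-suppressing smoothing before resampling.
mild, strong = heat_blur(img, t=0.5), heat_blur(img, t=2.0)
print(mild.shape, strong.var() < mild.var())  # (24, 24) True
```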

3.3 Vector Graphics via Stroke-Based Decomposition

For ultra-large upsampling factors, stroke-based methods such as Stroke-based Cyclic Amplifier (SbCA) leverage vector abstraction. The input is decomposed via a policy network into a sequence of Bézier strokes, rendered as high-resolution vector graphics, and then refined by a diffusion-based detail completion module. This two-stage, cyclic strategy preserves sharp geometric primitives at extreme scales while progressively restoring high-frequency details, achieving strong perceptual realism (as measured by LPIPS, MUSIQ, NIQE, etc.) even at ×30 or ×100 (Guo et al., 12 Jun 2025).
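Why strokes help at extreme factors comes down to their being resolution-free primitives: a Bézier curve can be sampled as densely as the target canvas requires without ever aliasing. A minimal cubic Bézier evaluator (generic vector-graphics math, not SbCA's policy network or renderer):

```python
import numpy as np

def bezier(p0, p1, p2, p3, n):
    """Sample n points on a cubic Bezier stroke. The curve is continuous,
    so it can be rasterized at any target size without pixel artifacts."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1 +
            3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

p0, p1, p2, p3 = map(np.array,
                     ([0.0, 0.0], [0.3, 1.0], [0.7, 1.0], [1.0, 0.0]))
coarse = bezier(p0, p1, p2, p3, 8)      # few samples for a small canvas
fine = bezier(p0, p1, p2, p3, 1000)     # dense samples for a x100 canvas
print(coarse[0], fine[-1])  # endpoints are exact: [0. 0.] [1. 0.]
```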

4. Diffusion-Based Models for Arbitrary-Scale Super-Resolution

Diffusion generative models, originally developed for unconditional or text-conditional image synthesis, have been adapted for ASSR both as standalone super-resolution priors and as fidelity–realism balancers. OmniScaleSR exploits a pre-trained latent diffusion prior and introduces explicit, scale-embedded control signals that modulate both global and local structure within every UNet block. This enables the same backbone to adapt its generative–regressive behavior smoothly across the entire range of upsampling factors, producing state-of-the-art perceptual metrics at both modest and extreme scales (Chai et al., 4 Dec 2025). Diff-SR demonstrates that pre-trained diffusion models can be repurposed for ASSR by injecting an optimally tuned amount of noise into the input LR image, with the Perceptual Recoverable Field (PRF) metric quantifying the trade-off between semantic preservation and texture fidelity as a function of scale (Li et al., 2023).
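The noise-injection step behind Diff-SR is the standard forward-diffusion marginal $x_t = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\varepsilon$; the sketch below applies it to a stand-in for an upsampled LR image. The specific $\bar\alpha$ values and the random image are illustrative assumptions—the paper's contribution is selecting the optimal noise level per scale, which is not modeled here.

```python
import numpy as np

def inject_noise(x0, alpha_bar, rng):
    """Forward-diffuse an image to noise level alpha_bar:
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
lr_up = rng.random((32, 32, 3))   # stand-in for a bicubically upsampled LR image
# A larger target scale calls for more noise (smaller alpha_bar), trading
# fidelity to the LR content for generative freedom in synthesized texture.
mild = inject_noise(lr_up, alpha_bar=0.9, rng=rng)
heavy = inject_noise(lr_up, alpha_bar=0.3, rng=rng)
print(mild.shape == heavy.shape == (32, 32, 3))  # True
```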

Multi-stage or cascaded variants (e.g., CasArbi) recursively decompose the overall scale into a sequence of tractable upsampling steps, each handled by a coordinate-aware residual diffusion network, ensuring stability, high fidelity, and smooth transitions even over a wide scale range (Bang et al., 9 Jun 2025).

5. Equivariance, Generalization, and Model Adaptivity

Rotation Equivariant ASSR architectures embed group-equivariant operations at every layer, including group convolution “B-Conv” encoders and equivariant INR-based decoders. This ensures that the output remains consistent under discrete rotations, a property that empirically improves texture and pattern reconstruction, especially on images with strong geometric or isotropic structures (Xie et al., 7 Aug 2025).
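The equivariance property itself is easy to state operationally: rotating the input and then applying the network must equal applying the network and then rotating the output. A small numerical check of this definition (generic test harness with toy operators, unrelated to the paper's B-Conv layers):

```python
import numpy as np

def equivariance_gap(f, img):
    """How far an image operator f is from 90-degree rotation equivariance:
    max | f(rot(img)) - rot(f(img)) |."""
    return np.abs(f(np.rot90(img)) - np.rot90(f(img))).max()

rng = np.random.default_rng(0)
img = rng.random((16, 16))

iso = lambda x: x ** 2                    # pointwise op: exactly equivariant
aniso = lambda x: np.roll(x, 1, axis=1)   # horizontal shift: breaks it
print(equivariance_gap(iso, img))         # 0.0
print(equivariance_gap(aniso, img) > 0)   # True
```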

Generalization across domains, scales, and image types remains a challenge, with model capacity, data diversity, and loss design all exerting significant influence. Empirical studies on INR-based frameworks show diminishing returns from architectural elaboration, highlighting the overriding importance of proper training recipe, objective function (including pixel–gradient hybrid losses for enhanced texture/edge sharpness), and scale/data augmentation (Nasir et al., 25 Jan 2026). Adaptive transformer backbones (TADT) and plug-in modules (ARIS) can further increase efficiency or facilitate domain transfer, allowing existing SR models to benefit from arbitrary-scale protocols (Zhou et al., 2022, Xu et al., 2024).

6. Performance Benchmarks and Limitations

Performance of modern ASSR methods is measured on standard benchmarks (DIV2K, Set5/14, Urban100, Manga109, B100) across a spectrum of integer and fractional scales. The most recent methods—MSIT, GSASR, FIT, OmniScaleSR, and Thera—achieve state-of-the-art PSNR and perceptual scores across seen and unseen scales. Gaussian splatting and neural heat field approaches provide significant computational efficiency, with GSASR rendering at ~1 ms per scale and Thera achieving state-of-the-art accuracy with minimal parameter and memory overhead (Zhu et al., 2024, Chen et al., 12 Jan 2025, Becker et al., 2023).

Limitations remain, including:

  • The domain-dependence of learned features (e.g., training on brain MRI, T₁w scans, or DIV2K natural images does not guarantee transfer to other domains) (Wu et al., 2021).
  • High computational or memory costs for transformer-based or deeply cascaded models at very large HR sizes.
  • Suboptimal perceptual realism or texture fidelity from regression-only (L1/L2) objective functions at extreme scales, except where explicitly addressed by diffusion priors or adversarial modules.
  • The need for efficient anti-aliasing (addressed analytically only in Thera) and instability or error accumulation in recursive/cascaded systems (mitigated by vector- or diffusion-based pipelines).

7. Outlook and Future Directions

Recent advances point to several leading research directions:

  • Tighter integration of signal processing priors (e.g., heat equation, frequency masking, aliasing models) with neural architectures for physically accurate and robust supersampling (Becker et al., 2023, Wang et al., 26 Apr 2025).
  • Model architectures that further combine explicit, content-adaptive, continuous basis representations (e.g., Gaussian fields, vector strokes) with generative refinement for perceptual realism at ultra-large scales (Hu et al., 2024, Guo et al., 12 Jun 2025).
  • Scalability and efficiency via dynamic routing, state-space models, and cross-domain plug-and-play modules, reducing both the computation and storage overhead of universal ASSR deployments (Xia et al., 2024, Xu et al., 2024, Zhou et al., 2022).
  • Extending ASSR frameworks to video, multi-view, and hyperspectral/thermal modalities, where geometric and contextual priors become even more critical (Xie et al., 7 Aug 2025).
  • Systematic methodological benchmarking emphasizing not only architectural innovation but careful control of training protocol and evaluation fidelity, as marginal improvements over established methods continue to shrink (Nasir et al., 25 Jan 2026).

ASSR has emerged as a mature computational area, with diverse robust approaches—spanning INRs, diffusion, frequency analysis, and graphical abstraction—establishing continuous, scale-free super-resolution as a practical and theoretically grounded tool in image restoration and enhancement.
