Multi-Image Super Resolution (MISR)
- Multi-Image Super Resolution (MISR) is a computational framework that fuses complementary sub-pixel details from multiple low-resolution images to reconstruct a high-resolution image.
- Techniques in MISR encompass classical registration, deep learning fusion, and hybrid approaches to address challenges such as alignment, decimation, and image degradation.
- Recent advancements leverage transformer-based aggregations, recursive fusion, and implicit neural representations to enhance performance in applications like remote sensing and biomedical imaging.
Multi-Image Super Resolution (MISR) is a computational imaging framework designed to reconstruct a high-resolution (HR) image from a set of multiple, low-resolution (LR) observations of the same scene. By aggregating complementary sub-pixel information inherently captured through geometric shifts, temporal diversity, or multispectral differences among the LR frames, MISR aims to surpass the fundamental limitations faced by single-image super-resolution (SISR). The domain encompasses model-based optimization approaches, deep learning solutions, and hybrid architectures, widely applied in remote sensing, biomedical imaging, photography, and spectral mapping.
1. Mathematical Formulation and Physical Imaging Models
The canonical MISR problem is modeled as the reconstruction of the HR image $x$ from $K$ LR frames $\{y_k\}_{k=1}^{K}$, each acquired via a distinct forward operator $A_k$ and affected by additive noise $n_k$:

$$y_k = A_k x + n_k = D H_k W_k x + n_k, \qquad k = 1, \dots, K,$$

where $A_k$ typically encodes a spatial warp $W_k$ (subpixel shift), blur $H_k$ (modulation transfer function), decimation $D$ (downsampling), and possibly acquisition-specific effects (e.g., atmospheric distortion, sensor nonlinearities) (Retnanto et al., 30 May 2025, Salvetti et al., 2020, Wang et al., 2017, Jyhne et al., 9 Dec 2025). The objective is to recover an estimator

$$\hat{x} = \arg\min_{x} \sum_{k=1}^{K} \| y_k - A_k x \|^2 + \lambda R(x)$$

that maximizes fidelity to the unknown $x$ subject to the observed LR images and the physics of the acquisition. MISR inverse problems are often approached via maximum a posteriori (MAP) optimization with an explicit regularizer $R$ (e.g., bilateral TV (Wang et al., 2017), deep priors (Retnanto et al., 30 May 2025), or implicit neural representations (Jyhne et al., 9 Dec 2025)).
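The forward model and its MAP inversion can be sketched numerically. The following toy example is a minimal sketch, not any published method: integer circular shifts stand in for the warp $W_k$, the blur $H_k$ is omitted, and a plain $\ell_2$ penalty stands in for $R$; four shifted LR frames are fused by gradient descent.

```python
import numpy as np

def decimate(x, s):
    """Decimation D: average-pool the HR image by factor s."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def shift(x, dy, dx):
    """Integer circular shift standing in for the sub-pixel warp W_k."""
    return np.roll(np.roll(x, dy, axis=0), dx, axis=1)

def forward(x, dy, dx, s):
    """A_k x = D W_k x (blur H_k omitted for brevity)."""
    return decimate(shift(x, dy, dx), s)

def misr_map(lr_frames, shifts, s, lam=1e-3, step=0.5, iters=200):
    """Gradient descent on sum_k ||y_k - A_k x||^2 + lam * ||x||^2."""
    h = lr_frames[0].shape[0] * s
    w = lr_frames[0].shape[1] * s
    x = np.zeros((h, w))
    for _ in range(iters):
        grad = 2.0 * lam * x
        for (dy, dx), y in zip(shifts, lr_frames):
            r = forward(x, dy, dx, s) - y
            # adjoint of decimation: replicate each residual over its
            # s x s block and divide by s^2; adjoint of shift: inverse shift
            up = np.kron(r, np.ones((s, s))) / (s * s)
            grad += 2.0 * shift(up, -dy, -dx)
        x -= step * grad
    return x
```

With shifts covering the sub-pixel grid, the reconstruction recovers detail that no single upsampled frame contains.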
2. Fusion Mechanisms and Alignment Strategies
Precise fusion of sub-pixel information mandates accurate registration of the LR frames and a robust feature aggregation protocol:
2.1. Classical and Deep-Learning Fusion
Classical approaches (Wang et al., 2017, Salvetti et al., 2020) utilize explicit registration algorithms—phase correlation, optical flow, or projective warping—followed by shift-and-add or iterative kernel-based refinement. State-of-the-art neural MISR designs embed registration in the feature extraction stage, either via learned spatial transforms (Retnanto et al., 30 May 2025), deformable convolutions (Huang et al., 26 May 2025), or explicit geometric modeling (epipolar transformers in EpiMISR (Aira et al., 2024)).
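Phase correlation, a standard classical registration step, can be sketched in a few lines (integer shifts only here; production implementations refine the correlation peak to sub-pixel precision), followed by shift-and-add fusion of the aligned frames.

```python
import numpy as np

def phase_correlation(ref, tgt):
    """Estimate the integer translation aligning tgt to ref via the
    normalized cross-power spectrum (classical phase correlation).
    Returns (dy, dx) such that np.roll(tgt, (dy, dx), (0, 1)) ~ ref."""
    F1, F2 = np.fft.fft2(ref), np.fft.fft2(tgt)
    cps = F1 * np.conj(F2)
    cps /= np.abs(cps) + 1e-12          # keep only phase information
    corr = np.fft.ifft2(cps).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    if dy > h // 2: dy -= h             # map wrap-around peaks to
    if dx > w // 2: dx -= w             # negative shifts
    return dy, dx

def shift_and_add(frames, shifts):
    """Align each frame by its estimated shift and average."""
    acc = np.zeros_like(frames[0], dtype=float)
    for f, (dy, dx) in zip(frames, shifts):
        acc += np.roll(np.roll(f, dy, axis=0), dx, axis=1)
    return acc / len(frames)
```

For a frame that is an exact circular shift of the reference, the correlation surface is a delta at the negated displacement, so the estimate is exact.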
2.2. Pairwise and Recursive Fusion
The recursive pairwise fusion methodology, as implemented in HighResNet and SEN4X, merges feature maps via ResBlocks and convolutional mergers:

$$s_i^{(t+1)} = g\big(s_{2i}^{(t)},\, s_{2i+1}^{(t)}\big),$$

where $g$ is the shared ResBlock merger and $s_i^{(t)}$ denotes the $i$-th feature map after $t$ fusion rounds. Applied recursively, halving the set at each round, this yields a single fused feature, enabling effective integration of misaligned sub-pixel information without precomputed shifts (Retnanto et al., 30 May 2025).
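The recursion itself is independent of any learned weights; in this minimal sketch a toy `merge` function stands in for the learned ResBlock merger $g$.

```python
import numpy as np

def merge(a, b):
    """Stand-in for the learned ResBlock merger g(.,.): a residual mean
    plus a toy nonlinearity in place of convolutional layers."""
    return 0.5 * (a + b) + 0.1 * (a - b) ** 2

def recursive_fuse(features):
    """Halve the set by pairwise fusion until one feature map remains,
    duplicating the last element when the count is odd."""
    feats = list(features)
    while len(feats) > 1:
        if len(feats) % 2:
            feats.append(feats[-1])
        feats = [merge(feats[i], feats[i + 1])
                 for i in range(0, len(feats), 2)]
    return feats[0]
```

Because `merge(a, a) == a`, fusing T identical features returns the feature unchanged, and any number of revisits collapses to one map in about log2(T) rounds.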
2.3. Transformer-Based Multiframe Aggregation
MISR models such as ESC-MISR and BurstSR leverage multi-image spatial transformers (MIST) or multi-cross attention encoding to emphasize global spatial correlations and inter-frame interactions (Zhang et al., 2024, Huang et al., 26 May 2025). Random shuffling of LR sequence order further attenuates undesirable temporal dependencies and enforces permutation-invariance in the learned representation (Zhang et al., 2024).
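The permutation-invariance goal can be illustrated with a minimal attention pool over the frame axis. This is an illustrative construction, not the MIST design: the query is the frame mean, so the output is invariant to the ordering of the LR frames by construction.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frame_attention_pool(tokens, Wq, Wk, Wv):
    """Pool a (T, P, C) stack of frame tokens into (P, C): each spatial
    token attends over its T frame-wise counterparts. Using the frame
    mean as the query makes the pooling permutation-invariant."""
    q = tokens.mean(axis=0) @ Wq                            # (P, C)
    k = tokens @ Wk                                         # (T, P, C)
    v = tokens @ Wv                                         # (T, P, C)
    scores = np.einsum('pc,tpc->pt', q, k) / np.sqrt(q.shape[-1])
    att = softmax(scores, axis=-1)                          # over frames
    return np.einsum('pt,tpc->pc', att, v)
```

Shuffling the frame axis permutes the attention weights and the values consistently, so the weighted sum is unchanged, which is exactly the property that random shuffling of the LR sequence encourages during training.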
3. Deep Neural Architectures for MISR
Deep MISR architectures synthesize the representational power of convolutional and attention-based networks for joint reconstruction:
3.1. Hybrid SISR+MISR Pipelines
The SEN4X model fuses multi-view Sentinel-2 features via early fusion and refines them through a Swin Transformer backbone trained on Pleiades Neo data. The architecture is:
- Shallow 3×3 Conv extractors for each revisit.
- Recursive ResBlock-based multi-image fusion.
- Six Residual Swin Transformer Blocks (RSTB) for deep prior learning.
- Pixel-shuffle for upsampling (Retnanto et al., 30 May 2025).
This configuration has demonstrated significant gains in spatial fidelity for land cover segmentation tasks.
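The final pixel-shuffle stage rearranges channel depth into spatial resolution; a minimal NumPy version of the standard operation (matching the usual C*r^2 channel layout) looks like this:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) tensor into (C, H*r, W*r): each group
    of r^2 channels fills an r x r block of the upsampled output."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)
```

Output pixel (i*r + a, j*r + b) comes from channel a*r + b at location (i, j), so no interpolation is performed; the network learns to place sub-pixel content into the extra channels.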
3.2. Attention-Driven Multi-Image Fusion
Transformer hybrids, e.g. CoT-MISR and ESC-MISR, alternate convolutional spatial and channel attention modules with transformer blocks for deep multi-image context integration (Xiu et al., 2023, Zhang et al., 2024). MISAB blocks compute message tokens per patch, followed by spatial cross-attention and global fusion.
3.3. Implicit Neural Field Approaches
SuperF adopts a test-time optimization paradigm with an implicit neural representation (INR): a coordinate network $f_\theta$ maps continuous spatial positions to intensities, and each LR frame is modeled as shifted, downsampled samples of $f_\theta$. Parameterizing sub-pixel alignment homogeneously for all LR frames and optimizing the network weights jointly enables continuous-space HR image reconstruction without external HR training data (Jyhne et al., 9 Dec 2025).
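The core INR idea can be sketched with a linear model over Fourier features fitted jointly to all shifted LR samples. This is a toy stand-in, not the SuperF architecture: `fit_inr`, the coordinate normalization, and the frequency matrix `B` are illustrative choices, and the per-frame shifts are assumed known rather than co-optimized.

```python
import numpy as np

def fourier_features(coords, B):
    """Positional encoding gamma(p) = [sin(2*pi*B p), cos(2*pi*B p)]."""
    proj = 2 * np.pi * coords @ B.T
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def fit_inr(lr_frames, shifts, s, B, lam=1e-6):
    """Fit a linear INR f(p) = gamma(p) @ w jointly to all LR samples,
    offsetting each frame's sample coordinates by its known shift."""
    feats, vals = [], []
    for y, (dy, dx) in zip(lr_frames, shifts):
        h, w = y.shape
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
        coords = np.stack([(ys * s + dy).ravel() / (h * s),
                           (xs * s + dx).ravel() / (w * s)], axis=-1)
        feats.append(fourier_features(coords, B))
        vals.append(y.ravel())
    A = np.vstack(feats)
    b = np.concatenate(vals)
    # ridge-regularized normal equations in place of gradient descent
    w_ = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
    return lambda coords: fourier_features(coords, B) @ w_
```

Once fitted, the returned function can be queried at any continuous coordinate, including a denser HR grid than any single input frame provides.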
4. Optimization Objectives and Regularization
MISR models generally minimize a pixel-wise loss with optional regularization:
- $\ell_1$ or $\ell_2$ reconstruction loss between $\hat{x}$ and the ground-truth $x$ (Retnanto et al., 30 May 2025, Agarwal et al., 9 Jan 2026), or clear-pixel MSE for cloud-masked remote sensing data (Lee, 2022).
- High-pass PSNR to emphasize recovery of fine details (Kawulok et al., 2019).
- Bilateral total variation (BTV) regularization to preserve edges while smoothing noise (Wang et al., 2017).
- Radiometric consistency terms to enforce fidelity of multispectral SR to sensor-specific distributions (Razzak et al., 2021).
Adversarial and perceptual losses (SSIM, LPIPS) are rarely used in standard MISR for scientific imaging due to their tendency to induce hallucinated structures; domain-appropriate losses may be investigated for specific applications.
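Two of these objectives admit compact implementations; the sketch below computes clear-pixel MSE over a validity mask and the bilateral TV regularizer in its standard form (weighted sum of absolute differences against shifted copies, with decay factor alpha).

```python
import numpy as np

def clear_pixel_mse(pred, target, clear_mask):
    """MSE computed only over pixels flagged as clear (e.g., cloud-free)."""
    m = clear_mask.astype(bool)
    return np.mean((pred[m] - target[m]) ** 2)

def bilateral_tv(x, window=2, alpha=0.7):
    """BTV: sum over shifts (l, m) in [-window, window] of
    alpha^(|l|+|m|) * ||x - shift(x, l, m)||_1 (circular shifts here)."""
    r = 0.0
    for l in range(-window, window + 1):
        for m in range(-window, window + 1):
            if l == 0 and m == 0:
                continue
            shifted = np.roll(np.roll(x, l, axis=0), m, axis=1)
            r += alpha ** (abs(l) + abs(m)) * np.abs(x - shifted).sum()
    return r
```

The BTV term vanishes on constant images and penalizes noise over a multi-pixel neighborhood while keeping edges comparatively cheap, which is why it is favored over plain Tikhonov smoothing in classical MISR.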
5. Empirical Performance, Downstream Utility, and Comparative Analysis
5.1. Image and Task-Level Metrics
MISR models are benchmarked using PSNR, SSIM, cPSNR (corrected-clear PSNR), LPIPS, and task-specific metrics such as IoU for segmentation (Retnanto et al., 30 May 2025, Razzak et al., 2021).
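PSNR and a simplified cPSNR can be computed as below; the brightness-bias correction over clear pixels follows the PROBA-V convention, though the full protocol additionally searches over small registration shifts, which this sketch omits.

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Standard peak signal-to-noise ratio in dB."""
    mse = np.mean((pred - target) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def cpsnr(pred, target, clear_mask, peak=1.0):
    """Clear-pixel PSNR with brightness-bias correction: a constant
    radiometric offset between SR and HR is removed before scoring."""
    m = clear_mask.astype(bool)
    b = np.mean(target[m] - pred[m])               # brightness bias
    cmse = np.mean((target[m] - pred[m] - b) ** 2)
    return 10 * np.log10(peak ** 2 / cmse)
```

Because the bias term absorbs any global offset, cPSNR rewards structural fidelity rather than absolute radiometry, and is never lower than plain PSNR on the same prediction.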
Sentinel-2 Land Cover Classification (Hanoi):
| Method | Overall Acc. | Mean IoU |
|---|---|---|
| Bicubic | 44.0% | 0.278 |
| Swin2SR | 71.4% | 0.489 |
| HighResNet | 58.3% | 0.387 |
| SEN4X | 74.6% | 0.516 |
MISR-based SR substantially boosts semantic segmentation performance compared to both bicubic and pure SISR. Qualitative analysis shows improved delineation of roads, buildings, and vegetation boundaries.
5.2. Domain-Specific Validation
- In remote sensing, fusion models (HighResNet, RAMS, TR-MISR, CoT-MISR, ESC-MISR) set new cPSNR/cSSIM records on PROBA-V validation/test sets (Zhang et al., 2024).
- Medical image CMISR demonstrates +0.3 to +0.6 dB PSNR improvements and sharper edge recovery on MRI/CT/X-ray datasets versus open-loop baselines (Li et al., 2023).
- BurstSR and EpiMISR outperform flow-based approaches under large disparities and arbitrary camera poses in synthetic and real photography (Aira et al., 2024, Huang et al., 26 May 2025).
- MI-DRCT achieves structurally sharper root hair visibility in underground plant imaging (Agarwal et al., 9 Jan 2026).
- Multi-BVOC emission mapping shows that fusion of uncorrelated auxiliary compounds yields better HR map reconstruction than highly correlated ones (Giganti et al., 2023).
6. Algorithmic Efficiency, Scalability, and Practical Considerations
6.1. Computational Bottlenecks and Speedups
Classical patch-based, shift-and-add implementations are computationally demanding; fast upscaling techniques leveraging LR-domain filters and pixel shuffling improve runtime by 30–50% (Wang et al., 2017). FL-MISR introduces multi-GPU distributed scaled conjugate gradient (SCG), inner-outer border exchange, and on-the-fly CT pipeline reconstruction, achieving over 50× speedup versus CPU while delivering full resolution recovery in industrial CT acquisition (Sun et al., 2021).
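The border-exchange idea can be illustrated with a single-process sketch: row tiles are processed independently with a one-row halo and then stitched, reproducing the full-image result exactly. A 3x3 box filter stands in for one local solver step; this is an illustration of halo exchange, not the FL-MISR implementation.

```python
import numpy as np

def box3(x):
    """3x3 box filter with zero padding (stand-in for a local solver step)."""
    p = np.pad(x, 1)
    return sum(p[dy:dy + x.shape[0], dx:dx + x.shape[1]]
               for dy in range(3) for dx in range(3)) / 9.0

def tiled_box3(x, n_tiles, halo=1):
    """Process row tiles independently with a halo of rows borrowed from
    each neighbor, then stitch only the interior rows back together."""
    h = x.shape[0]
    bounds = np.linspace(0, h, n_tiles + 1).astype(int)
    out = np.zeros_like(x, dtype=float)
    for a, b in zip(bounds[:-1], bounds[1:]):
        lo, hi = max(0, a - halo), min(h, b + halo)
        tile = box3(x[lo:hi])                       # compute on tile + halo
        out[a:b] = tile[a - lo: a - lo + (b - a)]   # keep interior rows
    return out
```

Each tile's zero-padded borders are discarded in the stitch, so the halo rows supply exactly the neighbor data a distributed worker would receive via border exchange, and the tiled result matches the monolithic computation bit for bit.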
6.2. Reference Awareness
Recent work has elucidated the critical importance of explicit reference selection in MISR datasets and benchmarks. PROBA-V-REF, which provides the true LR-HR temporal correspondence, reveals +2.5-3 dB PSNR gains over blind fusion and more accurately reflects fusion quality across methods (Nguyen et al., 2021). Future evaluations, architectures, and datasets should prioritize reference-aware training and metrics.
7. Limitations, Extensions, and Future Research
MISR remains subject to key limitations and open avenues:
- Most methods assume static scenes or restricted dynamics (exceptions: video SR and EpiMISR extensions to dynamic settings) (Aira et al., 2024, Haris et al., 2019).
- High computational cost for large input stacks, order-invariance, and spatial attention blocks—addressed via shuffle strategies and attention sparsification (Zhang et al., 2024, Huang et al., 26 May 2025).
- Absence of external HR training data for test-time optimization (SuperF); need for domain-specific tuning of INR spectral scales (Jyhne et al., 9 Dec 2025).
- Potential degradation under erroneous quality maps or heavily occluded input frames (Lee, 2022).
- Fusion strategies for multi-field data remain to be systematized: uncorrelated auxiliary sources improve accuracy (as in BVOC mapping), a general principle worthy of cross-domain adoption (Giganti et al., 2023).
- Synergistic fusion of single- and multi-image SR networks (SEN4X), attention-driven architectures, and physically informed fusion layers pave the way for adaptive, multispectral, and real-time applications (Retnanto et al., 30 May 2025).
References
Key works referenced herein include "Beyond Pretty Pictures: Combined Single- and Multi-Image Super-resolution for Sentinel-2 Images" (Retnanto et al., 30 May 2025), "Deep 3D World Models for Multi-Image Super-Resolution Beyond Optical Flow" (Aira et al., 2024), "Multi-image Super-resolution via Quality Map Associated Attention Network" (Lee, 2022), "SuperF: Neural Implicit Fields for Multi-Image Super-Resolution" (Jyhne et al., 9 Dec 2025), "ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing" (Zhang et al., 2024), "Multi-Image Super Resolution of Remotely Sensed Images using Residual Feature Attention Deep Neural Networks" (Salvetti et al., 2020), "FL-MISR: Fast Large-Scale Multi-Image Super-Resolution for Computed Tomography Based on Multi-GPU Acceleration" (Sun et al., 2021), and "Proba-V-ref: Repurposing the Proba-V challenge for reference-aware super resolution" (Nguyen et al., 2021).