
Multi-Scale Frame Synthesis Network (MSFSN)

Updated 29 January 2026
  • The paper introduces MSFSN, a unified framework that synthesizes intermediate or future video frames using a multi-scale pyramid architecture and transitive consistency loss.
  • MSFSN employs weight-shared CNN sub-networks and normalized temporal conditioning to enable interpolation and extrapolation without explicit optical flow estimation.
  • The model achieves competitive pixel accuracy and perceptual quality while remaining compact, and it robustly handles occlusions and large deformations in video sequences.

The Multi-Scale Frame-Synthesis Network (MSFSN) is a unified deep learning framework for high-fidelity video frame interpolation and extrapolation. Distinguished by its multi-scale (pyramid) architecture, normalized temporal conditioning, and the introduction of a transitive consistency loss, MSFSN synthesizes intermediate or future frames directly from input pairs, eschewing explicit optical flow estimation. This approach achieves competitive performance against state-of-the-art optical-flow-based and learning-based video synthesis methods in terms of pixel accuracy, perceptual quality, and computational efficiency (Hu et al., 2017).

1. Architectural Principles

The MSFSN is structured as a hierarchical pyramid with $S$ levels (default $S = 4$). At the coarsest level, sub-network $N_1$ processes two downsampled input frames, $x_{t_1}^1, x_{t_2}^1 \in \mathbb{R}^{m/2^{S-1} \times n/2^{S-1}}$, producing a coarse prediction $y_{t_p}^1$. Each subsequent level $s = 2, \ldots, S$ utilizes a single sub-network $N$ (weight-shared across levels), which receives downsampled inputs $(x_{t_1}^s, x_{t_2}^s)$ as well as an upsampled ($2\times$ via pixel-shuffle) version of the previous scale's prediction, forming $(x_{t_1}^s, x_{t_2}^s, \uparrow y_{t_p}^{s-1})$ as its input.

Each sub-network comprises:

  • An initial $5\times5$ convolutional layer with 64 filters,
  • $D$ residual blocks (each containing two $5\times5$ convolutions with skip-connections and LeakyReLU activations with $\alpha=0.2$, omitting batch-normalization),
  • A final $5\times5$ convolution outputting the RGB frame prediction.

This parameter-sharing design makes model size independent of $S$. Empirically, $D=9$ strikes a practical balance between inference speed and frame quality.
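The coarse-to-fine recursion above can be sketched in a few lines of numpy. This is a shapes-only illustration, not the actual model: the `coarse` and `fine` callables are placeholder blend functions standing in for the residual CNN sub-networks, average-pooling stands in for the paper's downsampling, and nearest-neighbour upsampling stands in for pixel-shuffle.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool downsampling by an integer factor (pyramid stand-in)."""
    m, n, c = x.shape
    return x.reshape(m // factor, factor, n // factor, factor, c).mean(axis=(1, 3))

def upsample2x(y):
    """Nearest-neighbour 2x upsampling (pixel-shuffle stand-in)."""
    return y.repeat(2, axis=0).repeat(2, axis=1)

def pyramid_forward(x1, x2, subnet_coarse, subnet, S=4):
    """Coarse-to-fine synthesis: the same `subnet` is reused at levels 2..S."""
    # Level 1: coarsest prediction from the downsampled inputs only.
    f = 2 ** (S - 1)
    y = subnet_coarse(downsample(x1, f), downsample(x2, f))
    # Levels 2..S: also condition on the upsampled previous prediction.
    for s in range(2, S + 1):
        f = 2 ** (S - s)
        y = subnet(downsample(x1, f) if f > 1 else x1,
                   downsample(x2, f) if f > 1 else x2,
                   upsample2x(y))
    return y

# Placeholder sub-networks that simply blend their inputs.
coarse = lambda a, b: 0.5 * (a + b)
fine = lambda a, b, prev: (a + b + prev) / 3.0

x1 = np.random.rand(128, 128, 3)
x2 = np.random.rand(128, 128, 3)
out = pyramid_forward(x1, x2, coarse, fine, S=4)
print(out.shape)  # (128, 128, 3)
```

Because the shared `subnet` is invoked once per level, adding pyramid levels changes compute but not parameter count, which is the independence-of-$S$ property noted above.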

2. Temporal Conditioning via Relative Indexing

MSFSN casts frame synthesis as a continuous function of a normalized relative timestamp,

r_p = \frac{t_p - t_1}{t_2 - t_1} \in (-\infty, \infty)

This scalar, packaged as an additional input channel, is concatenated to both input frames and (as needed) to generated frames for loss computation. During inference, $r_p \in (0,1)$ requests interpolation, while $r_p < 0$ or $r_p > 1$ performs extrapolation, without requiring model reconfiguration or retraining. This parameterization enables flexible, position-agnostic prediction across the temporal span defined by $\{x_{t_1}, x_{t_2}\}$.
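The relative-index convention is simple enough to state as code. A minimal sketch; the function and mode names are illustrative, not from the paper:

```python
def relative_index(t1, t2, tp):
    """Normalized relative timestamp r_p = (t_p - t_1) / (t_2 - t_1)."""
    return (tp - t1) / (t2 - t1)

def mode(rp):
    """Interpolation for r_p in (0, 1); any other value is extrapolation."""
    return "interpolation" if 0.0 < rp < 1.0 else "extrapolation"

# Frames at t1=0 and t2=2: the midpoint is interpolated, t=3 is extrapolated.
print(relative_index(0, 2, 1), mode(relative_index(0, 2, 1)))  # 0.5 interpolation
print(relative_index(0, 2, 3), mode(relative_index(0, 2, 3)))  # 1.5 extrapolation
```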

3. Objective and Transitive Consistency Loss

MSFSN is optimized via a composite objective:

\mathcal{L}(G, D) = \mathcal{L}_{pix} + \lambda_{feat}\,\mathcal{L}_{feat} + \lambda_{GAN}\,\mathcal{L}_{GAN} + \lambda_{tran}\,\mathcal{L}_{tran}

  • Pixel Reconstruction Loss ($\ell_1$): Enforces fidelity to the ground truth $y_{t_p}$,

\mathcal{L}_{pix}(G) = \mathbb{E}\left[\|G(x_{t_1}, x_{t_2}, t_p) - y_{t_p}\|_1\right]

  • Feature (Perceptual) Loss: Measures the difference in VGG-16 feature space,

\mathcal{L}_{feat}(G) = \mathbb{E}\left[\|\phi(G(x_{t_1}, x_{t_2}, t_p)) - \phi(y_{t_p})\|_2\right]

  • Adversarial Loss: Employs a discriminator $D$ to promote photorealism,

\mathcal{L}_{GAN}(G, D) = \mathbb{E}\left[\log D(y_{t_p})\right] + \mathbb{E}\left[\log(1 - D(G(x_{t_1}, x_{t_2}, t_p)))\right]

  • Transitive Consistency Loss: Encourages the generator $G$ to be time-reversible,

\mathcal{L}_{tran}(G) = \mathbb{E}\left[\|G(x_{t_1}, y_{t_p}, t_2) - x_{t_2}\|_1\right] + \mathbb{E}\left[\|G(y_{t_p}, x_{t_2}, t_1) - x_{t_1}\|_1\right]

This regularization requires $G$ to reconstruct $x_{t_2}$ and $x_{t_1}$ when the prediction $y_{t_p}$ is treated as an intermediate input frame, driving temporal coherence and consistency in the synthesized results.
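A minimal numpy sketch of the transitive term, under one simplifying assumption: here the generator is conditioned directly on the relative index $r_p$ rather than on raw timestamps, so the "reverse" queries become rescaled indices. The linear-blend `G` is a toy generator used only to show that an exactly time-reversible model drives this loss to zero; it is not the paper's network.

```python
import numpy as np

def l1(a, b):
    """Mean absolute (l1) difference between two frames."""
    return float(np.abs(a - b).mean())

def transitive_loss(G, x1, x2, rp):
    """L_tran for a generator conditioned on the relative index r_p.

    The prediction y, treated as an input frame, should map back onto the
    original endpoints: query t2 from the pair (x1, y), and t1 from (y, x2).
    """
    y = G(x1, x2, rp)
    r_to_x2 = 1.0 / rp           # t2's relative position within the span (t1, tp)
    r_to_x1 = -rp / (1.0 - rp)   # t1's relative position within the span (tp, t2)
    return l1(G(x1, y, r_to_x2), x2) + l1(G(y, x2, r_to_x1), x1)

# Toy generator: a linear blend by relative index. It is exactly
# time-reversible, so its transitive loss is ~0 up to float rounding.
G = lambda a, b, r: (1 - r) * a + r * b

x1, x2 = np.random.rand(8, 8, 3), np.random.rand(8, 8, 3)
loss = transitive_loss(G, x1, x2, 0.5)
print(loss < 1e-9)  # True
```

A generator that hallucinates content inconsistent with either endpoint cannot reproduce $x_{t_1}$ and $x_{t_2}$ from its own output, which is why this term acts as a temporal-coherence regularizer.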

4. Training Methodology and Hyperparameters

Training relies on triplets $(x_{t_1}, x_{t_2}, t_p)$ sampled randomly from the GOPRO, UCF-101, and THUMOS-15 datasets. Preprocessing includes rotations, flips, noise injection, and cropping (to $128\times128$ patches), with mirror-padding at test time to ensure divisibility by $2^{S-1}$. Optimization utilizes the Adam algorithm ($\beta_1=0.9$, $\beta_2=0.999$), with mini-batches of eight and a staged learning-rate schedule (initially $10^{-4}$ for generator pre-training, then annealed during adversarial fine-tuning). Loss weights are set as $\lambda_{feat}=2\times10^{-5}$, $\lambda_{GAN}=5\times10^{-2}$, and $\lambda_{tran}=0.2$.
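The test-time mirror-padding requirement can be sketched concretely: each spatial dimension must be padded up to the next multiple of $2^{S-1}$ so that every pyramid level receives integer-sized inputs. The function name below is illustrative, not from the paper.

```python
import numpy as np

def mirror_pad_to_multiple(x, S=4):
    """Mirror-pad H and W so each is divisible by 2**(S-1) (test-time only)."""
    f = 2 ** (S - 1)
    h, w = x.shape[:2]
    pad_h = (-h) % f  # padding needed to reach the next multiple of f
    pad_w = (-w) % f
    return np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")

# A 125x130 frame must be padded to 128x136 for an S=4 pyramid.
x = np.zeros((125, 130, 3))
print(mirror_pad_to_multiple(x, S=4).shape)  # (128, 136, 3)
```

After synthesis, the padded margins would simply be cropped away to recover the original resolution.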

During both training and inference, the number of pyramid levels $S$ may be varied, with more levels enhancing robustness to large motions due to coarser initial predictions. For all tasks, losses are computed solely at the finest scale, $S$.

5. Quantitative and Qualitative Performance

Quantitative Benchmarks

MSFSN demonstrates strong results on both interpolation and extrapolation tasks across UCF-101 and THUMOS-15 test sets, measured using PSNR and SSIM:

| Method | Interp UCF-101 (PSNR/SSIM) | Interp THUMOS-15 (PSNR/SSIM) | Extrap UCF-101 (PSNR/SSIM) | Extrap THUMOS-15 (PSNR/SSIM) |
|---|---|---|---|---|
| BeyondMSE | 32.8 / 0.93 | 32.3 / 0.91 | 30.6 / 0.90 | 30.2 / 0.89 |
| EpicFlow | 34.2 / 0.95 | 33.9 / 0.94 | 31.3 / 0.92 | 31.0 / 0.92 |
| FlowNet2 | 34.0 / 0.94 | 33.8 / 0.94 | 31.8 / 0.92 | 31.7 / 0.92 |
| DVF | 35.8 / 0.95 | 35.4 / 0.95 | 32.7 / 0.93 | 32.2 / 0.92 |
| AdapSC | 36.2 / 0.95 | 36.4 / 0.96 | n/a | n/a |
| Ours-Gen | 36.0 / 0.95 | 35.5 / 0.95 | 32.8 / 0.93 | 32.2 / 0.92 |
| Ours | 35.8 / 0.95 | 35.2 / 0.95 | 32.4 / 0.93 | 31.9 / 0.92 |

Model compactness is notable, with 7.4M parameters and a memory footprint of 29.7MB (for $D=9$), significantly smaller than DVF and AdapSC.

Qualitative Observations and Limitations

MSFSN exhibits fewer warping artifacts than optical-flow approaches in scenarios with occlusion and large deformations, generating visually plausible, temporally coherent frames. The adversarial loss notably enhances sharpness and texture. In multi-frame interpolation (e.g., on KITTI), MSFSN surpasses two-stage synthesis approaches in preserving structural regularity.

User study results (42 participants) indicate a consistent preference for MSFSN outputs over those of FlowNet2, EpicFlow, and DVF, and parity with AdapSC.

Limitations include a tendency for results to appear slightly blurrier than correspondence-based approaches (e.g., AdapSC) when optical flow is reliable, and a restricted capacity to model highly non-linear motions due to the scalar parameterization of time.

6. Significance and Application Scope

MSFSN unifies interpolation and extrapolation within a single well-regularized model, dispensing with dataset- or task-specific retraining. Its pyramid design and weight-sharing mechanisms facilitate adaptation to a broad range of motions and video domains. The model achieves a favorable trade-off between visual fidelity, artifact minimization, and computational efficiency. Its ability to leverage arbitrary relative positions ($r_p \in \mathbb{R}$) enables continuous frame synthesis within and beyond the initially observed temporal window, which is critical for applications in frame upsampling, video restoration, and predictive video modeling.

7. Relation to Prior Work and Future Directions

Unlike classical interpolation/extrapolation techniques that depend on optical flow or explicit pixel correspondences, which often produce artifacts when flow estimation fails, MSFSN accomplishes frame synthesis in a direct, feedforward manner and shows robust performance across occlusions and large deformations (Hu et al., 2017). Compared to earlier autoencoder approaches, MSFSN's multi-scale parameterized design and transitive loss yield improved flexibility and temporal consistency.

A plausible implication is that future research may extend MSFSN's parameterization (e.g., higher-order time conditioning) or integrate explicit trajectory modeling to address its limitations with complex nonlinear motions. Model compression and domain adaptation could advance deployment in real-time or resource-constrained settings.
