
Multi-Scale Frame Synthesis Network (MSFSN)

Updated 29 January 2026
  • The paper introduces MSFSN, a unified framework that synthesizes intermediate or future video frames using a multi-scale pyramid architecture and transitive consistency loss.
  • MSFSN employs weight-shared CNN sub-networks and normalized temporal conditioning to enable interpolation and extrapolation without explicit optical flow estimation.
  • The model achieves competitive pixel accuracy and perceptual quality while remaining compact, and it robustly handles occlusions and large deformations in video sequences.

The Multi-Scale Frame-Synthesis Network (MSFSN) is a unified deep learning framework for high-fidelity video frame interpolation and extrapolation. Distinguished by its multi-scale (pyramid) architecture, normalized temporal conditioning, and the introduction of a transitive consistency loss, MSFSN synthesizes intermediate or future frames directly from input pairs, eschewing explicit optical flow estimation. This approach achieves competitive performance against state-of-the-art optical-flow-based and learning-based video synthesis methods in terms of pixel accuracy, perceptual quality, and computational efficiency (Hu et al., 2017).

1. Architectural Principles

The MSFSN is structured as a hierarchical pyramid with $S$ levels (default $S = 4$). At the coarsest level, sub-network $N_1$ processes two downsampled input frames, $x_{t_1}^1, x_{t_2}^1 \in \mathbb{R}^{m/2^{S-1} \times n/2^{S-1}}$, producing a coarse prediction $y_{t_p}^1$. Each subsequent level $s = 2, \ldots, S$ utilizes a single sub-network $N$ (weight-shared across levels), which receives downsampled inputs $(x_{t_1}^s, x_{t_2}^s)$ as well as an upsampled ($2\times$ via pixel-shuffle) version of the previous scale's prediction, forming $(x_{t_1}^s, x_{t_2}^s, \uparrow y_{t_p}^{s-1})$ as its input.

Each sub-network comprises:

  • An initial $5\times5$ convolutional layer with 64 filters,
  • $D$ residual blocks (each containing two $5\times5$ convolutions with skip-connections and LeakyReLU activations with $\alpha=0.2$, omitting batch-normalization),
  • A final $5\times5$ convolution outputting the RGB frame prediction.

This parameter-sharing design makes model size independent of $S$. Empirically, $D=9$ strikes a practical balance between inference speed and frame quality.
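The coarse-to-fine recursion above can be sketched in a few lines of numpy. This is a shapes-only illustration, not the actual model: the `coarse` and `fine` callables are placeholder blend functions standing in for the residual CNN sub-networks, average-pooling stands in for the paper's downsampling, and nearest-neighbour upsampling stands in for pixel-shuffle.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool downsampling by an integer factor (pyramid stand-in)."""
    m, n, c = x.shape
    return x.reshape(m // factor, factor, n // factor, factor, c).mean(axis=(1, 3))

def upsample2x(y):
    """Nearest-neighbour 2x upsampling (pixel-shuffle stand-in)."""
    return y.repeat(2, axis=0).repeat(2, axis=1)

def pyramid_forward(x1, x2, subnet_coarse, subnet, S=4):
    """Coarse-to-fine synthesis: the same `subnet` is reused at levels 2..S."""
    # Level 1: coarsest prediction from the downsampled inputs only.
    f = 2 ** (S - 1)
    y = subnet_coarse(downsample(x1, f), downsample(x2, f))
    # Levels 2..S: also condition on the upsampled previous prediction.
    for s in range(2, S + 1):
        f = 2 ** (S - s)
        y = subnet(downsample(x1, f) if f > 1 else x1,
                   downsample(x2, f) if f > 1 else x2,
                   upsample2x(y))
    return y

# Placeholder sub-networks that simply blend their inputs.
coarse = lambda a, b: 0.5 * (a + b)
fine = lambda a, b, prev: (a + b + prev) / 3.0

x1 = np.random.rand(128, 128, 3)
x2 = np.random.rand(128, 128, 3)
out = pyramid_forward(x1, x2, coarse, fine, S=4)
print(out.shape)  # (128, 128, 3)
```

Because the shared `subnet` is invoked once per level, adding pyramid levels changes compute but not parameter count, which is the independence-of-$S$ property noted above.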

2. Temporal Conditioning via Relative Indexing

MSFSN casts frame synthesis as a continuous function of a normalized relative timestamp,

r_p = \frac{t_p - t_1}{t_2 - t_1} \in (-\infty, \infty)

This scalar, packaged as an additional input channel, is concatenated to both input frames and (as needed) to generated frames for loss computation. During inference, $r_p \in (0,1)$ requests interpolation, while $r_p < 0$ or $r_p > 1$ performs extrapolation, without requiring model reconfiguration or retraining. This parameterization enables flexible, position-agnostic prediction across the temporal span defined by $\{x_{t_1}, x_{t_2}\}$.
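The relative-index convention is simple enough to state as code. A minimal sketch; the function and mode names are illustrative, not from the paper:

```python
def relative_index(t1, t2, tp):
    """Normalized relative timestamp r_p = (t_p - t_1) / (t_2 - t_1)."""
    return (tp - t1) / (t2 - t1)

def mode(rp):
    """Interpolation for r_p in (0, 1); any other value is extrapolation."""
    return "interpolation" if 0.0 < rp < 1.0 else "extrapolation"

# Frames at t1=0 and t2=2: the midpoint is interpolated, t=3 is extrapolated.
print(relative_index(0, 2, 1), mode(relative_index(0, 2, 1)))  # 0.5 interpolation
print(relative_index(0, 2, 3), mode(relative_index(0, 2, 3)))  # 1.5 extrapolation
```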

3. Objective and Transitive Consistency Loss

MSFSN is optimized via a composite objective:

\mathcal{L}(G, D) = \mathcal{L}_{pix} + \lambda_{feat}\,\mathcal{L}_{feat} + \lambda_{GAN}\,\mathcal{L}_{GAN} + \lambda_{tran}\,\mathcal{L}_{tran}

  • Pixel Reconstruction Loss ($\ell_1$): Enforces fidelity to the ground truth $y_{t_p}$,

\mathcal{L}_{pix}(G) = \mathbb{E}\left[\|G(x_{t_1}, x_{t_2}, t_p) - y_{t_p}\|_1\right]

  • Feature (Perceptual) Loss: Measures the difference in VGG-16 feature space,

\mathcal{L}_{feat}(G) = \mathbb{E}\left[\|\phi(G(x_{t_1}, x_{t_2}, t_p)) - \phi(y_{t_p})\|_2\right]

  • Adversarial Loss: Employs a discriminator $D$ to promote photorealism,

\mathcal{L}_{GAN}(G, D) = \mathbb{E}\left[\log D(y_{t_p})\right] + \mathbb{E}\left[\log(1 - D(G(x_{t_1}, x_{t_2}, t_p)))\right]

  • Transitive Consistency Loss: Encourages the generator $G$ to be time-reversible,

\mathcal{L}_{tran}(G) = \mathbb{E}\left[\|G(x_{t_1}, y_{t_p}, t_2) - x_{t_2}\|_1\right] + \mathbb{E}\left[\|G(y_{t_p}, x_{t_2}, t_1) - x_{t_1}\|_1\right]

This regularization requires $G$ to reconstruct $x_{t_2}$ and $x_{t_1}$ when the prediction $y_{t_p}$ is treated as an intermediate input frame, driving temporal coherence and consistency in the synthesized results.
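A minimal numpy sketch of the transitive term, under one simplifying assumption: here the generator is conditioned directly on the relative index $r_p$ rather than on raw timestamps, so the "reverse" queries become rescaled indices. The linear-blend `G` is a toy generator used only to show that an exactly time-reversible model drives this loss to zero; it is not the paper's network.

```python
import numpy as np

def l1(a, b):
    """Mean absolute (l1) difference between two frames."""
    return float(np.abs(a - b).mean())

def transitive_loss(G, x1, x2, rp):
    """L_tran for a generator conditioned on the relative index r_p.

    The prediction y, treated as an input frame, should map back onto the
    original endpoints: query t2 from the pair (x1, y), and t1 from (y, x2).
    """
    y = G(x1, x2, rp)
    r_to_x2 = 1.0 / rp           # t2's relative position within the span (t1, tp)
    r_to_x1 = -rp / (1.0 - rp)   # t1's relative position within the span (tp, t2)
    return l1(G(x1, y, r_to_x2), x2) + l1(G(y, x2, r_to_x1), x1)

# Toy generator: a linear blend by relative index. It is exactly
# time-reversible, so its transitive loss is ~0 up to float rounding.
G = lambda a, b, r: (1 - r) * a + r * b

x1, x2 = np.random.rand(8, 8, 3), np.random.rand(8, 8, 3)
loss = transitive_loss(G, x1, x2, 0.5)
print(loss < 1e-9)  # True
```

A generator that hallucinates content inconsistent with either endpoint cannot reproduce $x_{t_1}$ and $x_{t_2}$ from its own output, which is why this term acts as a temporal-coherence regularizer.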

4. Training Methodology and Hyperparameters

Training relies on triplets $(x_{t_1}, x_{t_2}, t_p)$ sampled randomly from the GOPRO, UCF-101, and THUMOS-15 datasets. Preprocessing includes rotations, flips, noise injection, and cropping (to $128\times128$ patches), with mirror-padding at test time to ensure divisibility by $2^{S-1}$. Optimization utilizes the Adam algorithm ($\beta_1=0.9$, $\beta_2=0.999$), with mini-batches of eight and a staged learning-rate schedule (initially $10^{-4}$ for generator pre-training, then annealed during adversarial fine-tuning). Loss weights are set as $\lambda_{feat}=2\times10^{-5}$, $\lambda_{GAN}=5\times10^{-2}$, and $\lambda_{tran}=0.2$.
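The test-time mirror-padding requirement can be sketched concretely: each spatial dimension must be padded up to the next multiple of $2^{S-1}$ so that every pyramid level receives integer-sized inputs. The function name below is illustrative, not from the paper.

```python
import numpy as np

def mirror_pad_to_multiple(x, S=4):
    """Mirror-pad H and W so each is divisible by 2**(S-1) (test-time only)."""
    f = 2 ** (S - 1)
    h, w = x.shape[:2]
    pad_h = (-h) % f  # padding needed to reach the next multiple of f
    pad_w = (-w) % f
    return np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")

# A 125x130 frame must be padded to 128x136 for an S=4 pyramid.
x = np.zeros((125, 130, 3))
print(mirror_pad_to_multiple(x, S=4).shape)  # (128, 136, 3)
```

After synthesis, the padded margins would simply be cropped away to recover the original resolution.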

During both training and inference, the number of pyramid levels $S$ may be varied, with more levels enhancing robustness to large motions due to coarser initial predictions. For all tasks, losses are computed solely at the finest scale, $S$.

5. Quantitative and Qualitative Performance

Quantitative Benchmarks

MSFSN demonstrates strong results on both interpolation and extrapolation tasks across UCF-101 and THUMOS-15 test sets, measured using PSNR and SSIM:

| Method | Interp UCF-101 (PSNR/SSIM) | Interp THUMOS-15 (PSNR/SSIM) | Extrap UCF-101 (PSNR/SSIM) | Extrap THUMOS-15 (PSNR/SSIM) |
|---|---|---|---|---|
| BeyondMSE | 32.8 / 0.93 | 32.3 / 0.91 | 30.6 / 0.90 | 30.2 / 0.89 |
| EpicFlow | 34.2 / 0.95 | 33.9 / 0.94 | 31.3 / 0.92 | 31.0 / 0.92 |
| FlowNet2 | 34.0 / 0.94 | 33.8 / 0.94 | 31.8 / 0.92 | 31.7 / 0.92 |
| DVF | 35.8 / 0.95 | 35.4 / 0.95 | 32.7 / 0.93 | 32.2 / 0.92 |
| AdapSC | 36.2 / 0.95 | 36.4 / 0.96 | n/a | n/a |
| Ours-Gen | 36.0 / 0.95 | 35.5 / 0.95 | 32.8 / 0.93 | 32.2 / 0.92 |
| Ours | 35.8 / 0.95 | 35.2 / 0.95 | 32.4 / 0.93 | 31.9 / 0.92 |

Model compactness is notable, with 7.4M parameters and a memory footprint of 29.7MB (for $D=9$), significantly smaller than DVF and AdapSC.

Qualitative Observations and Limitations

MSFSN exhibits fewer warping artifacts than optical-flow approaches in scenarios with occlusion and large deformations, generating visually plausible, temporally coherent frames. The adversarial loss notably enhances sharpness and texture. In multi-frame interpolation (e.g., on KITTI), MSFSN surpasses two-stage synthesis approaches in preserving structural regularity.

User study results (42 participants) indicate a consistent preference for MSFSN outputs over those of FlowNet2, EpicFlow, and DVF, and parity with AdapSC.

Limitations include a tendency for results to appear slightly blurrier than correspondence-based approaches (e.g., AdapSC) when optical flow is reliable, and a restricted capacity to model highly non-linear motions due to the scalar parameterization of time.

6. Significance and Application Scope

MSFSN unifies interpolation and extrapolation within a single well-regularized model, dispensing with dataset- or task-specific retraining. Its pyramid design and weight-sharing mechanisms facilitate adaptation to a broad range of motions and video domains. The model achieves a favorable trade-off between visual fidelity, artifact minimization, and computational efficiency. Its ability to leverage arbitrary relative positions ($r_p \in \mathbb{R}$) enables continuous frame synthesis within and beyond the initially observed temporal window, which is critical for applications in frame upsampling, video restoration, and predictive video modeling.

7. Relation to Prior Work and Future Directions

Unlike classical interpolation/extrapolation techniques that depend on optical flow or explicit pixel correspondences, which often produce artifacts when flow estimation fails, MSFSN accomplishes frame synthesis in a direct, feedforward manner and shows robust performance across occlusions and large deformations (Hu et al., 2017). Compared to earlier autoencoder approaches, MSFSN's multi-scale parameterized design and transitive loss yield improved flexibility and temporal consistency.

A plausible implication is that future research may extend MSFSN's parameterization (e.g., higher-order time conditioning) or integrate explicit trajectory modeling to address its limitations with complex nonlinear motions. Model compression and domain adaptation could advance deployment in real-time or resource-constrained settings.
