
ReFlow Inpainting Model

Updated 17 January 2026
  • The paper introduces a novel video inpainting architecture that decouples spatial and temporal reasoning for enhanced restoration quality.
  • It leverages a dual-branch flow-fusion module and ConvLSTM to achieve robust temporal consistency, real-time performance, and an 80% parameter reduction over CombCN.
  • The model is computationally efficient and delivers state-of-the-art visual fidelity with markedly reduced flicker in dynamic scenes.

The ReFlow-based inpainting model is a frame-recurrent video inpainting architecture that decouples spatial and temporal reasoning by combining strong single-image inpainting, robust optical-flow generation, and sequence modeling via a ConvLSTM. Its crucial innovation is a robust flow-generation module that fuses flows derived from inpainted frames with flows computed directly from corrupted frames, enabling temporally consistent inpainting of videos with arbitrary spatial sizes and temporal lengths in real time. This approach achieves state-of-the-art results in both visual fidelity and temporal stability while remaining computationally and parameter efficient (Ding et al., 2019).

1. High-Level Architecture and Workflow

At each time step $t$, the model receives:

  • The raw frame $I_t$ containing arbitrary-shaped holes,
  • The previous inpainted, temporally smoothed output $O_{t-1}$.

Processing proceeds as follows:

  1. A strong image-inpainting network $\mathcal{H}_s$ (partial convolution, as in Liu et al.) produces a plausible but temporally incoherent prediction $P_t = \mathcal{H}_s(I_t)$.
  2. The robust, blended optical flow $F_{t-1,t}$ between the previous and current frame is computed through the flow-fusion module $\mathcal{H}_f$.
  3. Encoded representations of $O_{t-1}$ and $P_t$ are concatenated and input to a ConvLSTM $\mathcal{H}_t$ along with its previous hidden ($H_{t-1}$) and cell ($C_{t-1}$) states to update the hidden representation.
  4. A lightweight decoder produces a residual output, refining the previous output: $O_t = O_{t-1} + \mathrm{decode}(H_t)$.
  5. Because the architecture is fully convolutional and recurrent, it can process video sequences of arbitrary spatial resolution and temporal length in a single streaming pass (Ding et al., 2019).
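The per-frame control flow above can be sketched as a streaming loop. This is a minimal illustrative sketch, not the authors' implementation: frames are grayscale NumPy arrays, and all four modules (`inpaint_single`, `fused_flow`, `convlstm_step`, `decode`) are hypothetical trivial stand-ins for the trained networks $\mathcal{H}_s$, $\mathcal{H}_f$, $\mathcal{H}_t$, and the decoder, kept simple so the recurrence itself can run.

```python
import numpy as np

def inpaint_single(frame):                 # stand-in for H_s
    return frame

def fused_flow(prev_frame, frame):         # stand-in for the H_f pipeline
    return np.zeros(frame.shape + (2,))

def convlstm_step(x, h, c):                # stand-in for one H_t update
    return 0.5 * (x + h), c

def decode(h):                             # stand-in residual decoder
    return 0.0 * h

def inpaint_video(frames):
    """One streaming pass: at each step t, O_t = O_{t-1} + decode(H_t)."""
    h = np.zeros_like(frames[0])
    c = np.zeros_like(frames[0])
    prev_out = inpaint_single(frames[0])   # bootstrap from the first frame
    outputs = [prev_out]
    for t in range(1, len(frames)):
        p_t = inpaint_single(frames[t])            # step 1: spatial prediction
        _ = fused_flow(frames[t - 1], frames[t])   # step 2: blended flow
        x_t = 0.5 * (prev_out + p_t)               # stand-in for encoded concat
        h, c = convlstm_step(x_t, h, c)            # step 3: recurrent update
        prev_out = prev_out + decode(h)            # step 4: residual refinement
        outputs.append(prev_out)
    return outputs
```

Because nothing in the loop depends on fixed spatial dimensions or a fixed clip length, the same pass applies to any resolution and duration.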

ConvLSTM updates follow:

$$\begin{aligned}
i_t &= \sigma(W_{xi} \circledast X_t + W_{hi} \circledast H_{t-1} + W_{ci} \odot C_{t-1} + b_i),\\
f_t &= \sigma(W_{xf} \circledast X_t + W_{hf} \circledast H_{t-1} + W_{cf} \odot C_{t-1} + b_f),\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh(W_{xc} \circledast X_t + W_{hc} \circledast H_{t-1} + b_c),\\
o_t &= \sigma(W_{xo} \circledast X_t + W_{ho} \circledast H_{t-1} + W_{co} \odot C_t + b_o),\\
H_t &= o_t \odot \tanh(C_t),
\end{aligned}$$

where $\circledast$ denotes convolution and $\odot$ elementwise multiplication.
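These gate equations can be sketched directly in NumPy. A naive single-channel "same" convolution stands in for the learned kernels; the weight names mirror the symbols in the equations (this is an illustrative sketch under those assumptions, not the paper's implementation):

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padded 2-D convolution, single channel, odd kernel."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_cell(x, h_prev, c_prev, W, b):
    """One peephole ConvLSTM step following the equations above.
    W maps 'xi', 'hi', ... to 3x3 kernels and 'ci', 'cf', 'co' to
    elementwise peephole weights; b maps 'i', 'f', 'c', 'o' to biases."""
    cv = conv2d_same
    i = sigmoid(cv(x, W['xi']) + cv(h_prev, W['hi']) + W['ci'] * c_prev + b['i'])
    f = sigmoid(cv(x, W['xf']) + cv(h_prev, W['hf']) + W['cf'] * c_prev + b['f'])
    c = f * c_prev + i * np.tanh(cv(x, W['xc']) + cv(h_prev, W['hc']) + b['c'])
    o = sigmoid(cv(x, W['xo']) + cv(h_prev, W['ho']) + W['co'] * c + b['o'])
    h = o * np.tanh(c)
    return h, c
```

Note that the peephole terms use elementwise multiplication ($\odot$), not convolution, matching the $W_{ci} \odot C_{t-1}$ terms above.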

2. Robust Flow Generation and Fusion

Direct optical flow estimation between hole-ridden frames is unreliable. The framework introduces a dual-branch flow inference and fusion:

  1. Flow from Inpainted Frames: Optical flow $F^P_{t-1,t}$ is estimated between inpainted frames $P_{t-1}$ and $P_t$ using an off-the-shelf estimator (e.g., FlowNet2). This yields smooth flow but may propagate hallucinations or artifacts present in $P$.
  2. Flow Completion Branch: Flow $\hat{F}^I_{t-1,t}$ is estimated directly between the corrupted frames $I_{t-1}$ and $I_t$, then inpainted within missing regions by a small U-Net ($\mathcal{H}_c$), producing $F^I_{t-1,t}$. This respects real flow statistics but may suffer from seams at boundaries.
  3. Flow Blending Network $\mathcal{H}_f$: A 6-layer U-Net fuses $F^P_{t-1,t}$ and $F^I_{t-1,t}$ via a learned residual-blended average:

$$F_{t-1,t} = \frac{1}{2}\left[F^P_{t-1,t} + F^I_{t-1,t} + \mathcal{H}_f(F^P_{t-1,t}, F^I_{t-1,t})\right]$$

The blending network’s output is supervised by an $L_1$ loss against ground-truth flow $G^F_{t-1,t}$:

$$L_f = \sum_{t=1}^T \|G^F_{t-1,t} - F_{t-1,t}\|_1$$

This scheme isolates flow-estimation errors caused by holes, providing robust inter-frame motion guidance.
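The residual-blended average and its $L_1$ supervision reduce to a few lines of NumPy. In this sketch, `residual_net` is a hypothetical stand-in for the U-Net $\mathcal{H}_f$ (a zero function in the usage example); flows are $(H, W, 2)$ arrays:

```python
import numpy as np

def fuse_flows(flow_p, flow_i, residual_net):
    """Residual-blended average: F = (1/2)[F^P + F^I + H_f(F^P, F^I)]."""
    residual = residual_net(flow_p, flow_i)
    return 0.5 * (flow_p + flow_i + residual)

def flow_l1_loss(fused_flows, gt_flows):
    """L_f: summed L1 distance between fused and ground-truth flows."""
    return sum(np.abs(g - f).sum() for f, g in zip(fused_flows, gt_flows))
```

With a zero residual the fusion degenerates to a plain average of the two branch flows; the learned residual is what lets the network correct branch-specific artifacts.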

3. Objective Functions

The overall training objective is a weighted sum of six terms:

$$\mathcal{L} = \lambda_d L_d + \lambda_p L_p + \lambda_s L_s + \lambda_r L_r + \lambda_\ell L_\ell + \lambda_f L_f$$

with the authors’ weighting scheme $\{\lambda_s, \lambda_d, \lambda_r\} : \{\lambda_f, \lambda_p, \lambda_\ell\} = 10 : 1$, subject to further manual tuning.

Defined losses include:

  • Spatial Reconstruction:

$$L_d = \sum_{t=1}^T M_t \|O_t - G_t\|_1$$

where $M_t$ masks out non-hole regions.

  • Perceptual Loss:

$$L_p = \frac{1}{N}\sum_{l=1}^N \sum_{t=1}^T \|\Phi_l(O_t) - \Phi_l(G_t)\|_1$$

  • Short-term Temporal Consistency:

$$L_s = \sum_{t=2}^T M_t \|O_t - \mathrm{Warp}(O_{t-1}, F_{t-1,t})\|_1$$

and likewise in reverse: $L_r$.

  • Long-term Consistency:

$$L_\ell = \sum_{t=1}^T M_t \left( \|O_1 - \mathrm{Warp}(O_t, F_{t,1})\|_1 + \|O_T - \mathrm{Warp}(O_t, F_{t,T})\|_1 \right)$$

No adversarial loss is employed.
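The masked reconstruction and warp-based consistency terms can be sketched in NumPy. This is an illustrative sketch under simplifying assumptions: frames are grayscale $(H, W)$ arrays, and a crude nearest-neighbour warp stands in for the differentiable bilinear warping used in practice:

```python
import numpy as np

def warp_nn(img, flow):
    """Nearest-neighbour backward warp of img (H, W) by flow (H, W, 2);
    a crude stand-in for bilinear warping."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    return img[src_y, src_x]

def spatial_loss(outs, gts, masks):
    """L_d: masked L1 between outputs O_t and ground truth G_t."""
    return sum((m * np.abs(o - g)).sum() for o, g, m in zip(outs, gts, masks))

def short_term_loss(outs, flows, masks):
    """L_s: masked L1 between O_t and O_{t-1} warped by F_{t-1,t}."""
    return sum((masks[t] * np.abs(outs[t] - warp_nn(outs[t - 1], flows[t - 1]))).sum()
               for t in range(1, len(outs)))

def total_loss(terms, weights):
    """Weighted sum of named loss terms, e.g. 10:1 for {L_s, L_d, L_r}
    versus {L_f, L_p, L_l}."""
    return sum(weights[k] * terms[k] for k in terms)
```

The reverse short-term term $L_r$ and the long-term term $L_\ell$ follow the same pattern with different frame pairings and flow directions.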

4. Training Strategy and Implementation

Training proceeds in two phases:

  1. Pre-train the single-image inpainter $\mathcal{H}_s$ and flow-completion network $\mathcal{H}_c$ separately. $\mathcal{H}_s$ is optimized on static images with incomplete regions; $\mathcal{H}_c$ on incomplete flow data.
  2. Freeze $\mathcal{H}_s$ and $\mathcal{H}_c$. Jointly train the flow-fusion network $\mathcal{H}_f$ and the ConvLSTM-based temporal smoother $\mathcal{H}_t$.

Implementation details:

  • Adam optimizer with learning rate $10^{-4}$.
  • At test time: $128 \times 128$ on 32-frame clips; real-time $256 \times 256$ at 30 FPS on a Titan Xp.
  • Data augmentation: random cropping/rotation for DAVIS+VIDEVO; center cropping for FaceForensics.
  • Batch size and epoch count are not specified in the report.

The fully convolutional and recurrent structure allows for arbitrary spatial and temporal scales during inference.

5. Quantitative and Qualitative Evaluation

Evaluation employs:

  • Datasets: FaceForensics (1,004 facial clips), DAVIS+VIDEVO (190 wild-object clips).
  • Masks: fixed rectangle, random rectangles per frame, random walker paths.
  • Metric: frame-wise $L_1$ error.

Comparison with CombCN (Wang et al. 2018) as the reference end-to-end video-inpainting CNN shows:

Dataset         CombCN $L_1$   ReFlow $L_1$
FaceForensics        8.20           6.10
DAVIS+VIDEVO        40.87          13.89

  • Inference time: 23.3 ms/frame (ReFlow) vs. 36.1 ms/frame (CombCN), approximately 35% faster.
  • Parameters: 80% reduction over CombCN.

Qualitatively, outputs show sharper details and substantially reduced flicker, especially with moving objects or dynamically changing masks. No adversarial training is used; visual quality derives from loss structure and architecture (Ding et al., 2019).

6. Significance and Context in Video Inpainting

The ReFlow-based inpainting model represents a marked shift from direct spatio-temporal “all-at-once” completion to a hybrid approach where spatial and temporal consistency are handled by specialized modules—single-image inpainting and robust, flow-guided ConvLSTM smoothing. The two-branch flow-fusion module is architected to address the unreliability of flow estimation in the presence of large, arbitrarily shaped holes, which is a persistent challenge for video inpainting.

A plausible implication is that decoupling spatial and temporal modeling via this architecture facilitates scale-out to very large and long video streams, with real-time performance and strong generalization. The methodology demonstrates that competitive video inpainting performance can be achieved without adversarial losses or bulky, monolithic networks, and strongly motivates the use of modular, robust flow-guided temporal models in future video restoration research (Ding et al., 2019).
