ReFlow Inpainting Model
- The paper introduces a novel video inpainting architecture that decouples spatial and temporal reasoning for enhanced restoration quality.
- It leverages a dual-branch flow-fusion module and ConvLSTM to achieve robust temporal consistency, real-time performance, and an 80% parameter reduction over CombCN.
- The model is computationally efficient and delivers state-of-the-art visual fidelity with markedly reduced flicker in dynamic scenes.
A ReFlow-based Inpainting Model defines a frame-recurrent video inpainting architecture that decouples spatial and temporal reasoning through interplay between strong single-image inpainting, robust optical flow generation, and sequence modeling via a ConvLSTM. The model’s crucial innovation is its robust flow-generation module, which fuses flows derived from both inpainted frames and directly from corrupted frames, enabling temporally consistent inpainting in videos with arbitrary spatial sizes and temporal lengths in real time. This approach achieves state-of-the-art results in both visual fidelity and temporal stability, while remaining computationally efficient and parameter-efficient (Ding et al., 2019).
1. High-Level Architecture and Workflow
At each time step $t$, the model receives:
- the raw frame $X_t$ containing arbitrary-shaped holes, and
- the previous inpainted, temporally smoothed output $Y_{t-1}$.
Processing proceeds as follows:
- A strong image-inpainting network $G_{inp}$ (partial convolution, as in Liu et al.) produces a plausible but temporally incoherent prediction $\tilde{Y}_t$.
- The robust, blended optical flow $F_t$ between the previous and current frames is computed by the flow-fusion module.
- Encoded representations of $\tilde{Y}_t$ and $F_t$ are concatenated and fed to a ConvLSTM along with its previous hidden ($h_{t-1}$) and cell ($c_{t-1}$) states to update the hidden representation.
- A lightweight decoder produces a residual $R_t$ that refines the flow-warped previous output into the final frame: $Y_t = \mathcal{W}(Y_{t-1}, F_t) + R_t$, where $\mathcal{W}$ denotes warping by the blended flow.
- Because the architecture is fully convolutional and recurrent, it can process video sequences of arbitrary spatial resolution and temporal length in a single streaming pass (Ding et al., 2019).
ConvLSTM updates follow the standard formulation:

$$i_t = \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o)$$
$$g_t = \tanh(W_{xg} * x_t + W_{hg} * h_{t-1} + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $*$ denotes convolution and $\odot$ is elementwise multiplication.
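As a concrete reference, the gate updates above can be sketched in plain NumPy. This is a minimal, loop-based sketch for clarity only; the stacked-kernel layout and gate ordering are illustrative choices, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d(x, w):
    """'Same'-padded 2-D cross-correlation, channels-first: (C_in,H,W) -> (C_out,H,W)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wid = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wid))
    for o in range(c_out):
        for i in range(c_in):
            for di in range(k):
                for dj in range(k):
                    out[o] += w[o, i, di, dj] * xp[i, di:di + h, dj:dj + wid]
    return out

def convlstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One ConvLSTM update; Wx/Wh stack the i, f, o, g gate kernels along C_out."""
    gates = conv2d(x, Wx) + conv2d(h_prev, Wh) + b[:, None, None]
    n = h_prev.shape[0]  # hidden channels
    i = sigmoid(gates[0 * n:1 * n])
    f = sigmoid(gates[1 * n:2 * n])
    o = sigmoid(gates[2 * n:3 * n])
    g = np.tanh(gates[3 * n:4 * n])
    c = f * c_prev + i * g        # cell update
    h = o * np.tanh(c)            # hidden update
    return h, c
```

Because every operation is a convolution, the step applies unchanged to any spatial resolution, which is what lets the recurrent model stream clips of arbitrary size.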
2. Robust Flow Generation and Fusion
Direct optical flow estimation between hole-ridden frames is unreliable. The framework introduces a dual-branch flow inference and fusion:
- Flow from Inpainted Frames: Optical flow $F_t^{p}$ is estimated between the inpainted frames $\tilde{Y}_{t-1}$ and $\tilde{Y}_t$ using an off-the-shelf estimator (e.g., FlowNet2). This yields smooth flow but may propagate hallucinations/artifacts present in the inpainted frames.
- Flow Completion Branch: Flow is estimated directly between the corrupted frames $X_{t-1}$ and $X_t$, then inpainted within the missing regions by a small U-Net $G_{flow}$, producing $F_t^{c}$. This respects real flow statistics, but may suffer from seams at hole boundaries.
- Flow Blending Network: A 6-layer U-Net $G_{blend}$ fuses $F_t^{p}$ and $F_t^{c}$ via a learned residual-blended average:

$$F_t = \tfrac{1}{2}\left(F_t^{p} + F_t^{c}\right) + G_{blend}\!\left(F_t^{p}, F_t^{c}\right).$$

The blending network's output is supervised by an $L_1$ loss against the ground-truth flow $F_t^{gt}$:

$$\mathcal{L}_{flow} = \left\lVert F_t - F_t^{gt} \right\rVert_1.$$
This scheme isolates flow estimation errors due to holes, achieving robust inter-frame motion guidance.
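The fusion step itself reduces to simple arithmetic once the two candidate flows and the network's residual are in hand. A minimal NumPy sketch, assuming the residual-blended-average reading of the fusion rule above (function names are illustrative, not from the authors' code):

```python
import numpy as np

def blend_flows(flow_pred, flow_comp, residual):
    """Residual-blended average of the two candidate flows.

    flow_pred : (H, W, 2) flow estimated between the inpainted frames
    flow_comp : (H, W, 2) completed flow estimated from the corrupted frames
    residual  : (H, W, 2) correction predicted by the blending U-Net
    """
    return 0.5 * (flow_pred + flow_comp) + residual

def flow_l1_loss(flow, flow_gt):
    """Mean L1 supervision of the blended flow against ground truth."""
    return float(np.abs(flow - flow_gt).mean())
```

A zero residual recovers the plain average of the two branches, so the network only has to learn corrections where the branches disagree.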
3. Objective Functions
The overall training objective is a weighted sum of six terms,

$$\mathcal{L} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{perc}\mathcal{L}_{perc} + \lambda_{st}\mathcal{L}_{st} + \lambda_{st'}\mathcal{L}_{st'} + \lambda_{lt}\mathcal{L}_{lt} + \lambda_{flow}\mathcal{L}_{flow},$$

with the authors' weighting scheme set empirically and subject to further manual tuning.
Defined losses include:
- Spatial Reconstruction:
$$\mathcal{L}_{rec} = \left\lVert M \odot \left(Y_t - Y_t^{gt}\right) \right\rVert_1,$$
where the binary mask $M$ masks out non-hole regions, so the loss is evaluated on hole pixels.
- Perceptual Loss (VGG feature space):
$$\mathcal{L}_{perc} = \sum_{k} \left\lVert \phi_k(Y_t) - \phi_k(Y_t^{gt}) \right\rVert_1,$$
where $\phi_k$ denotes the $k$-th VGG feature map.
- Short-term Temporal Consistency:
$$\mathcal{L}_{st} = \left\lVert Y_t - \mathcal{W}\!\left(Y_{t-1}, F_t\right) \right\rVert_1,$$
and likewise in reverse, $\mathcal{L}_{st'}$, using the backward flow from frame $t$ to frame $t-1$.
- Long-term Consistency: the analogous warped $L_1$ penalty $\mathcal{L}_{lt}$, computed against a temporally distant frame of the clip rather than the immediate predecessor.
No adversarial loss is employed.
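The two workhorse terms, masked spatial reconstruction and warped temporal consistency, can be sketched directly. This is a simplified NumPy sketch for single-channel frames; the nearest-neighbor warp stands in for the bilinear warping a real implementation would use, and all function names are illustrative:

```python
import numpy as np

def masked_l1(pred, target, hole_mask):
    """Spatial reconstruction loss restricted to hole pixels (hole_mask == 1)."""
    diff = np.abs(pred - target) * hole_mask
    return float(diff.sum() / max(hole_mask.sum(), 1))

def warp(frame, flow):
    """Backward-warp `frame` (H, W) by `flow` (H, W, 2); nearest-neighbor for brevity."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    return frame[src_y, src_x]

def short_term_loss(y_t, y_prev, flow):
    """L1 between the current output and the flow-warped previous output."""
    return float(np.abs(y_t - warp(y_prev, flow)).mean())
```

With zero flow the warp is the identity, so a perfectly static, perfectly reconstructed sequence incurs zero temporal penalty, which is the behavior the consistency terms are designed to reward.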
4. Training Strategy and Implementation
Training proceeds in two phases:
- Pre-train the single-image inpainter $G_{inp}$ and the flow-completion network $G_{flow}$ separately: $G_{inp}$ is optimized on static images with incomplete regions; $G_{flow}$ on incomplete flow data.
- Freeze $G_{inp}$ and $G_{flow}$. Jointly train the flow-fusion network $G_{blend}$ and the ConvLSTM-based temporal smoother.
Implementation details:
- Adam optimizer.
- At test time, the model runs on 32-frame clips in real time (30 FPS) on a Titan Xp.
- Data augmentation: random cropping/rotation for DAVIS+VIDEVO; center cropping for FaceForensics.
- Batch size and epoch count are not specified in the report.
The fully convolutional and recurrent structure allows for arbitrary spatial and temporal scales during inference.
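The two-phase recipe can be expressed as a simple trainability schedule. A minimal sketch; the module labels ("G_inp", "G_flow", "G_blend", "convlstm") are illustrative stand-ins, not identifiers from the authors' code:

```python
def two_phase_schedule(modules):
    """Yield (phase name, {module: trainable?}) pairs for the two training phases."""
    # Phase 1: pre-train the image inpainter and the flow-completion net separately.
    yield "pretrain", {m: m in ("G_inp", "G_flow") for m in modules}
    # Phase 2: freeze both; jointly train the flow blender and the ConvLSTM smoother.
    yield "joint", {m: m in ("G_blend", "convlstm") for m in modules}

modules = ("G_inp", "G_flow", "G_blend", "convlstm")
plan = dict(two_phase_schedule(modules))
```

Freezing the pre-trained modules in phase 2 keeps the temporal components from destabilizing the already-converged spatial inpainter and flow completion.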
5. Quantitative and Qualitative Evaluation
Evaluation employs:
- Datasets: FaceForensics (1,004 facial clips), DAVIS+VIDEVO (190 wild-object clips).
- Masks: fixed rectangle, random rectangles per frame, random walker paths.
- Metric: frame-wise $L_1$ error.
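The frame-wise $L_1$ metric can be computed as below. This is one common definition (mean absolute error per frame, averaged over the clip); the exact normalization behind the scale of the reported numbers is not specified in the source:

```python
import numpy as np

def framewise_l1(pred_clip, gt_clip):
    """Mean per-frame L1 error over a clip; arrays shaped (T, H, W, C)."""
    t = pred_clip.shape[0]
    per_frame = np.abs(pred_clip - gt_clip).reshape(t, -1).mean(axis=1)
    return float(per_frame.mean())
```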
Comparison with CombCN (Wang et al., 2018) as the reference end-to-end video-inpainting CNN shows the following frame-wise $L_1$ errors (lower is better):
| Dataset | CombCN | ReFlow |
|---|---|---|
| FaceForensics | 8.20 | 6.10 |
| DAVIS+VIDEVO | 40.87 | 13.89 |
- Inference time: $23.3$ ms/frame (ReFlow) vs. $36.1$ ms/frame (CombCN), approximately $1.5\times$ faster.
- Parameters: an 80% reduction over CombCN.
Qualitatively, outputs show sharper details and substantially reduced flicker, especially with moving objects or dynamically changing masks. No adversarial training is used; visual quality derives from loss structure and architecture (Ding et al., 2019).
6. Significance and Context in Video Inpainting
The ReFlow-based inpainting model represents a marked shift from direct spatio-temporal “all-at-once” completion to a hybrid approach where spatial and temporal consistency are handled by specialized modules—single-image inpainting and robust, flow-guided ConvLSTM smoothing. The two-branch flow-fusion module is architected to address the unreliability of flow estimation in the presence of large, arbitrarily shaped holes, which is a persistent challenge for video inpainting.
A plausible implication is that decoupling spatial and temporal modeling via this architecture facilitates scale-out to very large and long video streams, with real-time performance and strong generalization. The methodology demonstrates that competitive video inpainting performance can be achieved without adversarial losses or bulky, monolithic networks, and strongly motivates the use of modular, robust flow-guided temporal models in future video restoration research (Ding et al., 2019).