ReFlow Inpainting Model
- The paper introduces a novel video inpainting architecture that decouples spatial and temporal reasoning for enhanced restoration quality.
- It leverages a dual-branch flow-fusion module and ConvLSTM to achieve robust temporal consistency, real-time performance, and an 80% parameter reduction over CombCN.
- The model is computationally efficient and delivers state-of-the-art visual fidelity with markedly reduced flicker in dynamic scenes.
A ReFlow-based Inpainting Model defines a frame-recurrent video inpainting architecture that decouples spatial and temporal reasoning through interplay between strong single-image inpainting, robust optical flow generation, and sequence modeling via a ConvLSTM. The model’s crucial innovation is its robust flow-generation module, which fuses flows derived from both inpainted frames and directly from corrupted frames, enabling temporally consistent inpainting in videos with arbitrary spatial sizes and temporal lengths in real time. This approach achieves state-of-the-art results in both visual fidelity and temporal stability, while remaining computationally efficient and parameter-efficient (Ding et al., 2019).
1. High-Level Architecture and Workflow
At each time step $t$, the model receives:
- the raw frame $X_t$ containing arbitrary-shaped holes, and
- the previous inpainted, temporally smoothed output $Y_{t-1}$.
Processing proceeds as follows:
- A strong image-inpainting network $G_{inp}$ (partial convolution, as in Liu et al.) produces a plausible but temporally incoherent prediction $\tilde{Y}_t$.
- The robust, blended optical flow $F_t$ between the previous and current frames is computed by the flow-fusion module.
- Encoded representations of $\tilde{Y}_t$ and $F_t$ are concatenated and fed to a ConvLSTM along with its previous hidden ($h_{t-1}$) and cell ($c_{t-1}$) states to update the hidden representation.
- A lightweight decoder produces a residual $R_t$ that refines the flow-warped previous output into the final frame: $Y_t = \mathcal{W}(Y_{t-1}, F_t) + R_t$, where $\mathcal{W}$ denotes warping by the blended flow.
- Because the architecture is fully convolutional and recurrent, it can process video sequences of arbitrary spatial resolution and temporal length in a single streaming pass (Ding et al., 2019).
ConvLSTM updates follow the standard formulation:

$$i_t = \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o)$$
$$g_t = \tanh(W_{xg} * x_t + W_{hg} * h_{t-1} + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $*$ denotes convolution and $\odot$ is elementwise multiplication.
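As a concrete reference, the gate updates above can be sketched in plain NumPy. This is a minimal, loop-based sketch for clarity only; the stacked-kernel layout and gate ordering are illustrative choices, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d(x, w):
    """'Same'-padded 2-D cross-correlation, channels-first: (C_in,H,W) -> (C_out,H,W)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wid = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wid))
    for o in range(c_out):
        for i in range(c_in):
            for di in range(k):
                for dj in range(k):
                    out[o] += w[o, i, di, dj] * xp[i, di:di + h, dj:dj + wid]
    return out

def convlstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One ConvLSTM update; Wx/Wh stack the i, f, o, g gate kernels along C_out."""
    gates = conv2d(x, Wx) + conv2d(h_prev, Wh) + b[:, None, None]
    n = h_prev.shape[0]  # hidden channels
    i = sigmoid(gates[0 * n:1 * n])
    f = sigmoid(gates[1 * n:2 * n])
    o = sigmoid(gates[2 * n:3 * n])
    g = np.tanh(gates[3 * n:4 * n])
    c = f * c_prev + i * g        # cell update
    h = o * np.tanh(c)            # hidden update
    return h, c
```

Because every operation is a convolution, the step applies unchanged to any spatial resolution, which is what lets the recurrent model stream clips of arbitrary size.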
2. Robust Flow Generation and Fusion
Direct optical flow estimation between hole-ridden frames is unreliable. The framework introduces a dual-branch flow inference and fusion:
- Flow from Inpainted Frames: Optical flow $F_t^{p}$ is estimated between the inpainted frames $\tilde{Y}_{t-1}$ and $\tilde{Y}_t$ using an off-the-shelf estimator (e.g., FlowNet2). This yields smooth flow but may propagate hallucinations/artifacts present in the inpainted frames.
- Flow Completion Branch: Flow is estimated directly between the corrupted frames $X_{t-1}$ and $X_t$, then inpainted within the missing regions by a small U-Net $G_{flow}$, producing $F_t^{c}$. This respects real flow statistics, but may suffer from seams at hole boundaries.
- Flow Blending Network: A 6-layer U-Net $G_{blend}$ fuses $F_t^{p}$ and $F_t^{c}$ via a learned residual-blended average:

$$F_t = \tfrac{1}{2}\left(F_t^{p} + F_t^{c}\right) + G_{blend}\!\left(F_t^{p}, F_t^{c}\right).$$

The blending network's output is supervised by an $L_1$ loss against the ground-truth flow $F_t^{gt}$:

$$\mathcal{L}_{flow} = \left\lVert F_t - F_t^{gt} \right\rVert_1.$$
This scheme isolates flow estimation errors due to holes, achieving robust inter-frame motion guidance.
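The fusion step itself reduces to simple arithmetic once the two candidate flows and the network's residual are in hand. A minimal NumPy sketch, assuming the residual-blended-average reading of the fusion rule above (function names are illustrative, not from the authors' code):

```python
import numpy as np

def blend_flows(flow_pred, flow_comp, residual):
    """Residual-blended average of the two candidate flows.

    flow_pred : (H, W, 2) flow estimated between the inpainted frames
    flow_comp : (H, W, 2) completed flow estimated from the corrupted frames
    residual  : (H, W, 2) correction predicted by the blending U-Net
    """
    return 0.5 * (flow_pred + flow_comp) + residual

def flow_l1_loss(flow, flow_gt):
    """Mean L1 supervision of the blended flow against ground truth."""
    return float(np.abs(flow - flow_gt).mean())
```

A zero residual recovers the plain average of the two branches, so the network only has to learn corrections where the branches disagree.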
3. Objective Functions
The overall training objective is a weighted sum of six terms,

$$\mathcal{L} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{perc}\mathcal{L}_{perc} + \lambda_{st}\mathcal{L}_{st} + \lambda_{st'}\mathcal{L}_{st'} + \lambda_{lt}\mathcal{L}_{lt} + \lambda_{flow}\mathcal{L}_{flow},$$

with the authors' weighting scheme set empirically and subject to further manual tuning.
Defined losses include:
- Spatial Reconstruction:
$$\mathcal{L}_{rec} = \left\lVert M \odot \left(Y_t - Y_t^{gt}\right) \right\rVert_1,$$
where the binary mask $M$ masks out non-hole regions, so the loss is evaluated on hole pixels.
- Perceptual Loss (VGG feature space):
$$\mathcal{L}_{perc} = \sum_{k} \left\lVert \phi_k(Y_t) - \phi_k(Y_t^{gt}) \right\rVert_1,$$
where $\phi_k$ denotes the $k$-th VGG feature map.
- Short-term Temporal Consistency:
$$\mathcal{L}_{st} = \left\lVert Y_t - \mathcal{W}\!\left(Y_{t-1}, F_t\right) \right\rVert_1,$$
and likewise in reverse, $\mathcal{L}_{st'}$, using the backward flow from frame $t$ to frame $t-1$.
- Long-term Consistency: the analogous warped $L_1$ penalty $\mathcal{L}_{lt}$, computed against a temporally distant frame of the clip rather than the immediate predecessor.
No adversarial loss is employed.
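The two workhorse terms, masked spatial reconstruction and warped temporal consistency, can be sketched directly. This is a simplified NumPy sketch for single-channel frames; the nearest-neighbor warp stands in for the bilinear warping a real implementation would use, and all function names are illustrative:

```python
import numpy as np

def masked_l1(pred, target, hole_mask):
    """Spatial reconstruction loss restricted to hole pixels (hole_mask == 1)."""
    diff = np.abs(pred - target) * hole_mask
    return float(diff.sum() / max(hole_mask.sum(), 1))

def warp(frame, flow):
    """Backward-warp `frame` (H, W) by `flow` (H, W, 2); nearest-neighbor for brevity."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    return frame[src_y, src_x]

def short_term_loss(y_t, y_prev, flow):
    """L1 between the current output and the flow-warped previous output."""
    return float(np.abs(y_t - warp(y_prev, flow)).mean())
```

With zero flow the warp is the identity, so a perfectly static, perfectly reconstructed sequence incurs zero temporal penalty, which is the behavior the consistency terms are designed to reward.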
4. Training Strategy and Implementation
Training proceeds in two phases:
- Pre-train the single-image inpainter $G_{inp}$ and the flow-completion network $G_{flow}$ separately: $G_{inp}$ is optimized on static images with incomplete regions; $G_{flow}$ on incomplete flow data.
- Freeze $G_{inp}$ and $G_{flow}$. Jointly train the flow-fusion network $G_{blend}$ and the ConvLSTM-based temporal smoother.
Implementation details:
- Adam optimizer.
- At test time, the model runs on 32-frame clips in real time (30 FPS) on a Titan Xp.
- Data augmentation: random cropping/rotation for DAVIS+VIDEVO; center cropping for FaceForensics.
- Batch size and epoch count are not specified in the report.
The fully convolutional and recurrent structure allows for arbitrary spatial and temporal scales during inference.
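The two-phase recipe can be expressed as a simple trainability schedule. A minimal sketch; the module labels ("G_inp", "G_flow", "G_blend", "convlstm") are illustrative stand-ins, not identifiers from the authors' code:

```python
def two_phase_schedule(modules):
    """Yield (phase name, {module: trainable?}) pairs for the two training phases."""
    # Phase 1: pre-train the image inpainter and the flow-completion net separately.
    yield "pretrain", {m: m in ("G_inp", "G_flow") for m in modules}
    # Phase 2: freeze both; jointly train the flow blender and the ConvLSTM smoother.
    yield "joint", {m: m in ("G_blend", "convlstm") for m in modules}

modules = ("G_inp", "G_flow", "G_blend", "convlstm")
plan = dict(two_phase_schedule(modules))
```

Freezing the pre-trained modules in phase 2 keeps the temporal components from destabilizing the already-converged spatial inpainter and flow completion.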
5. Quantitative and Qualitative Evaluation
Evaluation employs:
- Datasets: FaceForensics (1,004 facial clips), DAVIS+VIDEVO (190 wild-object clips).
- Masks: fixed rectangle, random rectangles per frame, random walker paths.
- Metric: frame-wise $L_1$ error.
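The frame-wise $L_1$ metric can be computed as below. This is one common definition (mean absolute error per frame, averaged over the clip); the exact normalization behind the scale of the reported numbers is not specified in the source:

```python
import numpy as np

def framewise_l1(pred_clip, gt_clip):
    """Mean per-frame L1 error over a clip; arrays shaped (T, H, W, C)."""
    t = pred_clip.shape[0]
    per_frame = np.abs(pred_clip - gt_clip).reshape(t, -1).mean(axis=1)
    return float(per_frame.mean())
```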
Comparison with CombCN (Wang et al., 2018) as the reference end-to-end video-inpainting CNN shows the following frame-wise $L_1$ errors (lower is better):
| Dataset | CombCN | ReFlow |
|---|---|---|
| FaceForensics | 8.20 | 6.10 |
| DAVIS+VIDEVO | 40.87 | 13.89 |
- Inference time: $23.3$ ms/frame (ReFlow) vs. $36.1$ ms/frame (CombCN), approximately $1.5\times$ faster.
- Parameters: an 80% reduction over CombCN.
Qualitatively, outputs show sharper details and substantially reduced flicker, especially with moving objects or dynamically changing masks. No adversarial training is used; visual quality derives from loss structure and architecture (Ding et al., 2019).
6. Significance and Context in Video Inpainting
The ReFlow-based inpainting model represents a marked shift from direct spatio-temporal “all-at-once” completion to a hybrid approach where spatial and temporal consistency are handled by specialized modules—single-image inpainting and robust, flow-guided ConvLSTM smoothing. The two-branch flow-fusion module is architected to address the unreliability of flow estimation in the presence of large, arbitrarily shaped holes, which is a persistent challenge for video inpainting.
A plausible implication is that decoupling spatial and temporal modeling via this architecture facilitates scale-out to very large and long video streams, with real-time performance and strong generalization. The methodology demonstrates that competitive video inpainting performance can be achieved without adversarial losses or bulky, monolithic networks, and strongly motivates the use of modular, robust flow-guided temporal models in future video restoration research (Ding et al., 2019).