Dual Up-Sample Block Architecture
- Dual up-sample block architecture is a modular method for high-fidelity spatial upsampling that decomposes a large-ratio task into two sequential 2× stages to reduce artifacts and enhance detail.
- It integrates advanced alignment modules, learnable similarity metrics, and cross-stage detail propagation, enabling efficient video intra prediction and dense feature reconstruction.
- Empirical results demonstrate significant performance gains and BD-rate savings over traditional interpolation techniques and single-stage upsamplers.
A dual up-sample block architecture is a modular paradigm for high-fidelity up-sampling of spatial image or feature map representations, employing two sequential up-sampling stages rather than a single large-ratio mapping. It is increasingly central in codec-augmented neural networks for video intra prediction and in direct high-ratio feature upsampling for dense prediction. Dual up-sample blocks integrate advanced alignment modules, learnable similarity metrics, and explicit cross-stage detail propagation, enabling substantial performance gains over both traditional interpolation and single-stage neural upsamplers (Li et al., 2017, Zhou et al., 2024).
1. Architectural Principles and Dual-Stage Mechanism
Dual up-sample architectures decompose a large-ratio up-sampling task (e.g., 4× spatial upscaling) into two cascaded 2× stages. This decomposition is motivated by both practical and algorithmic considerations: localized context aggregation, reduced aliasing/artifacts, and the ability to employ finer boundary refinement at the second stage.
In intra-frame coding for video (HEVC setting), the dual up-sample scheme is realized as follows (Li et al., 2017):
- First Stage: Each Coding Tree Unit (CTU) is down-sampled, coded, and immediately up-sampled using only available boundaries (top, left) to restore it to reference size for subsequent predictions. Boundary pixels unavailable at this point are padded or replicated as needed.
- Second Stage: After all CTUs are coded, each is up-sampled again, now with access to all surrounding context, allowing for correction of boundary inconsistencies and artifact suppression along bottom/right edges.
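The two-pass control flow above can be sketched as follows. This is a minimal illustration, not the codec implementation: the function names are hypothetical, nearest-neighbor repetition stands in for the learned upsampler, and entropy coding/quantization are omitted.

```python
import numpy as np

def upsample_2x(block):
    # Stand-in for the learned 2x upsampler (nearest-neighbor here).
    return np.repeat(np.repeat(block, 2, axis=0), 2, axis=1)

def code_frame_two_pass(frame, ctu=8):
    """Hypothetical sketch of the dual up-sample scheme: each CTU is
    down-sampled, 'coded', and up-sampled in-loop (pass 1); after the whole
    frame is processed, every CTU is up-sampled again with full surrounding
    context available (pass 2)."""
    h, w = frame.shape
    recon = np.zeros_like(frame, dtype=float)
    # Pass 1: in-loop, raster order; only top/left neighbors exist so far.
    for y in range(0, h, ctu):
        for x in range(0, w, ctu):
            low = frame[y:y + ctu, x:x + ctu][::2, ::2]     # 2x down-sample
            recon[y:y + ctu, x:x + ctu] = upsample_2x(low)  # restore for prediction
    # Pass 2: out-of-loop; full context allows bottom/right boundary cleanup.
    refined = recon.copy()
    for y in range(0, h, ctu):
        for x in range(0, w, ctu):
            low = recon[y:y + ctu, x:x + ctu][::2, ::2]
            refined[y:y + ctu, x:x + ctu] = upsample_2x(low)
    return refined
```

In the actual codec the second pass would consult neighboring CTUs to correct boundary pixels; here both passes use the same toy upsampler purely to show the ordering of the two stages.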
In feature upsampling for dense prediction, a generic dual up-sample block can be constructed by composing two single-stage high-fidelity upsampling modules, e.g., two sequential ReSFU (Refreshed Similarity-based Feature Upsampler) modules (Zhou et al., 2024). Guidance features and alignment modules can be tailored in each stage to match context granularity.
2. Deep Neural Up-sampling Modules: Core Design
Two representative dual up-sample block types are prominent:
Table: Comparison of Dual Up-sample Block Instantiations
| Architecture | Context | Stage 1 Operation | Stage 2 Operation |
|---|---|---|---|
| CNN up-sampling + boundary refinement | Codec (HEVC) | In-loop up-sample per CTU | Out-of-loop framewise up-sample/cleanup |
| ReSFU / ReSFU-dual | Deep learning | 2× feature up-sample | 2× feature up-sample with full context |
CNN for Intra Coding (Li et al., 2017):
- A five-layer CNN comprising multi-scale feature extraction, deconvolution (transposed convolution), multi-scale reconstruction, and residual learning.
- Luma: the network takes the reconstructed low-resolution block as input and predicts a residual over the DCTIF-interpolated baseline.
- Chroma: three-channel input, exploiting luma-coupled cross-channel cues for refinement.
- Objective: minimize the mean-squared error between the residual-corrected output and the original full-resolution block.
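The residual-learning idea can be sketched with a tiny numpy stand-in. This is an assumption-labelled simplification, not the paper's five-layer network: a short stack of single-channel 3×3 convolutions predicts a correction added to an interpolation baseline (DCTIF in the codec; any baseline here).

```python
import numpy as np

def conv3x3(img, kernel):
    # 3x3 convolution with zero padding, single channel.
    p = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * p[dy:dy + h, dx:dx + w]
    return out

def residual_upsample(baseline_2x, kernels):
    """Hypothetical residual-learning upsampler: intermediate layers use ReLU
    feature maps, the last (linear) layer emits a residual that is added back
    to the interpolation baseline."""
    feat = baseline_2x.astype(float)
    for k in kernels[:-1]:
        feat = np.maximum(conv3x3(feat, k), 0.0)  # ReLU feature layers
    residual = conv3x3(feat, kernels[-1])         # linear reconstruction layer
    return baseline_2x + residual                 # residual learning
```

With all-zero kernels the network is an identity over the baseline, which makes the residual structure explicit: the CNN only has to learn the correction, not the whole mapping.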
Dual ReSFU Block (Zhou et al., 2024):
- Each stage performs a 2× upsampling, receiving (potentially) distinct HR guidance features.
- Internally, several modules are stacked: learnable projections, semantic/detail-aware alignment, parameterized paired central-difference convolution (PCDC) similarity, fine-grained neighbor selection (FNS), and a weighted sum.
- The final output at each HR position $p$ is a softmax-weighted sum over its selected LR neighbors, $\hat{F}(p) = \sum_{q \in \mathcal{N}(p)} \operatorname{softmax}_q\big(s(p,q)\big)\, F_{\mathrm{LR}}(q)$, where $s(p,q)$ is the learned similarity and $\mathcal{N}(p)$ the FNS-selected neighborhood.
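The similarity-then-weighted-sum step at the end of each stage can be sketched directly. The function names and tensor layout below are illustrative assumptions; only the softmax-normalized neighbor aggregation itself follows the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(similarities, neighbor_values):
    """Weighted-sum reconstruction: each HR position receives a
    softmax-normalized combination of its selected LR neighbor features.
    similarities:    (N, K) scores for N query positions, K neighbors each.
    neighbor_values: (N, K, C) LR feature vectors of those neighbors."""
    w = softmax(similarities, axis=1)            # (N, K), each row sums to 1
    return np.einsum('nk,nkc->nc', w, neighbor_values)
```

Because the weights are a softmax over similarities, the aggregation stays differentiable and normalized, which is what allows the dual block to be trained end-to-end.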
3. Alignment, Contextual Matching, and Boundary Treatment
Preserving fidelity across block, feature, or image boundaries is a central challenge. Dual up-sample block architectures deploy context-aware mechanisms at each stage.
- Semantic and detail-aware alignment (ReSFU): Semantic alignment leverages a guided filter pushing HR queries into a mutually consistent space with upsampled LR keys; within each local window $\omega_k$ the output is a locally affine transform $\hat{Q}_i = a_k Q_i + b_k$, with coefficients $(a_k, b_k)$ computed from the window's means, variances, and covariances. Detail alignment applies a local Gaussian smoothing filter.
- Boundary conditions (HEVC block coding): At the first stage, only partial boundary context is accessible, so completion (zero-padding, replication) is used. The second up-sample is performed after all CTUs are decoded and uses complete context, ensuring that previous incomplete or erroneous boundaries are corrected specifically at bottom/right edges (Li et al., 2017).
- Fine-grained neighbor selection (FNS): In feature space, FNS computes HR neighborhoods on the upsampled key grid rather than LR grid, mitigating mosaic artifacts and supporting spatially coherent aggregation.
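The semantic alignment step can be illustrated with the classical guided filter, used here as a stand-in for the paper's guided-filter-based alignment; the function names, radius, and regularizer are assumptions, and a single channel is filtered for brevity.

```python
import numpy as np

def box_mean(img, r):
    # Mean filter over a (2r+1)^2 window with edge padding.
    k = 2 * r + 1
    p = np.pad(img, r, mode='edge')
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + h, dx:dx + w]
    return out / (k * k)

def guided_filter_align(query_hr, key_hr, r=2, eps=1e-4):
    """Schematic guided filter: fits a local affine map a*query + b to the
    up-sampled key in each window, so the aligned query agrees with the key's
    local statistics while keeping the query's structure."""
    mq, mk = box_mean(query_hr, r), box_mean(key_hr, r)
    cov = box_mean(query_hr * key_hr, r) - mq * mk   # window covariance
    var = box_mean(query_hr * query_hr, r) - mq * mq # window variance
    a = cov / (var + eps)
    b = mk - a * mq
    return box_mean(a, r) * query_hr + box_mean(b, r)
```

When query and key already agree, the fitted affine map is near-identity, so alignment is (approximately) a no-op; disagreement is resolved in favor of the key's local statistics.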
4. Rate-Distortion Optimization and Block Mode Decisions
Efficiency is maintained through tight rate-distortion (R-D) integration. For each block or CTU:
- Full-res mode: No downsampling, direct coding.
- Low-res mode: Block is downsampled, coded at lower QP (quantization parameter), and then upsampled (via DCTIF or CNN/ReSFU).
- Dynamic selection: For every block, both upsampling algorithms are tested; the one with lower distortion is chosen and signalled by a 1 bit/channel flag. The overall block mode is selected by minimizing the full-resolution R-D cost $J = D + \lambda R$, with distortion $D$ measured against the original full-resolution block.
The Lagrange multiplier and QP used in low-res mode are derived from their full-resolution counterparts, so that mode decisions remain consistent across resolutions (Li et al., 2017).
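The per-block decision logic reduces to two small comparisons. This is a sketch with hypothetical function names and toy inputs; in the codec, the low-res mode's distortion is evaluated at full resolution after up-sampling.

```python
def choose_upsampler(d_dctif, d_cnn):
    """The 1 bit/channel flag: pick the lower-distortion upsampler."""
    return 'cnn' if d_cnn < d_dctif else 'dctif'

def choose_block_mode(d_full, r_full, d_low, r_low, lam):
    """Select full-res vs. low-res coding by minimal R-D cost J = D + lam*R.
    All inputs are per-block measurements; distortions are both taken at
    full resolution so the two costs are directly comparable."""
    j_full = d_full + lam * r_full
    j_low = d_low + lam * r_low
    return ('low', j_low) if j_low < j_full else ('full', j_full)
```

Because both costs are expressed at full resolution, the comparison needs no correction factor beyond the derived low-res $\lambda$/QP.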
5. Learned Similarity and Kernel Aggregation in Feature Upsampling
Recent advances (ReSFU (Zhou et al., 2024)) improve feature upsampling by replacing non-learnable similarity computations with parameterized kernel-based matching:
- Paired central-difference convolution (PCDC): similarity between each query–key pair is computed by a learnable convolution over their paired local (central) differences. This generalizes fixed similarity/aggregation kernels, making the matching explicit for each query–key pair and facilitating flexible spatial aggregation.
- The softmax over patchwise similarities ensures differentiable, normalized neighborhood weighting for reconstruction.
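A deliberately simplified, assumption-labelled version of the PCDC idea can be written in a few lines: both patches are centre-subtracted, their paired differences are scored by learnable weights, and the (negated) score feeds the softmax above. This is a sketch of the mechanism, not the exact operator from Zhou et al. (2024).

```python
import numpy as np

def pcdc_similarity(q_patch, k_patch, weights):
    """Schematic paired central-difference similarity (a simplification of
    PCDC): compare the central-difference structure of a query patch and a
    key patch under a learnable weight mask. Higher (less negative) scores
    mean more similar local structure."""
    c = q_patch.shape[0] // 2
    dq = q_patch - q_patch[c, c]   # central differences of the query patch
    dk = k_patch - k_patch[c, c]   # central differences of the key patch
    return -float(np.sum(weights * (dq - dk) ** 2))
```

Identical local structure scores 0, the maximum; in a full model `weights` would be learned per channel and the scores softmax-normalized across the FNS neighborhood.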
Weight sharing across both stages is configurable. Experiments find that sharing core convolution/projection layers between stages preserves detail, but learning independent guided filter parameters per stage can yield sharper boundaries in the refinement pass (Zhou et al., 2024).
6. Empirical Performance and Application Domains
Dual up-sample blocks have demonstrated significant advantages:
- Intra frame coding (HEVC): Average BD-rate savings (PSNR) over anchor codecs are reported across the generic test classes, with the largest gains on ultra-high-definition content. Chroma components benefit disproportionately from cross-channel CNN upsampling, and the two-stage refinement contributes an additional BD-rate gain on both luma (Y) and chroma (U/V) (Li et al., 2017).
- Dense feature upsampling: ReSFU dual blocks enable direct 4× (or higher) upsampling and are compatible with diverse network backbones for dense prediction tasks, with fine-grained neighbor aggregation reducing common mosaic artifacts and enabling deployment in arbitrary architectures (Zhou et al., 2024).
A notable effect is that at higher spatial resolutions, the portion of blocks/CTUs favoring low-res mode—and thus leveraging dual up-sample blocks—increases, amplifying gains.
7. Variants, Practical Guidelines, and Adaptability
In practice, multiple dual up-sample configurations exist:
- Fully independent stages: Each stage uses a separate upsampling module—best for flexibility and sharpness.
- Partially shared weights: Projection and convolution layers are shared, but adaptive alignment (e.g., guided filter params) remains per-stage.
- Fully shared: Parameter economy at cost of some expressivity, feasible for memory-limited deployments.
Neighbor sets, guided filter radius, and convolution dilation must be adapted per stage (e.g., in feature upsampling: stage 1 uses dilation $2$, stage 2 uses dilation $1$).
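A per-stage configuration might be expressed as below. Only the dilation values (2 for stage 1, 1 for stage 2) come from the text; the radius, neighbor count, and weight-sharing flags are hypothetical placeholders for the kinds of knobs each stage exposes.

```python
# Hypothetical per-stage hyper-parameters for a dual up-sample block: the
# coarse first stage uses a larger dilation/radius than the refinement stage.
DUAL_STAGE_CONFIG = {
    "stage1": {"scale": 2, "dilation": 2, "guided_filter_radius": 2,
               "neighbors": 9, "share_core_weights": True},
    "stage2": {"scale": 2, "dilation": 1, "guided_filter_radius": 1,
               "neighbors": 9, "share_core_weights": True},
}
```

Keeping `share_core_weights` per-stage makes it easy to switch between the fully shared, partially shared, and independent variants described above.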
These architectures are fully differentiable, allowing end-to-end training. This ensures detail from the first stage propagates seamlessly to the refinement stage, in both codec and neural net contexts.
A plausible implication is that future codecs and dense prediction networks will continue to emphasize multi-stage, context-aware up-sample modules, exploiting learnable similarity and dynamic context fusion (Li et al., 2017, Zhou et al., 2024).