Residual Multi-Scale Modules (RMSM)
- The paper introduces RMSM as a neural network module that aggregates multi-scale features through parallel or hierarchical convolutions with residual addition.
- RMSMs are implemented with 1D, 2D, and 3D convolutional structures and are applied in image, hyperspectral, and time-series domains.
- Empirical results show measurable improvements in benchmarks like ImageNet, COCO, and disparity estimation when RMSMs are incorporated into deep learning architectures.
A Residual Multi-Scale Module (RMSM) is a neural network component designed to infuse multi-scale feature extraction into deep architectures via residual mechanisms. RMSMs have been instantiated in various modalities—including image, hyperspectral, and time series domains—using 1D, 2D, and 3D convolutional structures. Their defining characteristic is the ability to aggregate information across multiple spatial, spectral, or temporal scales within a single residual block, enhancing both representational power and trainability. RMSMs are often core building blocks in architectures for classification, disparity estimation, segmentation, and sequence modeling. Research demonstrates that their inclusion results in demonstrable improvements in task-specific metrics across a range of benchmarks (Gao et al., 2019, Rao et al., 2019, Zhang et al., 2020, Guo et al., 2021).
1. Core Principles and Formulation
RMSMs augment standard residual blocks by introducing parallel or hierarchical sub-paths, each responsible for feature extraction at a distinct scale (e.g., via varying kernel sizes, receptive field depths, or progressive upsampling/downsizing pathways). After scale-specific processing, outputs are merged (by concatenation/fusion or hierarchical summation), and the resulting feature map is added back to the input, in line with the residual paradigm.
A canonical RMSM can be mathematically described as:
- Let $x$ be the block input.
- For $n$ branches $B_1, \dots, B_n$ (corresponding to different receptive fields or kernel sizes), compute feature maps $F_i = B_i(x)$.
- Aggregate: $F = \mathrm{concat}(F_1, \dots, F_n)$.
- Fuse via a $1 \times 1$ (or higher-dimensional) convolution: $\hat{F} = \mathrm{Conv}_{1 \times 1}(F)$.
- Add the residual: $y = x + \hat{F}$.
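The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not any paper's implementation: the "branches" are same-padded 1D convolutions with hypothetical hand-picked kernels, and the 1×1 fusion is reduced to a learned linear mix across branch outputs.

```python
import numpy as np

def branch(x, kernel):
    # "Same"-padded 1D convolution standing in for one scale-specific branch B_i.
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(kernel)], kernel) for i in range(len(x))])

def rmsm_forward(x, kernels, fusion_weights):
    # 1) Scale-specific feature maps F_i = B_i(x)
    feats = [branch(x, k) for k in kernels]
    # 2) Aggregate by stacking (concatenation along the branch/channel axis)
    agg = np.stack(feats)                 # shape: (num_branches, len(x))
    # 3) 1x1 fusion: a linear mix across branches (stand-in for a 1x1 conv)
    fused = fusion_weights @ agg          # shape: (len(x),)
    # 4) Residual addition back to the input
    return x + fused

x = np.array([1.0, 2.0, 3.0, 4.0])
kernels = [np.array([1.0]),               # scale 1 (pointwise)
           np.array([0.25, 0.5, 0.25])]   # scale 3 (local smoothing)
w = np.array([0.5, 0.5])                  # hypothetical fusion weights
y = rmsm_forward(x, kernels, w)
print(y.shape)  # (4,)
```

Note the residual property: if the fusion weights are zero, the block reduces to the identity, which is what makes such modules easy to train when stacked deeply.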
The specific details (e.g., type and number of branches, form of fusion, depth and type of residual links) vary by context and data modality (Gao et al., 2019, Zhang et al., 2020, Guo et al., 2021).
2. Architectural Variants
2.1 Hierarchical Multi-Scale Residual: Res2Net-style
The Res2Net block splits the feature map into channel groups and passes each sequentially through a 3×3 convolution, where each successive group depends on the output of the previous group, forming a hierarchy of receptive fields within a single block. This structure covers a continuum of scale depths and enables granular feature reuse. The intrinsically granular multi-scale approach is compatible with state-of-the-art backbones (ResNet, ResNeXt, DLA), with typical hyperparameters of scale s = 4 and width w = 26 channels per split, and delivers consistent performance gains on large-scale visual benchmarks (Gao et al., 2019).
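The hierarchical splitting can be sketched as follows. This is a toy 1D numpy illustration of the group-wise data flow only (not the Res2Net code): `conv_stub` is a hypothetical per-group scaling that stands in for the 3×3 convolution applied to each split.

```python
import numpy as np

def conv_stub(x, weight):
    # Stand-in for the per-group 3x3 convolution: here, a simple scaling.
    return weight * x

def res2net_split(x, s, weights):
    # Split the channel dimension into s groups (Res2Net's "scale" hyperparameter).
    groups = np.split(x, s)
    outs = [groups[0]]                 # the first group passes through untouched
    prev = None
    for i in range(1, s):
        # Each later group is added to the previous group's output before its
        # convolution, so the effective receptive field grows with each split.
        inp = groups[i] if prev is None else groups[i] + prev
        prev = conv_stub(inp, weights[i])
        outs.append(prev)
    return np.concatenate(outs)        # concatenated groups form the block output

x = np.arange(8, dtype=float)          # 8 "channels"
y = res2net_split(x, s=4, weights=np.ones(4))
print(y.shape)  # (8,)
```

Because group i feeds into group i+1, a single block mixes receptive fields equivalent to one, two, or more stacked 3×3 convolutions, which is the "continuum of scale depths" described above.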
2.2 Parallel Multi-Scale Branching: Hyperspectral and Time-Series Domains
In hyperspectral image classification, RMSMs comprise several parallel 3D convolution branches, each with distinct kernel sizes aligned to spectral or spatial axes. For example:
- Spectral: kernels 1×1×3, 1×1×5, 1×1×7
- Spatial: kernels 1×1×1 (bottleneck) and 3×3×1

Branch outputs are concatenated, fused by a 1×1×1 convolution, and added via an identity shortcut for residual learning (Zhang et al., 2020).
A comparable structure for multivariate time series (e.g., MRC-LSTM for financial series) uses three parallel 1D convolutions (kernel sizes 1, 2, 3), channel-wise concatenation, 1×1 convolutional fusion, and residual addition. No batch normalization or dropout is used inside these modules (Guo et al., 2021).
2.3 Multi-Scale 3D Encoder-Decoders: Stereo Matching
In disparity map estimation, the RMSM operates on 3D cost volumes. The module alternates four-level 3D encoder-decoder stages with stride-2 down/upsampling, doubling/halving channels, and applies lateral residual connections at each level to preserve details. The final output is a refined cost volume enabling accurate disparity regression (Rao et al., 2019). All multi-scale fusions are performed by simple residual addition across scales, with no gating or learned weighting.
3. Integration Strategies and Training Practices
RMSMs are designed as drop-in replacements or augmentations to standard residual blocks, preserving shape compatibility. Integration points include:
- Replacement of standard bottlenecks in CNNs, potentially combining with SE or other attention mechanisms (Gao et al., 2019).
- Interleaving with LSTM/GRU encoders or localized CNN stages for hybrid architectures (Guo et al., 2021).
- Use in encoder-decoder designs for structured prediction tasks (stereo matching, monocular depth) (Rao et al., 2019, Chen et al., 2019).

Training best practices in the literature include batch normalization after each convolution (to stabilize dynamics), dropout optionally near classifier heads, dynamic learning rate schedules, and RMSProp/SGD with moderate batch sizes (Zhang et al., 2020). Hyperparameter tuning focuses on scale (number of branches/splits), kernel sizes, and channel widths.
4. Empirical Impact and Comparative Results
Experimental evidence demonstrates that RMSMs confer substantial improvements over both classical and prior deep learning methods in diverse tasks. For instance:
- Classification: On ImageNet, Res2Net-50 achieves 22.01% top-1 error vs. ResNet-50's 23.85%. On hyperspectral benchmarks, MSRN delivers OA=99.07% and 99.96% (Indian Pines, Pavia U), exceeding predecessors by over 1 percentage point (Zhang et al., 2020, Gao et al., 2019).
- Detection/Segmentation: Object detection on COCO (Faster R-CNN): AP = 33.7 (Res2Net-50) vs. 31.1 (ResNet-50). Semantic segmentation: mIoU improvement of +1.2–1.5 across Pascal VOC settings (Gao et al., 2019).
- Disparity Estimation: The inclusion of a multi-scale residual 3D conv module in MSDC-Net halves error rates on Scene Flow (>1px error: 14.9% vs. 28.7%), with greatest accuracy when both 2D and 3D RMSMs are combined (Rao et al., 2019).
- Time-Series Prediction: The MRC-LSTM model (RMSM + LSTM) outperforms alternative CNN and LSTM structures in cryptocurrency price forecasting (Guo et al., 2021).
5. Comparison with Other Multi-Scale Architectures
RMSMs generalize and subsume several prior multi-scale constructs:
- Inception/FPN: Employ parallel branches with hand-crafted kernel sizes in separate network stages; RMSMs (especially Res2Net) provide a within-block, granular alternative, and are orthogonal to FPN usage (Gao et al., 2019).
- ASPP (DeepLab): Uses parallel dilated convolutions at the head; RMSMs distribute multi-scale processing throughout all network stages.
- Octave Conv/HRNet: Vary feature map resolution rather than receptive field depth; RMSMs maintain resolution while widening scale in depth or kernel span.
A summary table of key RMSM variants:
| Variant | Fusion Mechanism | Multi-Scale Mode |
|---|---|---|
| Res2Net | Hierarchical residual splitting | Depth within block |
| MS-ResNet/MSRN | Parallel concat + 1x1x1 fusion | Kernel size (3D conv) |
| MRC-LSTM | Parallel concat + 1x1 fusion | Kernel size (1D conv) |
| MSDC-Net RMSM | Encoder-decoder + residual adds | Down/up-sampling |
6. Limitations and Considerations
While RMSMs provide consistent performance boosts, there are practical limitations:
- Parallelism: Hierarchical splitting (Res2Net) introduces sequential dependencies between groups that limit parallelism for large numbers of splits (e.g., scale s ≥ 5) (Gao et al., 2019).
- Parameter Increase: Parallel-branch RMSMs increase parameter and FLOP counts relative to single-scale residual blocks.
- Effectiveness with Image Size: For small spatial inputs (e.g., the 32×32 images of CIFAR), the benefit saturates at moderate scale values (Gao et al., 2019).
- Kernel Size Limits: In 1D and 3D formulations, very large kernels or deep stacking may be needed to capture extremely long-range dependencies (as in temporal or disparity contexts) (Guo et al., 2021, Rao et al., 2019).
7. Broader Applications and Extensions
RMSMs have been successfully leveraged across domains:
- Vision: Backbone improvements for image classification, detection, segmentation, saliency, and localization (Gao et al., 2019).
- 3D Geometry: Cost-volume refinement in stereo, yielding finer disparity estimation (Rao et al., 2019).
- Hyperspectral Processing: Joint spectral-spatial feature fusion for remote sensing (Zhang et al., 2020).
- Time-Series Analysis: Adaptive multi-scale temporal modeling for financial forecasting (Guo et al., 2021).
The RMSM concept continues to be extended, including its fusion with dense feature fusion (for global-local reconstruction) (Chen et al., 2019), and its compatibility with various backbone and attention designs.
References:
- "Res2Net: A New Multi-scale Backbone Architecture" (Gao et al., 2019)
- "MSDC-Net: Multi-Scale Dense and Contextual Networks for Automated Disparity Map for Stereo Matching" (Rao et al., 2019)
- "Hyperspectral Images Classification Based on Multi-scale Residual Network" (Zhang et al., 2020)
- "MRC-LSTM: A Hybrid Approach of Multi-scale Residual CNN and LSTM to Predict Bitcoin Price" (Guo et al., 2021)
- "Structure-Aware Residual Pyramid Network for Monocular Depth Estimation" (Chen et al., 2019)