Adaptive Receptive Field Routing MoE
- The paper introduces AR²-MoE, a module that adaptively routes pixel-level features by selecting between local-detail and global-context experts for optimal change detection.
- It employs a dual-branch architecture with depth-wise and dilated convolutions alongside a hard gating mechanism using a Straight-Through Estimator for precise expert selection.
- Empirical results demonstrate improved F1 scores on heterogeneous remote sensing datasets, confirming its effectiveness over conventional convolutional blocks and naive multi-kernel fusion.
The Adaptive Receptive Field Routing Mixture-of-Experts (AR²-MoE) module is a pixel-wise dynamic routing block designed to address the challenges of modality-adaptive remote sensing change detection. It is integrated within deep encoder architectures as a specialized residual block replacement, enabling conditional selection between experts specializing in local spatial detail or global semantic context at each spatial location. AR²-MoE employs hard, per-pixel routing controlled by a gating network conditioned on input features and a modality/domain code, facilitating instance-adaptive receptive field assignment that handles both homogeneous and heterogeneous inputs (Shu et al., 21 Jan 2026).
1. Block-Level Architecture
AR²-MoE is structured around two complementary expert branches—local-detail and global-context—coupled by a trainable, pixel-wise, hard routing mechanism. The local-detail expert uses a 3×3 depth-wise separable convolution focused on precise boundary information, while the global-context expert implements a decomposed dilated convolution (standard depth-wise 3×3, followed by depth-wise dilated 3×3 with dilation factor 3, a pointwise 1×1 convolution, and element-wise modulation of the input) to aggregate extensive contextual information and suppress noise.
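As a quick sanity check on the decomposition described above, the effective receptive field of the global-context branch can be worked out with standard stride-1 stacking arithmetic. This is an illustrative sketch; the layer chain follows the text, not any released code:

```python
def rf_after(rf, kernel, dilation=1):
    """Effective receptive field after appending a stride-1 conv layer."""
    return rf + (kernel - 1) * dilation

rf = 1
rf = rf_after(rf, 3)              # depth-wise 3x3              -> RF 3
rf = rf_after(rf, 3, dilation=3)  # depth-wise 3x3, dilation 3  -> RF 9
rf = rf_after(rf, 1)              # pointwise 1x1               -> RF unchanged
```

The stacked branch therefore sees a 9×9 neighborhood at roughly depth-wise cost, which is what lets the global-context expert aggregate wide context cheaply.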
The routing decision at each spatial location is enacted by a gating network producing a binary mask via a Straight-Through Estimator (STE). The overall data flow at a block is as follows, where $X$ denotes the block input and $z$ the domain code:
- Compute $E_{\mathrm{loc}}(X)$ and $E_{\mathrm{gce}}(X)$ as expert outputs.
- Calculate gating probability $g \in [0,1]$ from FiLM-style conditioned features.
- Use STE for hard binary routing $M = \mathbb{1}[g \geq 0.5]$.
- Output: $Y = (1 - M) \odot E_{\mathrm{loc}}(X) + M \odot E_{\mathrm{gce}}(X) + X$.
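A minimal NumPy sketch of the gating path (steps 2–3 above) clarifies the shapes involved. Here the $1\times1$ convolutions are written as channel-mixing matrix products, and `gamma`/`beta` stand in for FiLM parameters that would be produced from the domain code $z$; all names and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(X, W_phi, gamma, beta, w_g):
    """Per-pixel gating. X: (C, H, W); gamma, beta: (C,) FiLM params
    (in the full model these are produced from the domain code z)."""
    # phi: a 1x1 conv is a channel-mixing matmul at each pixel
    P = np.einsum('dc,chw->dhw', W_phi, X)
    # FiLM conditioning
    Hf = gamma[:, None, None] * P + beta[:, None, None]
    # linear head W_g to a scalar per pixel, then sigmoid
    g = sigmoid(np.einsum('c,chw->hw', w_g, Hf))
    # hard routing mask (forward pass of the STE)
    M = (g >= 0.5).astype(X.dtype)
    return g, M

rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
X = rng.standard_normal((C, H, W))
g, M = gate(X, rng.standard_normal((C, C)),
            np.ones(C), np.zeros(C), rng.standard_normal(C))
```

Each pixel gets one soft probability `g` and one binary routing decision `M`, matching the strictly per-pixel selection described above.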
2. Mathematical Formulation
The AR²-MoE module's components are mathematically defined as follows:
- Local-Detail Expert: $E_{\mathrm{loc}}(X) = \mathrm{DWSepConv}_{3\times3}(X)$
- Global-Context Expert: $E_{\mathrm{gce}}(X) = F_{\mathrm{pw}}\big(F_{\mathrm{dilated}}(F_{\mathrm{dw}}(X))\big) \odot X$
where: $F_{\mathrm{dw}}$: $3\times3$ depth-wise conv; $F_{\mathrm{dilated}}$: $3\times3$ depth-wise conv with dilation $3$; $F_{\mathrm{pw}}$: $1\times1$ conv; $\odot$: element-wise re-weighting by the global context.
- Gating Network: For each pixel,
$g = \sigma\big(W_g\,(\gamma(z) \odot \phi(X) + \beta(z))\big)$
where $\phi$ is a $1\times1$ conv projection, $\gamma(z)$ and $\beta(z)$ provide FiLM conditioning, $W_g$ is linear, and $\sigma$ is the sigmoid activation.
- Straight-Through Estimator (STE) for hard gating: $M = \mathbb{1}[g \geq 0.5]$ in the forward pass, with gradients passed straight through to $g$ (i.e., $\partial M / \partial g := 1$) in the backward pass.
- Output Assembly: $Y = (1 - M) \odot E_{\mathrm{loc}}(X) + M \odot E_{\mathrm{gce}}(X) + X$.
3. Training and Inference Procedures
Training and inference in AR²-MoE both utilize the same forward pass, including the deterministic, pixel-wise gating via STE. Gradients during backpropagation pass through the soft probability $g$, not the hard mask $M$, allowing hard routing decisions while maintaining end-to-end differentiability. The module is agnostic to the specific form of the domain code $z$, permitting conditioning on either one-hot or learned codes depending on the setting.
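The STE behavior described here is commonly implemented with the identity `hard = soft + stop_gradient(binarize(soft) - soft)`: the forward value is the hard mask, while gradients flow through the `soft` term. A framework-agnostic NumPy sketch (with `stop_gradient` as a placeholder for the autodiff primitive) shows the forward value:

```python
import numpy as np

def stop_gradient(x):
    # placeholder: in an autodiff framework (e.g. torch.Tensor.detach,
    # jax.lax.stop_gradient) this would block gradient flow
    return x

def ste_threshold(g, threshold=0.5):
    """Forward: hard binary mask. Backward (in a real framework):
    gradients flow through the identity term g, not the threshold."""
    hard = (g >= threshold).astype(g.dtype)
    return g + stop_gradient(hard - g)

g = np.array([0.1, 0.49, 0.5, 0.9])
M = ste_threshold(g)  # forward value equals the hard mask
```

Because the `hard - g` term is wrapped in `stop_gradient`, the gradient of `M` with respect to `g` is exactly 1, which is the straight-through approximation used for training.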
Pseudocode:
```python
def AR2_MoE_forward(X, z):
    # 1. Expert branches
    F_loc = DepthwiseSepConv3x3(X)                 # E_loc(X)
    T1 = DepthwiseConv3x3(X)                       # F_dw(X)
    T2 = DepthwiseDilatedConv3x3(T1, dilation=3)   # F_dilated
    T3 = PointwiseConv1x1(T2)                      # F_pw
    F_gce = T3 * X                                 # E_gce(X)
    # 2. Gating probabilities
    P = phi(X)                                     # 1x1 conv
    H = gamma(z) * P + beta(z)                     # FiLM conditioning
    g = sigmoid(W_g(H))                            # [0,1] per-pixel
    # 3. Hard routing via STE
    M = STE_threshold(g, threshold=0.5)
    # 4. Assemble output
    Y = (1 - M) * F_loc + M * F_gce + X
    return Y  # Gradients flow through g in STE
```
4. Dynamic Behavior and Per-Pixel Adaptation
AR²-MoE implements strict binary, pixel-level decision-making: for each spatial location, the gating network selects exclusively either the local-detail or the global-context expert based on features and the current domain code. This enables precise modeling of multimodal requirements—for example, selecting local experts for optical image boundaries or global experts in SAR images for speckle suppression—without blending incompatible outputs. The mechanism resolves the often-conflicting requirements for high spatial detail and context aggregation in modality-adaptive change detection tasks.
5. Regularization and Losses
To stabilize the discrete gating process and avoid degenerate undecided gates, AR²-MoE employs an entropy penalty on the soft gating probability $g$ during unified pre-training:

$\mathcal{L}_{\mathrm{ent}} = -\frac{1}{|\Omega|} \sum_{p \in \Omega} \big[ g_p \log g_p + (1 - g_p) \log (1 - g_p) \big]$

where $\Omega$ is the set of spatial locations.
This regularizer encourages low-entropy, nearly binary gating decisions, enforcing confident pixel-level expert selection. The entropy penalty is disabled during downstream self-distillation fine-tuning phases.
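A short NumPy sketch of the mean binary-entropy penalty (a standard formulation; the `eps` clamp is an implementation detail added here for numerical stability, not taken from the paper):

```python
import numpy as np

def entropy_penalty(g, eps=1e-8):
    """Mean binary entropy of the soft gate g; minimizing it pushes
    every pixel's gate toward a confident 0 or 1 decision."""
    g = np.clip(g, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(g * np.log(g) + (1.0 - g) * np.log(1.0 - g))))
```

An undecided gate ($g = 0.5$) contributes the maximum $\log 2$ per pixel, while confident gates near 0 or 1 contribute almost nothing, so gradient descent on this term drives the routing toward the nearly binary behavior the STE assumes.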
6. Computational Complexity
The AR²-MoE module imposes a moderate increase in model size and computational cost relative to a standard convolution block, due to its dual-branch architecture and gating:
| Module | Parameters | FLOPs |
|---|---|---|
| Standard $3\times3$ Conv | $9C^2$ ($C$ in/out channels) | $\propto 9C^2 \cdot HW$ |
| AR²-MoE | depth-wise ($\mathcal{O}(C)$) + pointwise ($\mathcal{O}(C^2)$) terms (local + global + gate) | dominated by depth-wise $3\times3$ and $1\times1$ convs |
The overall parameter count is approximately $1/3$ higher than a vanilla convolution. The design leverages inexpensive depth-wise convolutions to minimize overhead and support scalable multi-expert modeling.
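A back-of-envelope parameter count illustrates why the depth-wise design keeps overhead low. The decomposition below is hypothetical (uniform channel width $C$, no biases, a single-scalar gate head); exact figures depend on the paper's layer widths, so this only shows that the depth-wise terms ($9C$) are dwarfed by the pointwise terms ($C^2$):

```python
def conv3x3_params(C):
    return 9 * C * C                      # standard 3x3 conv, C -> C channels

def dwsep3x3_params(C):
    return 9 * C + C * C                  # depth-wise 3x3 + pointwise 1x1

def ar2_moe_params(C):
    """Hypothetical per-block count under the assumptions above."""
    local = dwsep3x3_params(C)            # E_loc branch
    glob = 9 * C + 9 * C + C * C          # dw 3x3 + dw dilated 3x3 + pw 1x1
    gate = C * C + C                      # 1x1 conv phi + linear head W_g
    return local + glob + gate

C = 64
# depth-wise terms (9C = 576) are small next to pointwise terms (C^2 = 4096)
print(conv3x3_params(C), ar2_moe_params(C))
```

Under these assumptions the pointwise $1\times1$ projections account for most of the parameters, which is consistent with the text's point that cheap depth-wise convolutions keep the multi-expert overhead moderate.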
7. Empirical Performance and Ablation Studies
Empirical results show that AR²-MoE consistently outperforms baseline residual blocks and non-dynamically fused multi-kernel approaches on several remote sensing change detection benchmarks, especially where data modalities or geometries are heterogeneous. Ablation studies indicate that placing AR²-MoE at deeper encoder stages improves F1 performance on representative datasets (e.g., LEVIR-CD: baseline 90.34, full AR²-MoE 91.86; HTCD: 95.30 to 96.40; MT-Wuhan: 55.80 to 59.79).
Single-expert controls, in which only the local or global expert is retained (removing routing), underperform: "local only" yields 56.20% and "global only" 59.80% F1 on the challenging MT-Wuhan set, well below the fully routed AR²-MoE. Similarly, naive multi-kernel fusion achieves only 58.45%. This establishes the necessity and effectiveness of pixel-wise dynamic expert routing in achieving high accuracy across varied remote sensing regimes (Shu et al., 21 Jan 2026).