Adaptive Receptive Field Routing MoE
- The paper introduces AR²-MoE, a module that adaptively routes pixel-level features by selecting between local-detail and global-context experts for optimal change detection.
- It employs a dual-branch architecture with depth-wise and dilated convolutions alongside a hard gating mechanism using a Straight-Through Estimator for precise expert selection.
- Empirical results demonstrate improved F1 scores on heterogeneous remote sensing datasets, confirming its effectiveness over conventional convolutional blocks and naive multi-kernel fusion.
The Adaptive Receptive Field Routing Mixture-of-Experts (AR²-MoE) module is a pixel-wise dynamic routing block designed to address the challenges of modality-adaptive remote sensing change detection. It is integrated within deep encoder architectures as a specialized residual block replacement, enabling conditional selection between experts specializing in local spatial detail or global semantic context at each spatial location. AR²-MoE employs hard, per-pixel routing controlled by a gating network conditioned on input features and a modality/domain code, facilitating instance-adaptive receptive field assignment that handles both homogeneous and heterogeneous inputs (Shu et al., 21 Jan 2026).
1. Block-Level Architecture
AR²-MoE is structured around two complementary expert branches—local-detail and global-context—coupled by a trainable, pixel-wise, hard routing mechanism. The local-detail expert uses a 3×3 depth-wise separable convolution focused on precise boundary information, while the global-context expert implements a decomposed dilated convolution (standard depth-wise 3×3, followed by depth-wise dilated 3×3 with dilation factor 3, a pointwise 1×1 convolution, and element-wise modulation of the input) to aggregate extensive contextual information and suppress noise.
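As a quick sanity check on the decomposition described above, the effective receptive field of the global-context branch can be worked out with standard stride-1 stacking arithmetic. This is an illustrative sketch; the layer chain follows the text, not any released code:

```python
def rf_after(rf, kernel, dilation=1):
    """Effective receptive field after appending a stride-1 conv layer."""
    return rf + (kernel - 1) * dilation

rf = 1
rf = rf_after(rf, 3)              # depth-wise 3x3              -> RF 3
rf = rf_after(rf, 3, dilation=3)  # depth-wise 3x3, dilation 3  -> RF 9
rf = rf_after(rf, 1)              # pointwise 1x1               -> RF unchanged
```

The stacked branch therefore sees a 9×9 neighborhood at roughly depth-wise cost, which is what lets the global-context expert aggregate wide context cheaply.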
The routing decision at each spatial location is enacted by a gating network producing a binary mask via a Straight-Through Estimator (STE). The overall data flow at a block is as follows, where $X$ denotes the block input and $z$ the domain code:
- Compute $E_{\mathrm{loc}}(X)$ and $E_{\mathrm{gce}}(X)$ as expert outputs.
- Calculate gating probability $g \in [0,1]$ from FiLM-style conditioned features.
- Use STE for hard binary routing $M = \mathbb{1}[g \geq 0.5]$.
- Output: $Y = (1 - M) \odot E_{\mathrm{loc}}(X) + M \odot E_{\mathrm{gce}}(X) + X$.
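A minimal NumPy sketch of the gating path (steps 2–3 above) clarifies the shapes involved. Here the $1\times1$ convolutions are written as channel-mixing matrix products, and `gamma`/`beta` stand in for FiLM parameters that would be produced from the domain code $z$; all names and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(X, W_phi, gamma, beta, w_g):
    """Per-pixel gating. X: (C, H, W); gamma, beta: (C,) FiLM params
    (in the full model these are produced from the domain code z)."""
    # phi: a 1x1 conv is a channel-mixing matmul at each pixel
    P = np.einsum('dc,chw->dhw', W_phi, X)
    # FiLM conditioning
    Hf = gamma[:, None, None] * P + beta[:, None, None]
    # linear head W_g to a scalar per pixel, then sigmoid
    g = sigmoid(np.einsum('c,chw->hw', w_g, Hf))
    # hard routing mask (forward pass of the STE)
    M = (g >= 0.5).astype(X.dtype)
    return g, M

rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
X = rng.standard_normal((C, H, W))
g, M = gate(X, rng.standard_normal((C, C)),
            np.ones(C), np.zeros(C), rng.standard_normal(C))
```

Each pixel gets one soft probability `g` and one binary routing decision `M`, matching the strictly per-pixel selection described above.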
2. Mathematical Formulation
The AR²-MoE module's components are mathematically defined as follows:
- Local-Detail Expert: $E_{\mathrm{loc}}(X) = \mathrm{DWSepConv}_{3\times3}(X)$
- Global-Context Expert: $E_{\mathrm{gce}}(X) = F_{\mathrm{pw}}\big(F_{\mathrm{dilated}}(F_{\mathrm{dw}}(X))\big) \odot X$
where: $F_{\mathrm{dw}}$: $3\times3$ depth-wise conv; $F_{\mathrm{dilated}}$: $3\times3$ depth-wise conv with dilation $3$; $F_{\mathrm{pw}}$: $1\times1$ conv; $\odot$: element-wise re-weighting by the global context.
- Gating Network: For each pixel,
$g = \sigma\big(W_g\,(\gamma(z) \odot \phi(X) + \beta(z))\big)$
where $\phi$ is a $1\times1$ conv projection, $\gamma(z)$ and $\beta(z)$ provide FiLM conditioning, $W_g$ is linear, and $\sigma$ is the sigmoid activation.
- Straight-Through Estimator (STE) for hard gating: $M = \mathbb{1}[g \geq 0.5]$ in the forward pass, with gradients passed straight through to $g$ (i.e., $\partial M / \partial g := 1$) in the backward pass.
- Output Assembly: $Y = (1 - M) \odot E_{\mathrm{loc}}(X) + M \odot E_{\mathrm{gce}}(X) + X$.
3. Training and Inference Procedures
Training and inference in AR²-MoE both utilize the same forward pass, including the deterministic, pixel-wise gating via STE. Gradients during backpropagation pass through the soft probability $g$, not the hard mask $M$, allowing hard routing decisions while maintaining end-to-end differentiability. The module is agnostic to the specific form of the domain code $z$, permitting conditioning on either one-hot or learned codes depending on the setting.
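The STE behavior described here is commonly implemented with the identity `hard = soft + stop_gradient(binarize(soft) - soft)`: the forward value is the hard mask, while gradients flow through the `soft` term. A framework-agnostic NumPy sketch (with `stop_gradient` as a placeholder for the autodiff primitive) shows the forward value:

```python
import numpy as np

def stop_gradient(x):
    # placeholder: in an autodiff framework (e.g. torch.Tensor.detach,
    # jax.lax.stop_gradient) this would block gradient flow
    return x

def ste_threshold(g, threshold=0.5):
    """Forward: hard binary mask. Backward (in a real framework):
    gradients flow through the identity term g, not the threshold."""
    hard = (g >= threshold).astype(g.dtype)
    return g + stop_gradient(hard - g)

g = np.array([0.1, 0.49, 0.5, 0.9])
M = ste_threshold(g)  # forward value equals the hard mask
```

Because the `hard - g` term is wrapped in `stop_gradient`, the gradient of `M` with respect to `g` is exactly 1, which is the straight-through approximation used for training.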
Pseudocode:
```python
def AR2_MoE_forward(X, z):
    # 1. Expert branches
    F_loc = DepthwiseSepConv3x3(X)                 # E_loc(X)
    T1 = DepthwiseConv3x3(X)                       # F_dw(X)
    T2 = DepthwiseDilatedConv3x3(T1, dilation=3)   # F_dilated
    T3 = PointwiseConv1x1(T2)                      # F_pw
    F_gce = T3 * X                                 # E_gce(X)
    # 2. Gating probabilities
    P = phi(X)                                     # 1x1 conv
    H = gamma(z) * P + beta(z)                     # FiLM conditioning
    g = sigmoid(W_g(H))                            # [0,1] per-pixel
    # 3. Hard routing via STE
    M = STE_threshold(g, threshold=0.5)
    # 4. Assemble output
    Y = (1 - M) * F_loc + M * F_gce + X
    return Y  # Gradients flow through g in STE
```
4. Dynamic Behavior and Per-Pixel Adaptation
AR²-MoE implements strict binary, pixel-level decision-making: for each spatial location, the gating network selects exclusively either the local-detail or the global-context expert based on features and the current domain code. This enables precise modeling of multimodal requirements—for example, selecting local experts for optical image boundaries or global experts in SAR images for speckle suppression—without blending incompatible outputs. The mechanism resolves the often-conflicting requirements for high spatial detail and context aggregation in modality-adaptive change detection tasks.
5. Regularization and Losses
To stabilize the discrete gating process and avoid degenerate undecided gates, AR²-MoE employs an entropy penalty on the soft gating probability $g$ during unified pre-training:

$\mathcal{L}_{\mathrm{ent}} = -\frac{1}{|\Omega|} \sum_{p \in \Omega} \big[ g_p \log g_p + (1 - g_p) \log (1 - g_p) \big]$

where $\Omega$ is the set of spatial locations.
This regularizer encourages low-entropy, nearly binary gating decisions, enforcing confident pixel-level expert selection. The entropy penalty is disabled during downstream self-distillation fine-tuning phases.
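A short NumPy sketch of the mean binary-entropy penalty (a standard formulation; the `eps` clamp is an implementation detail added here for numerical stability, not taken from the paper):

```python
import numpy as np

def entropy_penalty(g, eps=1e-8):
    """Mean binary entropy of the soft gate g; minimizing it pushes
    every pixel's gate toward a confident 0 or 1 decision."""
    g = np.clip(g, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(g * np.log(g) + (1.0 - g) * np.log(1.0 - g))))
```

An undecided gate ($g = 0.5$) contributes the maximum $\log 2$ per pixel, while confident gates near 0 or 1 contribute almost nothing, so gradient descent on this term drives the routing toward the nearly binary behavior the STE assumes.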
6. Computational Complexity
The AR²-MoE module imposes a moderate increase in model size and computational cost relative to a standard convolution block, due to its dual-branch architecture and gating:
| Module | Parameters | FLOPs |
|---|---|---|
| Standard $3\times3$ Conv | $9C^2$ ($C$ in/out channels) | $\propto 9C^2 \cdot HW$ |
| AR²-MoE | depth-wise ($\mathcal{O}(C)$) + pointwise ($\mathcal{O}(C^2)$) terms (local + global + gate) | dominated by depth-wise $3\times3$ and $1\times1$ convs |
The overall parameter count is approximately $1/3$ higher than a vanilla convolution. The design leverages inexpensive depth-wise convolutions to minimize overhead and support scalable multi-expert modeling.
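A back-of-envelope parameter count illustrates why the depth-wise design keeps overhead low. The decomposition below is hypothetical (uniform channel width $C$, no biases, a single-scalar gate head); exact figures depend on the paper's layer widths, so this only shows that the depth-wise terms ($9C$) are dwarfed by the pointwise terms ($C^2$):

```python
def conv3x3_params(C):
    return 9 * C * C                      # standard 3x3 conv, C -> C channels

def dwsep3x3_params(C):
    return 9 * C + C * C                  # depth-wise 3x3 + pointwise 1x1

def ar2_moe_params(C):
    """Hypothetical per-block count under the assumptions above."""
    local = dwsep3x3_params(C)            # E_loc branch
    glob = 9 * C + 9 * C + C * C          # dw 3x3 + dw dilated 3x3 + pw 1x1
    gate = C * C + C                      # 1x1 conv phi + linear head W_g
    return local + glob + gate

C = 64
# depth-wise terms (9C = 576) are small next to pointwise terms (C^2 = 4096)
print(conv3x3_params(C), ar2_moe_params(C))
```

Under these assumptions the pointwise $1\times1$ projections account for most of the parameters, which is consistent with the text's point that cheap depth-wise convolutions keep the multi-expert overhead moderate.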
7. Empirical Performance and Ablation Studies
Empirical results show that AR²-MoE consistently outperforms baseline residual blocks and non-dynamically fused multi-kernel approaches on several remote sensing change detection benchmarks, especially where data modalities or geometries are heterogeneous. Ablation studies indicate that placing AR²-MoE at deeper encoder stages improves F1 performance on representative datasets (e.g., LEVIR-CD: baseline 90.34, full AR²-MoE 91.86; HTCD: 95.30 to 96.40; MT-Wuhan: 55.80 to 59.79).
Single-expert controls, in which only the local or global expert is retained (removing routing), underperform: "local only" yields 56.20% and "global only" 59.80% F1 on the challenging MT-Wuhan set, well below the fully routed AR²-MoE. Similarly, naive multi-kernel fusion achieves only 58.45%. This establishes the necessity and effectiveness of pixel-wise dynamic expert routing in achieving high accuracy across varied remote sensing regimes (Shu et al., 21 Jan 2026).