CRME: Cross-Resolution Mutual Enhancement Module
- CRME is an architectural unit that enables bidirectional, spatially aligned feature exchange across different resolutions and modalities.
- It employs scale- and modality-adaptive attention mechanisms with blockwise cross-attention to fuse global and local features efficiently.
- CRME overcomes naive fusion drawbacks by preserving fine texture details and global context while reducing computational costs via subdivided processing.
A Cross-Resolution Mutual Enhancement Module (CRME) is an architectural unit designed to enable efficient, spatially consistent exchange of information between features at different spatial resolutions or from different modalities. CRMEs have emerged as key components for high-resolution vision–language tasks and cross-modal super-resolution, where maintaining both fine local structure and broad semantic context is critical. By leveraging scale- and modality-adaptive attention mechanisms, CRMEs avoid resolution-mismatch losses and the prohibitively expensive quadratic cost of full cross-attention, achieving mutual contextualization and enhancement of dual feature streams with efficient memory and computation.
1. Motivation and Core Objectives
Naive approaches in high-resolution and cross-modal fusion often downsample high-detail features or independently process sub-images, leading to a loss of local detail, spatial misalignment, or fragmented global context. The primary objective of CRME is to enable two-way enhancement between feature streams of differing resolutions (e.g., global-context features and local-detail features, or thermal and optical modalities) while preserving both spatial and semantic precision. The CRME advances prior art by:
- Allowing mutual interaction between local and global features (or LR/HR modality features) at matching positions.
- Avoiding the full-resolution quadratic cost of direct cross-attention by restricting attention to subdivided sub-blocks or grid locations.
- Maintaining full spatial alignment, enabling fine texture transfer and global semantic refinement in unified representations.
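The cost argument behind the second point can be made concrete with a back-of-the-envelope FLOP estimate. This is an illustrative sketch, not a figure from either paper; the token count, channel dimension, and block count below are hypothetical:

```python
# Rough cost comparison: full cross-attention over an L-token map costs
# ~O(L^2 * d), while restricting attention to N equally sized sub-blocks
# costs N * (L/N)^2 * d = L^2 * d / N -- an N-fold reduction.

def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for one QK^T + AV pass."""
    return 2 * num_tokens * num_tokens * dim

L, d, N = 1024, 64, 16                   # hypothetical token count, dim, blocks
full = attention_flops(L, d)             # direct full-resolution cross-attention
blockwise = N * attention_flops(L // N, d)  # N independent sub-block attentions

assert full == N * blockwise             # blockwise is N times cheaper
```

The saving grows linearly with the number of sub-blocks, which is why subdividing before attending is the central efficiency lever of CRME.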
Such functionality is critical in large multimodal LLMs—where global and local visual contexts must be simultaneously understood (Ma et al., 2024)—and in super-resolution settings, such as optics-guided UAV thermal image enhancement, where information from the HR optical domain must be judiciously transferred to the LR thermal channel (Zhao et al., 7 Jan 2026).
2. Canonical CRME Architectures
Two canonical instantiations of CRME have been recently formalized: the Dual-perspective Enhancement Module (DEM/CRME) in high-resolution MLLMs (Ma et al., 2024), and the cross-modal CRME in PCNet for UAV image super-resolution (Zhao et al., 7 Jan 2026).
Dual-Perspective CRME for High-Resolution MLLMs
- Input: High-resolution feature maps for local detail (F_loc) and global context (F_glo), obtained via dual cropping and visual transformer encoding.
- Sub-block Cropping: Both features are re-cropped via global- and local-perspective windows into N spatially aligned sub-blocks.
- Blockwise Cross-Attention: For each of the N blocks, single-head cross-attention is computed between the aligned local and global sub-blocks via learned projections W_q, W_k, W_v, yielding block-enhanced features.
- Recombination and Fusion: Enhanced sub-blocks are stitched back to form spatially aligned full-resolution maps, projected down and concatenated channel-wise for dual-enhanced fusion. Optionally, an average pooling and flattening operation produces final tokens.
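The crop, per-block attention, and stitch steps above can be sketched in NumPy. This is a minimal single-head illustration under assumed square feature maps; all names (`blockwise_cross_attention`, `n_side`) are illustrative, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_cross_attention(F_q, F_kv, W_q, W_k, W_v, n_side):
    """Single-head cross-attention restricted to spatially aligned sub-blocks.

    F_q, F_kv: [H, W, d] aligned feature maps (queries from F_q, keys/values
    from F_kv); n_side: sub-blocks per spatial side (N = n_side**2 total).
    """
    H, W, d = F_q.shape
    bh, bw = H // n_side, W // n_side
    out = np.empty_like(F_q)
    for i in range(n_side):
        for j in range(n_side):
            sl = (slice(i * bh, (i + 1) * bh), slice(j * bw, (j + 1) * bw))
            Q = F_q[sl].reshape(-1, d) @ W_q      # queries from one stream
            K = F_kv[sl].reshape(-1, d) @ W_k     # keys from the other stream
            V = F_kv[sl].reshape(-1, d) @ W_v
            A = softmax(Q @ K.T / np.sqrt(d))     # attention within the block only
            out[sl] = (A @ V).reshape(bh, bw, d)  # stitched back in place
    return out

rng = np.random.default_rng(0)
d = 8
F_glo = rng.standard_normal((16, 16, d))
F_loc = rng.standard_normal((16, 16, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
V_glo = blockwise_cross_attention(F_glo, F_loc, W_q, W_k, W_v, n_side=4)
assert V_glo.shape == (16, 16, d)
```

Because each block only attends within itself, the output stays spatially aligned with the inputs by construction, which is what allows simple stitching in the recombination step.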
Cross-Resolution Mutual Enhancement in Cross-Modal PCNet
- Input: Thermal features F_T (LR) and co-registered optical features F_O (HR).
- Hierarchical Context Encoding: Each stream is processed by a Hierarchical Transformer Layer (HTL).
- Scale-Adaptive Projections: HR optical features are down-projected to LR (a learnable 3×3 Conv with stride 2); LR thermal features are up-projected to HR (ConvTranspose 3×3, or PixelShuffle followed by Conv).
- Mutual Attention: SR branch (thermal queries optical): attention from HTL-encoded thermal tokens to downsampled optical. MC branch (optical queries thermal): attention from optical to upsampled thermal.
- Output: Enhanced thermal and enhanced optical features, spatially consistent and ready for subsequent physics-informed fusion or further HTL processing.
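A shape-level NumPy sketch of the two mutual-attention branches follows. Average pooling and nearest-neighbor upsampling stand in for the learnable stride-2 Conv and ConvTranspose/PixelShuffle projections, and the 1×1 linear Q/K/V projections are omitted for brevity; all function names are illustrative:

```python
import numpy as np

def down2(x):
    """[2H, 2W, C] -> [H, W, C]; stand-in for the stride-2 Conv projection."""
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def up2(x):
    """[H, W, C] -> [2H, 2W, C]; stand-in for ConvTranspose/PixelShuffle."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mutual_attention(F_q, F_kv):
    """Residual cross-attention: F_q queries F_kv at F_q's own resolution."""
    H, W, C = F_q.shape
    Q, K, V = F_q.reshape(-1, C), F_kv.reshape(-1, C), F_kv.reshape(-1, C)
    A = softmax(Q @ K.T / np.sqrt(C))
    return F_q + (A @ V).reshape(H, W, C)   # residual preserves the input stream

rng = np.random.default_rng(1)
C = 4
F_T = rng.standard_normal((8, 8, C))        # LR thermal features
F_O = rng.standard_normal((16, 16, C))      # HR optical features
F_T_tilde = mutual_attention(F_T, down2(F_O))   # SR branch: thermal queries optical
F_O_tilde = mutual_attention(F_O, up2(F_T))     # MC branch: optical queries thermal
assert F_T_tilde.shape == (8, 8, C) and F_O_tilde.shape == (16, 16, C)
```

The key invariant is that each branch's output keeps its own stream's resolution: the other stream is resampled to match before attention, so the enhanced features drop directly into the next stage.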
3. Formal Mathematical Definitions
DEM/CRME Blockwise Cross-Attention (Ma et al., 2024)
Let $G_i^{\mathrm{glo}}$ and $L_i^{\mathrm{glo}}$, $i = 1, \dots, N$, be the aligned global-perspective sub-blocks of the global and local features (each with channel dimension $d$). For each $i$:

$$Q_i = G_i^{\mathrm{glo}} W_q, \qquad K_i = L_i^{\mathrm{glo}} W_k, \qquad V'_i = L_i^{\mathrm{glo}} W_v,$$

$$V_i^{\mathrm{glo}} = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V'_i.$$

Analogous operations are performed for the local-perspective enhancement, with queries drawn from the local sub-blocks. The enhanced sub-blocks are recombined, projected to reduced dimensionality via $W^{\mathrm{glo}}, W^{\mathrm{loc}} \in \mathbb{R}^{d \times d/2}$, then concatenated channel-wise.
Cross-Modal CRME Attention (Zhao et al., 7 Jan 2026)
Define the scale-adaptive projections

$$F_O^{\downarrow} = \mathrm{Conv}_{3\times 3}^{s=2}(\hat{F}_O), \qquad F_T^{\uparrow} = \mathrm{Up}(\hat{F}_T),$$

where $\hat{F}_T$ and $\hat{F}_O$ are the HTL-encoded thermal and optical features. The two branches are then

$$\tilde{F}_T = \hat{F}_T + \mathrm{Softmax}\!\left(\frac{Q_T K_O^{\top}}{\sqrt{d_k}}\right) V_O, \qquad \tilde{F}_O = \hat{F}_O + \mathrm{Softmax}\!\left(\frac{Q_O K_T^{\top}}{\sqrt{d_k}}\right) V_T,$$

with $Q$, $K$, $V$ obtained by 1×1 linear projections ($Q_T$ from $\hat{F}_T$; $K_O, V_O$ from $F_O^{\downarrow}$; $Q_O$ from $\hat{F}_O$; $K_T, V_T$ from $F_T^{\uparrow}$). Residual connections ensure feature preservation and progressive enrichment.
4. Layerwise and Forward-Pass Implementations
Pseudocode: High-Resolution CRME (Ma et al., 2024)
```
Inputs:     F_loc [w_h, h_h, d], F_glo [w_h, h_h, d]
Parameters: W_q, W_k, W_v [d, d], W^glo, W^loc [d, d//2]

# Global-perspective enhancement
G_glo_list = Crop_glo(F_glo)
L_glo_list = Crop_glo(F_loc)
V_glo_blocks = []
for i in range(N):
    Q  = G_glo_list[i] @ W_q
    K  = L_glo_list[i] @ W_k
    Vp = L_glo_list[i] @ W_v
    A  = Softmax((Q @ K.T) / sqrt(d))
    V_glo_blocks.append(A @ Vp)
V_glo = Recombine_glo(V_glo_blocks)

# Local-perspective enhancement
G_loc_list = Crop_loc(F_glo)
L_loc_list = Crop_loc(F_loc)
V_loc_blocks = []
for i in range(N):
    Q  = L_loc_list[i] @ W_q
    K  = G_loc_list[i] @ W_k
    Vp = G_loc_list[i] @ W_v
    A  = Softmax((Q @ K.T) / sqrt(d))
    V_loc_blocks.append(A @ Vp)
V_loc = Recombine_loc(V_loc_blocks)

# Projection, channel fusion, tokenization
Vg_emb = V_glo @ W^glo
Vl_emb = V_loc @ W^loc
V_dual = Concat(Vg_emb, Vl_emb, axis=channel)
V_out  = AvgPool_patch(V_dual)
Tokens = Flatten(V_out)
```
Pseudocode: Cross-Modal CRME (Zhao et al., 7 Jan 2026)
```
def CRME(F_T_in, F_O_in):
    F_T_hat = HTL(F_T_in)                  # [H, W, C]
    F_O_hat = HTL(F_O_in)                  # [2H, 2W, C]
    F_O_down = Conv3x3_stride2(F_O_hat)    # scale-adaptive down-projection
    F_T_up = UpsampleConv(F_T_hat)         # scale-adaptive up-projection

    # SR branch: thermal queries optical
    Q_T = Linear1x1(F_T_hat)
    K_O = Linear1x1(F_O_down)
    V_O = Linear1x1(F_O_down)
    attn_T = Softmax(matmul(Q_T, transpose(K_O)) / sqrt(d_k))
    F_T_tilde = F_T_hat + matmul(attn_T, V_O)

    # MC branch: optical queries thermal
    Q_O = Linear1x1(F_O_hat)
    K_T = Linear1x1(F_T_up)
    V_T = Linear1x1(F_T_up)
    attn_O = Softmax(matmul(Q_O, transpose(K_T)) / sqrt(d_k))
    F_O_tilde = F_O_hat + matmul(attn_O, V_T)

    return F_T_tilde, F_O_tilde
```
5. Design Choices, Hyperparameters, and Integration
Key module hyperparameters are task- and model-specific. Example instantiations:
| Setting | Channel Dim $d$ | Num Heads | Sub-block Size / Window | Projection Layers |
|---|---|---|---|---|
| DEM/CRME (Ma et al., 2024) | 1024 | 1 | $N$ global-/local-perspective sub-blocks | $W^{\mathrm{glo}}, W^{\mathrm{loc}}$: linear, $d \to d/2$ |
| PCNet CRME (Zhao et al., 7 Jan 2026) | 96 | 6 | Window 8 (HTL) | Down: Conv 3×3, stride 2; Up: ConvTranspose 3×3 |
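The two instantiations' hyperparameters could be captured as configuration records; a hypothetical sketch, with field names chosen here for illustration and values as reported above:

```python
# Hypothetical configuration records mirroring the table above.
# Field names are illustrative; values follow the respective papers.
CRME_CONFIGS = {
    "dem_crme_ma2024": {
        "channel_dim": 1024,
        "num_heads": 1,
        "projection": "linear d -> d/2 (W_glo, W_loc)",
    },
    "pcnet_crme_zhao2026": {
        "channel_dim": 96,
        "num_heads": 6,
        "window_size": 8,   # HTL window attention
        "down_projection": "Conv 3x3, stride 2",
        "up_projection": "ConvTranspose 3x3 (or PixelShuffle + Conv)",
    },
}

assert CRME_CONFIGS["dem_crme_ma2024"]["num_heads"] == 1
```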
- Pooling: In DEM/CRME, a final average-pool produces the visual tokens; PCNet instead cascades CRMEs across multiple stages.
- Integration: In PCNet, CRMEs are followed by physics-driven modules imposing Laplacian diffusion constraints.
- Normalization: No explicit LayerNorm or dropout is specified; self-correlation in HTL provides implicit feature stabilization.
- Residual connections: Applied at key enhancement stage outputs to preserve input information.
6. Empirical Effectiveness and Significance
The mutual cross-attention and dual-path architecture of CRMEs has been empirically shown to yield substantial gains on both multimodal language benchmarks and cross-modal super-resolution tasks:
- High-resolution MLLMs (Ma et al., 2024): On LLaVA-wild, DEM/CRME with both global and local enhancements outperformed single-path ablations by +4.6 points (67.5 vs. 62.9/62.7). Fusion by simple addition of features achieved only 60.1, verifying that blockwise cross-attention and explicit channel fusion are critical.
- Optics-guided thermal SR (Zhao et al., 7 Jan 2026): Preserving and adaptively projecting HR optical priors into LR thermal features via CRME produced sharper textures and fewer artifacts than zero-shot alignment or naive resizing, as measured by reconstruction, segmentation, and detection metrics.
A plausible implication is that the two-way, resolution-adaptive mutual enhancement in CRMEs makes them generally applicable beyond vision-language and cross-modal SR, whenever multi-scale or multi-modal alignment and feature enrichment are required with strict spatial correspondence.
7. Comparison to Prior and Related Designs
CRMEs represent a departure from previous cropping-only or dual-encoder merger approaches:
- Prior cropping-only methods: Processed windowed patches independently, losing global context and failing to enable mutual enhancement between neighboring sub-blocks.
- Dual-encoder fusion: Required heavy, separate encoders for each feature stream, with fusion often occurring at a single scale or via parameter-intensive backbone sharing.
- CRME advances: Employs symmetric re-cropping, local cross-attention confined to paired sub-blocks, and projection-fusion steps that preserve both efficiency and fine-grained alignment.
This has positioned CRME as a preferred module for tasks involving data with mismatched resolutions or modalities, subject to constraints on memory and computation and the demand for detailed, contextually grounded predictions.
References:
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal LLM (Ma et al., 2024)
- Physics-Constrained Cross-Resolution Enhancement Network for Optics-Guided Thermal UAV Image Super-Resolution (Zhao et al., 7 Jan 2026)