
CRME: Cross-Resolution Mutual Enhancement Module

Updated 14 January 2026
  • CRME is an architectural unit that enables bidirectional, spatially aligned feature exchange across different resolutions and modalities.
  • It employs scale- and modality-adaptive attention mechanisms with blockwise cross-attention to fuse global and local features efficiently.
  • CRME overcomes naive fusion drawbacks by preserving fine texture details and global context while reducing computational costs via subdivided processing.

A Cross-Resolution Mutual Enhancement Module (CRME) is an architectural unit designed to enable efficient, spatially consistent exchange of information between features at different spatial resolutions or from different modalities. CRMEs have emerged as key components for high-resolution vision–language tasks and cross-modal super-resolution, where maintaining both fine local structure and broad semantic context is critical. By leveraging scale- and modality-adaptive attention mechanisms, CRMEs avoid both the information loss of resolution mismatch and the prohibitive quadratic cost of full-resolution attention, achieving mutual contextualization and enhancement of dual feature streams with modest memory and computation.

1. Motivation and Core Objectives

Naive approaches to high-resolution and cross-modal fusion often downsample high-detail features or process sub-images independently, leading to loss of local detail, spatial misalignment, or fragmented global context. The primary objective of CRME is to enable two-way enhancement between feature streams of differing resolutions (e.g., global-context and local-detail features, or thermal and optical modalities) while preserving both spatial and semantic precision. CRME advances prior art by:

  • Allowing mutual interaction between local and global features (or LR/HR modality features) at matching positions.
  • Avoiding the full-resolution quadratic cost of direct cross-attention by restricting attention to subdivided sub-blocks or grid locations.
  • Maintaining full spatial alignment, enabling fine texture transfer and global semantic refinement in unified representations.
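
To make the cost argument concrete, here is a back-of-the-envelope comparison of full-resolution cross-attention against blockwise cross-attention over sub-blocks. The grid size and channel width below are illustrative choices, not values from the papers; only the $24\times24$ sub-block side follows the DEM/CRME setting.

```python
# Rough multiply-add count for the attention score/value matmuls: attention
# over T tokens of width d costs on the order of T^2 * d operations, so
# restricting attention to N independent sub-blocks divides the cost by N.
def attn_flops(tokens, d):
    return tokens * tokens * d

W = H = 96                          # illustrative full-resolution grid
d = 1024                            # channel width
block = 24                          # sub-block side (24x24 windows)
N = (W // block) * (H // block)     # number of sub-blocks

full_cost = attn_flops(W * H, d)
blockwise_cost = N * attn_flops(block * block, d)

print(f"full: {full_cost:.3e}, blockwise: {blockwise_cost:.3e}, "
      f"ratio: {full_cost / blockwise_cost:.0f}x")
```

The ratio equals the number of sub-blocks $N$: each block attends over $1/N$ of the tokens, and there are $N$ of them, so the quadratic term shrinks by a factor of $N$.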

Such functionality is critical in large multimodal LLMs—where global and local visual contexts must be simultaneously understood (Ma et al., 2024)—and in super-resolution settings, such as optics-guided UAV thermal image enhancement, where information from the HR optical domain must be judiciously transferred to the LR thermal channel (Zhao et al., 7 Jan 2026).

2. Canonical CRME Architectures

Two canonical instantiations of CRME have been recently formalized: the Dual-perspective Enhancement Module (DEM/CRME) in high-resolution MLLMs (Ma et al., 2024), and the cross-modal CRME in PCNet for UAV image super-resolution (Zhao et al., 7 Jan 2026).

Dual-Perspective CRME for High-Resolution MLLMs

  • Input: High-resolution feature maps for local detail ($\mathbf F^{\rm loc}\in \mathbb R^{w_h\times h_h\times d}$) and global context ($\mathbf F^{\rm glo}\in \mathbb R^{w_h\times h_h\times d}$), obtained via dual cropping and visual transformer encoding.
  • Sub-block Cropping: Both features are re-cropped via global- and local-perspective windows into $N$ sub-blocks of size $w_l\times h_l\times d$.
  • Blockwise Cross-Attention: For each of the $N$ blocks, cross-attention is computed between the aligned local and global sub-blocks via learned projections $W_q, W_k, W_v$ with single-head attention, yielding block-enhanced features.
  • Recombination and Fusion: Enhanced sub-blocks are stitched back to form spatially aligned full-resolution maps, projected down and concatenated channel-wise for dual-enhanced fusion. Optionally, an average pooling and flattening operation produces final tokens.
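
The cropping and recombination steps above can be implemented with reshape/transpose tricks. The sketch below is a minimal NumPy illustration of that bookkeeping with small illustrative sizes; the `crop`/`recombine` helpers are assumptions of this write-up, not names from the paper.

```python
import numpy as np

# Partition a [w_h, h_h, d] map into N non-overlapping [w_l, h_l, d]
# sub-blocks and stitch them back, losslessly, via reshape + transpose.
w_h = h_h = 8
d = 4
w_l = h_l = 4                                # sub-block side
F = np.arange(w_h * h_h * d, dtype=float).reshape(w_h, h_h, d)

def crop(F, w_l, h_l):
    w_h, h_h, d = F.shape
    return (F.reshape(w_h // w_l, w_l, h_h // h_l, h_l, d)
             .transpose(0, 2, 1, 3, 4)       # group block indices together
             .reshape(-1, w_l, h_l, d))      # [N, w_l, h_l, d]

def recombine(blocks, w_h, h_h):
    N, w_l, h_l, d = blocks.shape
    return (blocks.reshape(w_h // w_l, h_h // h_l, w_l, h_l, d)
                  .transpose(0, 2, 1, 3, 4)  # interleave blocks back
                  .reshape(w_h, h_h, d))

blocks = crop(F, w_l, h_l)
assert blocks.shape == (4, w_l, h_l, d)      # N = 4 sub-blocks here
assert np.array_equal(recombine(blocks, w_h, h_h), F)  # lossless round trip
```

Because the round trip is exact, enhanced sub-blocks can be stitched back into a spatially aligned full-resolution map without interpolation artifacts.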

Cross-Resolution Mutual Enhancement in Cross-Modal PCNet

  • Input: Thermal features $F_T^{(\mathrm{in})}\in \mathbb{R}^{H\times W\times C}$ (LR) and co-registered optical features $F_O^{(\mathrm{in})}\in \mathbb{R}^{2H \times 2W \times C}$ (HR).
  • Hierarchical Context Encoding: Each stream is processed by a Hierarchical Transformer Layer (HTL).
  • Scale-Adaptive Projections: HR optical is down-projected to LR ($\mathcal P_\downarrow$, a learnable $3\times3$ Conv with stride 2); LR thermal is up-projected to HR ($\mathcal P_\uparrow$, a $3\times3$ ConvTranspose or PixelShuffle+Conv).
  • Mutual Attention: SR branch (thermal queries optical): attention from HTL-encoded thermal tokens to downsampled optical. MC branch (optical queries thermal): attention from optical to upsampled thermal.
  • Output: Enhanced thermal features $\tilde F_T$ and enhanced optical features $\tilde F_O$, spatially consistent and ready for subsequent physics-informed fusion or further HTL processing.
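
The shape bookkeeping of the two scale-adaptive projections can be checked with a small NumPy sketch. Nearest-neighbour resampling stands in for the learnable $\mathcal P_\downarrow$ (3×3 Conv, stride 2) and $\mathcal P_\uparrow$ (ConvTranspose/PixelShuffle); only the spatial arithmetic here reflects the module, not its learned weights.

```python
import numpy as np

# Illustrative sizes; C = 96 matches the PCNet channel width.
H, W, C = 8, 8, 96
F_T = np.random.randn(H, W, C)           # LR thermal features
F_O = np.random.randn(2 * H, 2 * W, C)   # HR optical features

def down2(x):                            # stand-in for P_down (stride-2 Conv)
    return x[::2, ::2, :]

def up2(x):                              # stand-in for P_up (ConvTranspose)
    return x.repeat(2, axis=0).repeat(2, axis=1)

F_O_bar = down2(F_O)                     # optical brought onto the LR grid
F_T_bar = up2(F_T)                       # thermal brought onto the HR grid

assert F_O_bar.shape == F_T.shape        # SR branch: K/V align with thermal Q
assert F_T_bar.shape == F_O.shape        # MC branch: K/V align with optical Q
```

The point of the projections is exactly these two shape matches: each branch's keys and values land on the same grid as its queries, so attention is computed between spatially corresponding positions.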

3. Formal Mathematical Definitions

Let $\{\mathbf G^{\rm glo}_i\}_{i=1}^N$ and $\{\mathbf L^{\rm glo}_i\}_{i=1}^N$ be the global-perspective sub-blocks of the global and local features, respectively (each $\in \mathbb R^{w_l\times h_l\times d}$). For each $i$:

$$\mathbf Q_i = \mathbf G^{\rm glo}_i W_q,\quad \mathbf K_i = \mathbf L^{\rm glo}_i W_k,\quad \mathbf V'_i = \mathbf L^{\rm glo}_i W_v$$

$$\mathbf A^{\rm glo}_i = \operatorname{Softmax}\!\left(\frac{\mathbf Q_i \mathbf K_i^\top}{\sqrt{d}}\right)$$

$$\mathbf V^{\rm glo}_i = \mathbf A^{\rm glo}_i \mathbf V'_i$$

Analogous operations are performed for the local-perspective enhancement. The $N$ enhanced sub-blocks are recombined, projected to reduced dimensionality via $W^{\mathrm{glo}}, W^{\mathrm{loc}}$, then concatenated.
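
A minimal NumPy sketch of one blockwise cross-attention step, following the equations above: the global sub-block supplies queries, the aligned local sub-block supplies keys and values. Dimensions are illustrative, not the 1024-d DEM/CRME setting.

```python
import numpy as np

rng = np.random.default_rng(0)
w_l, h_l, d = 4, 4, 32
T = w_l * h_l                                   # tokens per sub-block

G_i = rng.standard_normal((T, d))               # flattened global sub-block
L_i = rng.standard_normal((T, d))               # flattened local sub-block
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, Vp = G_i @ W_q, L_i @ W_k, L_i @ W_v      # learned projections
A = softmax(Q @ K.T / np.sqrt(d))               # [T, T] attention map
V_glo_i = A @ Vp                                # enhanced sub-block, [T, d]

assert V_glo_i.shape == (T, d)
assert np.allclose(A.sum(axis=-1), 1.0)         # rows are proper distributions
```

The local-perspective branch is the mirror image: swap the roles of `G_i` and `L_i` so local sub-blocks query global ones.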

For the cross-modal PCNet instantiation, define $\hat F_T = \mathrm{HTL}(F_T^{(\mathrm{in})})$ and $\hat F_O = \mathrm{HTL}(F_O^{(\mathrm{in})})$. Then:

$$\bar{F}_O = \mathcal{P}_\downarrow(\hat F_O), \qquad \bar{F}_T = \mathcal{P}_\uparrow(\hat F_T)$$

$$Q_T = \hat F_T W_Q^T, \qquad K_O = \bar{F}_O W_K^T, \qquad V_O = \bar{F}_O W_V^T$$

$$\tilde F_T = \hat F_T + \operatorname{Softmax}\!\left( \frac{Q_T K_O^\top}{\sqrt{d_k}} \right)V_O$$

$$Q_O = \hat F_O W_Q^O, \qquad K_T = \bar{F}_T W_K^O, \qquad V_T = \bar{F}_T W_V^O$$

$$\tilde F_O = \hat F_O + \operatorname{Softmax}\!\left( \frac{Q_O K_T^\top}{\sqrt{d_k}} \right)V_T$$

Residual connections ensure feature preservation and progressive enrichment.
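
The SR-branch update $\tilde F_T = \hat F_T + \mathrm{Softmax}(Q_T K_O^\top/\sqrt{d_k})\,V_O$ can be sketched in NumPy as follows. Nearest-neighbour downsampling stands in for $\mathcal P_\downarrow$, random matrices stand in for the $1\times1$ projections, and $d_k = C$ is chosen so the residual addition is well-typed without an extra output projection; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C = 6, 6, 16
d_k = C                                              # keeps the residual well-typed

F_T_hat = rng.standard_normal((H * W, C))            # LR thermal tokens
F_O_hat = rng.standard_normal((2 * H, 2 * W, C))     # HR optical map
F_O_bar = F_O_hat[::2, ::2, :].reshape(H * W, C)     # stand-in for P_down

W_Q, W_K, W_V = (rng.standard_normal((C, d_k)) / np.sqrt(C) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q_T, K_O, V_O = F_T_hat @ W_Q, F_O_bar @ W_K, F_O_bar @ W_V
F_T_tilde = F_T_hat + softmax(Q_T @ K_O.T / np.sqrt(d_k)) @ V_O   # residual update

assert F_T_tilde.shape == F_T_hat.shape              # enhancement preserves shape
```

The MC branch is symmetric: optical tokens query the up-projected thermal features, with its own projection matrices.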

4. Layerwise and Forward-Pass Implementations

# Dual-perspective CRME (DEM) forward pass, pseudocode
Inputs: F_loc [w_h, h_h, d], F_glo [w_h, h_h, d]
Parameters: W_q, W_k, W_v [d, d]; W_glo, W_loc [d, d//2]

# Global-perspective enhancement: global sub-blocks query local sub-blocks
G_glo_list = Crop_glo(F_glo)          # N sub-blocks, each flattened to [w_l*h_l, d]
L_glo_list = Crop_glo(F_loc)
V_glo_blocks = []
for i in range(N):
    Q = G_glo_list[i] @ W_q
    K = L_glo_list[i] @ W_k
    Vp = L_glo_list[i] @ W_v
    A = Softmax((Q @ K.T) / sqrt(d))  # [w_l*h_l, w_l*h_l] attention map
    V_glo_blocks.append(A @ Vp)
V_glo = Recombine_glo(V_glo_blocks)   # stitch back to [w_h, h_h, d]

# Local-perspective enhancement: local sub-blocks query global sub-blocks
G_loc_list = Crop_loc(F_glo)
L_loc_list = Crop_loc(F_loc)
V_loc_blocks = []
for i in range(N):
    Q = L_loc_list[i] @ W_q
    K = G_loc_list[i] @ W_k
    Vp = G_loc_list[i] @ W_v
    A = Softmax((Q @ K.T) / sqrt(d))
    V_loc_blocks.append(A @ Vp)
V_loc = Recombine_loc(V_loc_blocks)

# Projection, channel-wise fusion, and tokenization
Vg_emb = V_glo @ W_glo                # [w_h, h_h, d//2]
Vl_emb = V_loc @ W_loc
V_dual = Concat(Vg_emb, Vl_emb, axis=channel)
V_out = AvgPool_patch(V_dual)
Tokens = Flatten(V_out)

# Cross-modal CRME (PCNet) forward pass, pseudocode
def CRME(F_T_in, F_O_in):
    F_T_hat = HTL(F_T_in)                 # [H, W, C]
    F_O_hat = HTL(F_O_in)                 # [2H, 2W, C]
    F_O_down = Conv3x3_stride2(F_O_hat)   # P_down: HR optical -> [H, W, C]
    F_T_up = UpsampleConv(F_T_hat)        # P_up: LR thermal -> [2H, 2W, C]
    # SR branch: thermal queries optical
    Q_T = Linear1x1(F_T_hat)
    K_O = Linear1x1(F_O_down)
    V_O = Linear1x1(F_O_down)
    attn_T = Softmax(matmul(Q_T, transpose(K_O)) / sqrt(d_k))
    F_T_tilde = F_T_hat + matmul(attn_T, V_O)   # residual preserves input
    # MC branch: optical queries thermal
    Q_O = Linear1x1(F_O_hat)
    K_T = Linear1x1(F_T_up)
    V_T = Linear1x1(F_T_up)
    attn_O = Softmax(matmul(Q_O, transpose(K_T)) / sqrt(d_k))
    F_O_tilde = F_O_hat + matmul(attn_O, V_T)
    return F_T_tilde, F_O_tilde

5. Design Choices, Hyperparameters, and Integration

Key module hyperparameters are task- and model-specific. Example instantiations:

| Setting | Channel dim $d$ / $C$ | Heads | Sub-block / Window | Projection layers |
| --- | --- | --- | --- | --- |
| DEM/CRME (Ma et al., 2024) | 1024 | 1 | $24\times24$ sub-blocks | $W_q, W_k, W_v$: $1024\times1024$; $W^{\mathrm{glo}}, W^{\mathrm{loc}}$: $1024\times512$ |
| PCNet CRME (Zhao et al., 7 Jan 2026) | 96 | 6 | window 8 (HTL) | $W_{Q,K,V}^{T,O}$: $1\times1$ Conv ($C\rightarrow d_k$) |
  • Pooling: In DEM/CRME, a final average pool produces the visual tokens; PCNet instead cascades CRMEs across multiple stages.
  • Integration: In PCNet, CRMEs are followed by physics-driven modules imposing Laplacian diffusion constraints.
  • Normalization: No explicit LayerNorm or dropout is specified; self-correlation in HTL provides implicit feature stabilization.
  • Residual connections: Applied at key enhancement stage outputs to preserve input information.

6. Empirical Effectiveness and Significance

The mutual cross-attention and dual-path architecture of CRMEs has been empirically shown to yield substantial gains on both multimodal language benchmarks and cross-modal super-resolution tasks:

  • High-resolution MLLMs (Ma et al., 2024): On LLaVA-wild, DEM/CRME with both global and local enhancements outperformed single-path ablations by +4.6 points (67.5 vs. 62.9/62.7). Fusion by simple addition of features achieved only 60.1, verifying that blockwise cross-attention and explicit channel fusion are critical.
  • Optics-guided thermal SR (Zhao et al., 7 Jan 2026): Preserving and adaptively projecting HR optical priors into LR thermal features via CRME produced sharper textures and fewer artifacts than zero-shot alignment or naive resizing, as measured by reconstruction, segmentation, and detection metrics.

A plausible implication is that the two-way, resolution-adaptive mutual enhancement in CRMEs makes them generally applicable beyond vision-language and cross-modal SR, whenever multi-scale or multi-modal alignment and feature enrichment are required with strict spatial correspondence.

CRMEs represent a departure from previous cropping-only or dual-encoder merger approaches:

  • Prior cropping-only methods: Processed windowed patches independently, losing global context and failing to enable mutual enhancement between neighboring sub-blocks.
  • Dual-encoder fusion: Required heavy, separate encoders for each feature stream, with fusion often occurring at a single scale or via parameter-intensive backbone sharing.
  • CRME advances: Employs symmetric re-cropping, local cross-attention confined to paired sub-blocks, and projection-fusion steps that preserve both efficiency and fine-grained alignment.

This has positioned CRME as a preferred module for tasks involving data with mismatched resolutions or modalities, subject to constraints on memory and computation and the demand for detailed, contextually grounded predictions.


References:

  • INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal LLM (Ma et al., 2024)
  • Physics-Constrained Cross-Resolution Enhancement Network for Optics-Guided Thermal UAV Image Super-Resolution (Zhao et al., 7 Jan 2026)
