
CRME: Cross-Resolution Mutual Enhancement Module

Updated 14 January 2026
  • CRME is an architectural unit that enables bidirectional, spatially aligned feature exchange across different resolutions and modalities.
  • It employs scale- and modality-adaptive attention mechanisms with blockwise cross-attention to fuse global and local features efficiently.
  • CRME overcomes naive fusion drawbacks by preserving fine texture details and global context while reducing computational costs via subdivided processing.

A Cross-Resolution Mutual Enhancement Module (CRME) is an architectural unit designed to enable efficient, spatially consistent exchange of information between features at different spatial resolutions or from different modalities. CRMEs have emerged as key components for high-resolution vision–language tasks and cross-modal super-resolution, where maintaining both fine local structure and broad semantic context is critical. By leveraging scale- and modality-adaptive attention mechanisms, CRMEs avoid both the information loss of resolution mismatch and the prohibitive quadratic cost of full-resolution attention, achieving mutual contextualization and enhancement of dual feature streams with modest memory and computation.

1. Motivation and Core Objectives

Naive approaches to high-resolution and cross-modal fusion often downsample high-detail features or process sub-images independently, leading to loss of local detail, spatial misalignment, or fragmented global context. The primary objective of CRME is to enable two-way enhancement between feature streams of differing resolutions (e.g., global-context and local-detail features, or thermal and optical modalities) while preserving both spatial and semantic precision. CRME advances prior art by:

  • Allowing mutual interaction between local and global features (or LR/HR modality features) at matching positions.
  • Avoiding the full-resolution quadratic cost of direct cross-attention by restricting attention to subdivided sub-blocks or grid locations.
  • Maintaining full spatial alignment, enabling fine texture transfer and global semantic refinement in unified representations.
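
To make the cost argument concrete, here is a back-of-the-envelope comparison of full-resolution cross-attention against blockwise cross-attention over sub-blocks. The grid size and channel width below are illustrative choices, not values from the papers; only the $24\times24$ sub-block side follows the DEM/CRME setting.

```python
# Rough multiply-add count for the attention score/value matmuls: attention
# over T tokens of width d costs on the order of T^2 * d operations, so
# restricting attention to N independent sub-blocks divides the cost by N.
def attn_flops(tokens, d):
    return tokens * tokens * d

W = H = 96                          # illustrative full-resolution grid
d = 1024                            # channel width
block = 24                          # sub-block side (24x24 windows)
N = (W // block) * (H // block)     # number of sub-blocks

full_cost = attn_flops(W * H, d)
blockwise_cost = N * attn_flops(block * block, d)

print(f"full: {full_cost:.3e}, blockwise: {blockwise_cost:.3e}, "
      f"ratio: {full_cost / blockwise_cost:.0f}x")
```

The ratio equals the number of sub-blocks $N$: each block attends over $1/N$ of the tokens, and there are $N$ of them, so the quadratic term shrinks by a factor of $N$.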

Such functionality is critical in large multimodal LLMs—where global and local visual contexts must be simultaneously understood (Ma et al., 2024)—and in super-resolution settings, such as optics-guided UAV thermal image enhancement, where information from the HR optical domain must be judiciously transferred to the LR thermal channel (Zhao et al., 7 Jan 2026).

2. Canonical CRME Architectures

Two canonical instantiations of CRME have been recently formalized: the Dual-perspective Enhancement Module (DEM/CRME) in high-resolution MLLMs (Ma et al., 2024), and the cross-modal CRME in PCNet for UAV image super-resolution (Zhao et al., 7 Jan 2026).

Dual-Perspective CRME for High-Resolution MLLMs

  • Input: High-resolution feature maps for local detail ($\mathbf F^{\rm loc}\in \mathbb R^{w_h\times h_h\times d}$) and global context ($\mathbf F^{\rm glo}\in \mathbb R^{w_h\times h_h\times d}$), obtained via dual cropping and visual transformer encoding.
  • Sub-block Cropping: Both features are re-cropped via global- and local-perspective windows into $N$ sub-blocks of size $w_l\times h_l\times d$.
  • Blockwise Cross-Attention: For each of the $N$ blocks, cross-attention is computed between the aligned local and global sub-blocks via learned projections $W_q, W_k, W_v$ with single-head attention, yielding block-enhanced features.
  • Recombination and Fusion: Enhanced sub-blocks are stitched back to form spatially aligned full-resolution maps, projected down and concatenated channel-wise for dual-enhanced fusion. Optionally, an average pooling and flattening operation produces final tokens.
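
The cropping and recombination steps above can be implemented with reshape/transpose tricks. The sketch below is a minimal NumPy illustration of that bookkeeping with small illustrative sizes; the `crop`/`recombine` helpers are assumptions of this write-up, not names from the paper.

```python
import numpy as np

# Partition a [w_h, h_h, d] map into N non-overlapping [w_l, h_l, d]
# sub-blocks and stitch them back, losslessly, via reshape + transpose.
w_h = h_h = 8
d = 4
w_l = h_l = 4                                # sub-block side
F = np.arange(w_h * h_h * d, dtype=float).reshape(w_h, h_h, d)

def crop(F, w_l, h_l):
    w_h, h_h, d = F.shape
    return (F.reshape(w_h // w_l, w_l, h_h // h_l, h_l, d)
             .transpose(0, 2, 1, 3, 4)       # group block indices together
             .reshape(-1, w_l, h_l, d))      # [N, w_l, h_l, d]

def recombine(blocks, w_h, h_h):
    N, w_l, h_l, d = blocks.shape
    return (blocks.reshape(w_h // w_l, h_h // h_l, w_l, h_l, d)
                  .transpose(0, 2, 1, 3, 4)  # interleave blocks back
                  .reshape(w_h, h_h, d))

blocks = crop(F, w_l, h_l)
assert blocks.shape == (4, w_l, h_l, d)      # N = 4 sub-blocks here
assert np.array_equal(recombine(blocks, w_h, h_h), F)  # lossless round trip
```

Because the round trip is exact, enhanced sub-blocks can be stitched back into a spatially aligned full-resolution map without interpolation artifacts.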

Cross-Resolution Mutual Enhancement in Cross-Modal PCNet

  • Input: Thermal features $F_T^{(\mathrm{in})}\in \mathbb{R}^{H\times W\times C}$ (LR) and co-registered optical features $F_O^{(\mathrm{in})}\in \mathbb{R}^{2H \times 2W \times C}$ (HR).
  • Hierarchical Context Encoding: Each stream is processed by a Hierarchical Transformer Layer (HTL).
  • Scale-Adaptive Projections: HR optical is down-projected to LR ($\mathcal P_\downarrow$, a learnable $3\times3$ Conv with stride 2); LR thermal is up-projected to HR ($\mathcal P_\uparrow$, a $3\times3$ ConvTranspose or PixelShuffle+Conv).
  • Mutual Attention: SR branch (thermal queries optical): attention from HTL-encoded thermal tokens to downsampled optical. MC branch (optical queries thermal): attention from optical to upsampled thermal.
  • Output: Enhanced thermal features $\tilde F_T$ and enhanced optical features $\tilde F_O$, spatially consistent and ready for subsequent physics-informed fusion or further HTL processing.
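
The shape bookkeeping of the two scale-adaptive projections can be checked with a small NumPy sketch. Nearest-neighbour resampling stands in for the learnable $\mathcal P_\downarrow$ (3×3 Conv, stride 2) and $\mathcal P_\uparrow$ (ConvTranspose/PixelShuffle); only the spatial arithmetic here reflects the module, not its learned weights.

```python
import numpy as np

# Illustrative sizes; C = 96 matches the PCNet channel width.
H, W, C = 8, 8, 96
F_T = np.random.randn(H, W, C)           # LR thermal features
F_O = np.random.randn(2 * H, 2 * W, C)   # HR optical features

def down2(x):                            # stand-in for P_down (stride-2 Conv)
    return x[::2, ::2, :]

def up2(x):                              # stand-in for P_up (ConvTranspose)
    return x.repeat(2, axis=0).repeat(2, axis=1)

F_O_bar = down2(F_O)                     # optical brought onto the LR grid
F_T_bar = up2(F_T)                       # thermal brought onto the HR grid

assert F_O_bar.shape == F_T.shape        # SR branch: K/V align with thermal Q
assert F_T_bar.shape == F_O.shape        # MC branch: K/V align with optical Q
```

The point of the projections is exactly these two shape matches: each branch's keys and values land on the same grid as its queries, so attention is computed between spatially corresponding positions.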

3. Formal Mathematical Definitions

Let $\{\mathbf G^{\rm glo}_i\}_{i=1}^N$ and $\{\mathbf L^{\rm glo}_i\}_{i=1}^N$ be the global-perspective sub-blocks of the global and local features, respectively (each $\in \mathbb R^{w_l\times h_l\times d}$). For each $i$:

$$\mathbf Q_i = \mathbf G^{\rm glo}_i W_q,\quad \mathbf K_i = \mathbf L^{\rm glo}_i W_k,\quad \mathbf V'_i = \mathbf L^{\rm glo}_i W_v$$

$$\mathbf A^{\rm glo}_i = \operatorname{Softmax}\!\left(\frac{\mathbf Q_i \mathbf K_i^\top}{\sqrt{d}}\right)$$

$$\mathbf V^{\rm glo}_i = \mathbf A^{\rm glo}_i \mathbf V'_i$$

Analogous operations are performed for the local-perspective enhancement. The $N$ enhanced sub-blocks are recombined, projected to reduced dimensionality via $W^{\mathrm{glo}}, W^{\mathrm{loc}}$, then concatenated.
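
A minimal NumPy sketch of one blockwise cross-attention step, following the equations above: the global sub-block supplies queries, the aligned local sub-block supplies keys and values. Dimensions are illustrative, not the 1024-d DEM/CRME setting.

```python
import numpy as np

rng = np.random.default_rng(0)
w_l, h_l, d = 4, 4, 32
T = w_l * h_l                                   # tokens per sub-block

G_i = rng.standard_normal((T, d))               # flattened global sub-block
L_i = rng.standard_normal((T, d))               # flattened local sub-block
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, Vp = G_i @ W_q, L_i @ W_k, L_i @ W_v      # learned projections
A = softmax(Q @ K.T / np.sqrt(d))               # [T, T] attention map
V_glo_i = A @ Vp                                # enhanced sub-block, [T, d]

assert V_glo_i.shape == (T, d)
assert np.allclose(A.sum(axis=-1), 1.0)         # rows are proper distributions
```

The local-perspective branch is the mirror image: swap the roles of `G_i` and `L_i` so local sub-blocks query global ones.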

For the cross-modal PCNet instantiation, define $\hat F_T = \mathrm{HTL}(F_T^{(\mathrm{in})})$ and $\hat F_O = \mathrm{HTL}(F_O^{(\mathrm{in})})$. Then:

$$\bar{F}_O = \mathcal{P}_\downarrow(\hat F_O), \qquad \bar{F}_T = \mathcal{P}_\uparrow(\hat F_T)$$

$$Q_T = \hat F_T W_Q^T, \qquad K_O = \bar{F}_O W_K^T, \qquad V_O = \bar{F}_O W_V^T$$

$$\tilde F_T = \hat F_T + \operatorname{Softmax}\!\left( \frac{Q_T K_O^\top}{\sqrt{d_k}} \right)V_O$$

$$Q_O = \hat F_O W_Q^O, \qquad K_T = \bar{F}_T W_K^O, \qquad V_T = \bar{F}_T W_V^O$$

$$\tilde F_O = \hat F_O + \operatorname{Softmax}\!\left( \frac{Q_O K_T^\top}{\sqrt{d_k}} \right)V_T$$

Residual connections ensure feature preservation and progressive enrichment.
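
The SR-branch update $\tilde F_T = \hat F_T + \mathrm{Softmax}(Q_T K_O^\top/\sqrt{d_k})\,V_O$ can be sketched in NumPy as follows. Nearest-neighbour downsampling stands in for $\mathcal P_\downarrow$, random matrices stand in for the $1\times1$ projections, and $d_k = C$ is chosen so the residual addition is well-typed without an extra output projection; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C = 6, 6, 16
d_k = C                                              # keeps the residual well-typed

F_T_hat = rng.standard_normal((H * W, C))            # LR thermal tokens
F_O_hat = rng.standard_normal((2 * H, 2 * W, C))     # HR optical map
F_O_bar = F_O_hat[::2, ::2, :].reshape(H * W, C)     # stand-in for P_down

W_Q, W_K, W_V = (rng.standard_normal((C, d_k)) / np.sqrt(C) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q_T, K_O, V_O = F_T_hat @ W_Q, F_O_bar @ W_K, F_O_bar @ W_V
F_T_tilde = F_T_hat + softmax(Q_T @ K_O.T / np.sqrt(d_k)) @ V_O   # residual update

assert F_T_tilde.shape == F_T_hat.shape              # enhancement preserves shape
```

The MC branch is symmetric: optical tokens query the up-projected thermal features, with its own projection matrices.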

4. Layerwise and Forward-Pass Implementations

# Dual-perspective CRME (DEM) forward pass, pseudocode
Inputs: F_loc [w_h, h_h, d], F_glo [w_h, h_h, d]
Parameters: W_q, W_k, W_v [d, d]; W_glo, W_loc [d, d//2]

# Global-perspective enhancement: global sub-blocks query local sub-blocks
G_glo_list = Crop_glo(F_glo)          # N sub-blocks, each flattened to [w_l*h_l, d]
L_glo_list = Crop_glo(F_loc)
V_glo_blocks = []
for i in range(N):
    Q = G_glo_list[i] @ W_q
    K = L_glo_list[i] @ W_k
    Vp = L_glo_list[i] @ W_v
    A = Softmax((Q @ K.T) / sqrt(d))  # [w_l*h_l, w_l*h_l] attention map
    V_glo_blocks.append(A @ Vp)
V_glo = Recombine_glo(V_glo_blocks)   # stitch back to [w_h, h_h, d]

# Local-perspective enhancement: local sub-blocks query global sub-blocks
G_loc_list = Crop_loc(F_glo)
L_loc_list = Crop_loc(F_loc)
V_loc_blocks = []
for i in range(N):
    Q = L_loc_list[i] @ W_q
    K = G_loc_list[i] @ W_k
    Vp = G_loc_list[i] @ W_v
    A = Softmax((Q @ K.T) / sqrt(d))
    V_loc_blocks.append(A @ Vp)
V_loc = Recombine_loc(V_loc_blocks)

# Projection, channel-wise fusion, and tokenization
Vg_emb = V_glo @ W_glo                # [w_h, h_h, d//2]
Vl_emb = V_loc @ W_loc
V_dual = Concat(Vg_emb, Vl_emb, axis=channel)
V_out = AvgPool_patch(V_dual)
Tokens = Flatten(V_out)

# Cross-modal CRME (PCNet) forward pass, pseudocode
def CRME(F_T_in, F_O_in):
    F_T_hat = HTL(F_T_in)                 # [H, W, C]
    F_O_hat = HTL(F_O_in)                 # [2H, 2W, C]
    F_O_down = Conv3x3_stride2(F_O_hat)   # P_down: HR optical -> [H, W, C]
    F_T_up = UpsampleConv(F_T_hat)        # P_up: LR thermal -> [2H, 2W, C]
    # SR branch: thermal queries optical
    Q_T = Linear1x1(F_T_hat)
    K_O = Linear1x1(F_O_down)
    V_O = Linear1x1(F_O_down)
    attn_T = Softmax(matmul(Q_T, transpose(K_O)) / sqrt(d_k))
    F_T_tilde = F_T_hat + matmul(attn_T, V_O)   # residual preserves input
    # MC branch: optical queries thermal
    Q_O = Linear1x1(F_O_hat)
    K_T = Linear1x1(F_T_up)
    V_T = Linear1x1(F_T_up)
    attn_O = Softmax(matmul(Q_O, transpose(K_T)) / sqrt(d_k))
    F_O_tilde = F_O_hat + matmul(attn_O, V_T)
    return F_T_tilde, F_O_tilde

5. Design Choices, Hyperparameters, and Integration

Key module hyperparameters are task- and model-specific. Example instantiations:

| Setting | Channel dim $d$ / $C$ | Heads | Sub-block / Window | Projection layers |
| --- | --- | --- | --- | --- |
| DEM/CRME (Ma et al., 2024) | 1024 | 1 | $24\times24$ sub-blocks | $W_q, W_k, W_v$: $1024\times1024$; $W^{\mathrm{glo}}, W^{\mathrm{loc}}$: $1024\times512$ |
| PCNet CRME (Zhao et al., 7 Jan 2026) | 96 | 6 | window 8 (HTL) | $W_{Q,K,V}^{T,O}$: $1\times1$ Conv ($C\rightarrow d_k$) |
  • Pooling: In DEM/CRME, a final average pool produces the visual tokens; PCNet instead cascades CRMEs across multiple stages.
  • Integration: In PCNet, CRMEs are followed by physics-driven modules imposing Laplacian diffusion constraints.
  • Normalization: No explicit LayerNorm or dropout is specified; self-correlation in HTL provides implicit feature stabilization.
  • Residual connections: Applied at key enhancement stage outputs to preserve input information.

6. Empirical Effectiveness and Significance

The mutual cross-attention and dual-path architecture of CRMEs has been empirically shown to yield substantial gains on both multimodal language benchmarks and cross-modal super-resolution tasks:

  • High-resolution MLLMs (Ma et al., 2024): On LLaVA-wild, DEM/CRME with both global and local enhancements outperformed single-path ablations by +4.6 points (67.5 vs. 62.9/62.7). Fusion by simple addition of features achieved only 60.1, verifying that blockwise cross-attention and explicit channel fusion are critical.
  • Optics-guided thermal SR (Zhao et al., 7 Jan 2026): Preserving and adaptively projecting HR optical priors into LR thermal features via CRME produced sharper textures and fewer artifacts than zero-shot alignment or naive resizing, as measured by reconstruction, segmentation, and detection metrics.

A plausible implication is that the two-way, resolution-adaptive mutual enhancement in CRMEs makes them generally applicable beyond vision-language and cross-modal SR, whenever multi-scale or multi-modal alignment and feature enrichment are required with strict spatial correspondence.

CRMEs represent a departure from previous cropping-only or dual-encoder merger approaches:

  • Prior cropping-only methods: Processed windowed patches independently, losing global context and failing to enable mutual enhancement between neighboring sub-blocks.
  • Dual-encoder fusion: Required heavy, separate encoders for each feature stream, with fusion often occurring at a single scale or via parameter-intensive backbone sharing.
  • CRME advances: Employs symmetric re-cropping, local cross-attention confined to paired sub-blocks, and projection-fusion steps that preserve both efficiency and fine-grained alignment.

This has positioned CRME as a preferred module for tasks involving data with mismatched resolutions or modalities, subject to constraints on memory and computation and the demand for detailed, contextually grounded predictions.


References:

  • INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal LLM (Ma et al., 2024)
  • Physics-Constrained Cross-Resolution Enhancement Network for Optics-Guided Thermal UAV Image Super-Resolution (Zhao et al., 7 Jan 2026)
