SLGNet: Language-Guided Multimodal Detector

Updated 12 January 2026

SLGNet is a parameter-efficient multimodal object detection framework that integrates hierarchical structural priors and language-guided modulation with a frozen Vision Transformer backbone.
Its Structure-Aware Adapter fuses multi-scale RGB–IR features via sparse deformable attention, enhancing robustness to domain gaps and environmental variations.
Language-Guided Modulation leverages scene captions to recalibrate visual features, achieving state-of-the-art performance on RGB–IR benchmarks with greatly reduced parameters.

SLGNet is a parameter-efficient multimodal object detection framework that integrates hierarchical structural priors and language-guided modulation with a frozen Vision Transformer (ViT) backbone. It targets object detection on paired RGB–Infrared (IR) imagery, addressing domain gaps and environmental variations while keeping the number of trainable parameters small. Its two key components are the Structure-Aware Adapter (SA-Adapter), which supplies explicit multi-scale structural cues from both modalities to compensate for limitations of the ViT backbone, and Language-Guided Modulation (LGM), which uses structured captions from a frozen vision–LLM to recalibrate visual features according to scene context. SLGNet reports state-of-the-art results on multiple RGB–IR benchmarks while training far fewer parameters than full fine-tuning (Xiang et al., 5 Jan 2026).

1. Architectural Components and Data Flow

SLGNet utilizes a frozen ViT-Base backbone (e.g., DINOv2-pretrained), augmented by two modular additions:

Vision Transformer Backbone: Processes RGB images into patch embeddings, running through 12 transformer stages at a fixed spatial stride of 1/16.
Structure-Aware Adapter (SA-Adapter): Composed of an S-Encoder extracting multi-scale, edge- and contour-oriented features from both RGB and IR images, followed by the FF-Adapter, which injects these priors into the ViT token stream via a sparse deformable-attention mechanism.
Language-Guided Modulation (LGM): Employs a vision–LLM (Qwen2.5-VL) to generate structured scene captions in four semantic fields—Environment, Scene Type, Object Density, Thermal Signature. These are encoded with CLIP-Text, fused with an MLP, and distilled into channel-wise scale ( $\gamma$ ) and shift ( $\beta$ ) vectors for affine feature recalibration.

Data Flow Summary:

Modality/Input	Pathway	Output/Fusion
RGB	ViT Patch Embedding → ViT + FF-Adapter (injects $\{F_{f,l}\}$ )	$F_{\text{vit}}$
IR	S-Encoder (multi-scale structure extraction only)	Features $F_{t,l}$ , edges
RGB + IR	Qwen2.5-VL (structured caption) → CLIP-Text → fusion → MLP → $(\gamma, \beta)$	$F_{t}^{\mathrm{sem}}$
$F_{\text{vit}}$ , $F_{t}^{\mathrm{sem}}$	Affine modulation: $F_{\text{vit}}^{\text{guided}} = (1+\gamma) \odot F_{\text{vit}} + \beta$	Detection head

This architecture allows SLGNet to synergize multi-scale spatial structure with semantic scene context, enabling resilience in challenging environments (Xiang et al., 5 Jan 2026).
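The spatial bookkeeping behind this data flow can be sketched in a few lines. The snippet below is only an illustration: the 512 x 640 input size and the helper names `feature_grid` and `slgnet_shapes` are assumptions, not values or functions from the source. It uses the stated ViT stride of 1/16, the three S-Encoder scales (1/8, 1/16, 1/32), and the token dimension of 768.

```python
def feature_grid(height, width, stride):
    """Return the (rows, cols) of a feature map at the given stride."""
    if height % stride or width % stride:
        raise ValueError("input size must be divisible by the stride")
    return height // stride, width // stride


def slgnet_shapes(height, width, token_dim=768):
    """Token counts for the ViT stream and the three S-Encoder scales."""
    rows, cols = feature_grid(height, width, 16)  # ViT stride 1/16
    shapes = {"vit_tokens": (rows * cols, token_dim)}
    # S-Encoder scales l = 1, 2, 3 at 1/8, 1/16, 1/32 resolution,
    # after the 1x1 projection to the ViT token dimension.
    for level, stride in enumerate((8, 16, 32), start=1):
        rows, cols = feature_grid(height, width, stride)
        shapes[f"prior_l{level}"] = (rows * cols, token_dim)
    return shapes


# Hypothetical 512 x 640 input (not a resolution stated in the source).
shapes = slgnet_shapes(512, 640)
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumptions the ViT stream and the middle prior scale share the same grid, while the finest scale carries four times as many positions and the coarsest a quarter as many.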

2. Structure-Aware Adapter: Multi-Scale Fusion and Injection

The SA-Adapter comprises two mechanisms:

Structure Encoder (S-Encoder):

For each scale $l \in \{1,2,3\}$ (corresponding to 1/8, 1/16, 1/32 input resolution), stem features are hierarchically extracted from RGB and IR via sequential convolutions of increasing kernel size ( $k_1=3$ , $k_2=5$ , $k_3=7$ ):

$F_{v,l} = \mathrm{Conv}_{k_l}(\ldots \mathrm{Conv}_{k_1}(I_v) \ldots)$
$F_{t,l} = \mathrm{Conv}_{k_l}(\ldots \mathrm{Conv}_{k_1}(I_t) \ldots)$

A Sobel operator produces edge maps $\nabla F_{v,l}$ and $\nabla F_{t,l}$ . Their element-wise maximum forms a reference edge structure: $\nabla_{\text{ref}} = \max(\nabla F_{v,l}, \nabla F_{t,l})$ . An SSIM-like similarity is then computed between each modality's edge map and the reference, yielding alignment weights $M_v$ and $M_t$ after sigmoid activation. The structural prior is $F_{f,l} = M_v \cdot F_{v,l} + M_t \cdot F_{t,l}$ , which is then projected to the ViT token dimension via $1 \times 1$ convolutions.
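The edge-guided weighting for a single scale can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it works on single-channel 2D lists, uses replicated borders for the Sobel filter, and computes one global SSIM score per modality instead of a spatial similarity map. All function names are hypothetical; the constants $k_1=0.01$ and $k_2=0.03$ follow the usual SSIM convention.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2D list, with replicated borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    v = img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                    gx += SOBEL_X[dy + 1][dx + 1] * v
                    gy += SOBEL_Y[dy + 1][dx + 1] * v
            out[y][x] = math.hypot(gx, gy)
    return out


def ssim_like(a, b, k1=0.01, k2=0.03, dynamic_range=1.0):
    """Global SSIM score between two maps (one scalar, no sliding window)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((p - mx) * (p - mx) for p in xs) / n
    vy = sum((q - my) * (q - my) for q in ys) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(xs, ys)) / n
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    return numerator / ((mx * mx + my * my + c1) * (vx + vy + c2))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_scale(f_v, f_t):
    """Return (M_v, M_t, F_f) for one scale from RGB and IR feature maps."""
    g_v, g_t = sobel_magnitude(f_v), sobel_magnitude(f_t)
    g_ref = [[max(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(g_v, g_t)]
    m_v = sigmoid(ssim_like(g_v, g_ref))
    m_t = sigmoid(ssim_like(g_t, g_ref))
    fused = [[m_v * a + m_t * b for a, b in zip(ra, rb)] for ra, rb in zip(f_v, f_t)]
    return m_v, m_t, fused


# Toy night scene: RGB is flat (no visible structure), IR has a vertical edge.
rgb = [[0.1] * 6 for _ in range(6)]
ir = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
m_v, m_t, fused = fuse_scale(rgb, ir)
print(round(m_v, 3), round(m_t, 3))
```

In this toy case the IR edge map coincides with the reference, so the IR weight comes out larger than the RGB weight. Because a sigmoid applied to a score in $[-1, 1]$ stays within roughly 0.27 to 0.73, neither modality is fully suppressed in this sketch.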

Feature Fusion Adapter (FF-Adapter):

At each ViT transformer stage $i$ , the structural priors are injected into the token stream via sparse deformable-style cross-level attention:

$\hat{F}^{(i)}_{\text{vit}} = F^{(i)}_{\text{vit}} + \operatorname{Attn}_{\text{sparse}}(F^{(i)}_{\text{vit}}, \{F^{(i)}_{f,l}\}_{l=1}^{3})$

The attention, for each query token at normalized location $p_q$ , aggregates $K=4$ sampling points per scale, adapted by learnable offsets. The priors evolve across layers using an MLP: $F^{(i)}_{f,l} = \operatorname{MLP}_{\text{stage}}(F^{(i-1)}_{f,l})$ . This mechanism preserves domain-invariant edge and contour information across modalities throughout the token hierarchy (Xiang et al., 5 Jan 2026).
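The sampling step of such a sparse deformable read can be illustrated with a small sketch. It is a simplification under several assumptions: feature maps are single-channel 2D lists, the offsets and attention logits are passed in directly (in a real module they would be predicted from the query token by learned layers), and the weights are normalised jointly over all scales and sampling points. The function names are hypothetical.

```python
import math


def bilinear_sample(fmap, x, y):
    """Sample a 2D list at normalized coordinates (x, y) in [0, 1]."""
    h, w = len(fmap), len(fmap[0])
    px = min(max(x, 0.0), 1.0) * (w - 1)
    py = min(max(y, 0.0), 1.0) * (h - 1)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = px - x0, py - y0
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_deformable_read(p_q, scales, offsets, logits):
    """Weighted sum of K sampled values per scale around reference point p_q.

    scales:  list of 2D feature maps, one per scale
    offsets: offsets[l][k] = (dx, dy) in normalized coordinates
    logits:  logits[l][k], unnormalized attention scores
    """
    weights = softmax([v for per_scale in logits for v in per_scale])
    out, idx = 0.0, 0
    for fmap, scale_offsets in zip(scales, offsets):
        for dx, dy in scale_offsets:
            out += weights[idx] * bilinear_sample(fmap, p_q[0] + dx, p_q[1] + dy)
            idx += 1
    return out


# Three toy scales with a constant value, K = 4 sampling points per scale.
scales = [[[2.0] * n for _ in range(n)] for n in (8, 4, 2)]
offsets = [[(0.05, 0.0), (-0.05, 0.0), (0.0, 0.05), (0.0, -0.05)]] * 3
logits = [[0.0, 0.5, -0.5, 1.0]] * 3
value = sparse_deformable_read((0.5, 0.5), scales, offsets, logits)
print(value)
```

Each query reads only $3 \times K = 12$ positions instead of attending to every location at every scale, which is where the sparsity comes from. The residual update in the equation above would then add this value (per channel) back to the token.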

3. Language-Guided Modulation (LGM): Semantic Feature Calibration

The LGM module introduces environmental awareness through structured natural-language context. A frozen vision–LLM processes an aligned RGB–IR pair and generates a caption with four fields $s_i$ , $i \in \{\text{env, type, obj, therm}\}$ . Each field is embedded with the CLIP text encoder, giving $F_{t,i} \in \mathbb{R}^{L \times d}$ with $L=77$ and $d=768$ (here the subscript $t$ denotes text, not the thermal modality of Section 2).

These embeddings are concatenated and fused via a small MLP, pooled, and projected into channel-wise scale $\gamma$ and shift $\beta$ vectors for modulation of the final ViT feature map:

$F^{\mathrm{guided}}_{\mathrm{vit}} = (1 + \gamma) \odot F_{\mathrm{vit}} + \beta$

This approach enables the model to recalibrate representation statistics based on high-level scene semantics, improving robustness to complex or changing environmental conditions (Xiang et al., 5 Jan 2026).
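A minimal sketch of the pooling and affine recalibration steps is shown below. The toy sizes, the example $\gamma$ and $\beta$ values, and the helper names `mean_pool` and `modulate` are assumptions for illustration; the learned MLPs that would produce $\gamma$ and $\beta$ from the pooled text embedding are omitted.

```python
def mean_pool(token_embeddings):
    """Average a list of token vectors (L x d) into one vector of length d."""
    n = len(token_embeddings)
    return [sum(col) / n for col in zip(*token_embeddings)]


def modulate(features, gamma, beta):
    """Channel-wise affine recalibration: (1 + gamma) * f + beta."""
    channels = len(features[0])
    if len(gamma) != channels or len(beta) != channels:
        raise ValueError("gamma and beta need one entry per channel")
    return [
        [(1.0 + g) * f + b for f, g, b in zip(token, gamma, beta)]
        for token in features
    ]


# Two visual tokens with three channels each (toy sizes, not the real 768).
f_vit = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Zero gamma and beta leave the features untouched.
identity = modulate(f_vit, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# Made-up modulation values: boost channel 0, shift channel 1, damp channel 2.
guided = modulate(f_vit, [0.5, 0.0, -0.5], [0.0, 1.0, 0.0])
print(guided)
```

The $(1+\gamma)$ form means that zero outputs from the language branch reduce the modulation to the identity, so the visual features pass through unchanged. This is a property of the formula itself and not a claim about how the authors initialise the module.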

4. Forward Computation Workflow

The forward pass of SLGNet is as follows:

function SLGNet_Forward(I_rgb, I_ir):
    # 1. Structure-Aware Adapter (multi-scale priors)
    for l in {1,2,3}:
        F_v[l] = Conv_k_l_chain(I_rgb)
        F_t[l] = Conv_k_l_chain(I_ir)
        G_v = Sobel(F_v[l]); G_t = Sobel(F_t[l])
        G_ref = max(G_v, G_t)
        M_v′ = SSIM_like(G_v, G_ref); M_t′ = SSIM_like(G_t, G_ref)
        M_v = sigmoid(M_v′); M_t = sigmoid(M_t′)
        F_f[l] = M_v * F_v[l] + M_t * F_t[l]
        F_p[l] = Conv1×1_proj_to_D(F_f[l])
    # 2. ViT Backbone + FF-Adapter
    tokens = ViT.PatchEmbed(I_rgb)
    for i in 1..12:
        tokens = ViT.Block[i].SelfAttention(tokens)
        tokens = tokens + FF_Adapter(tokens, {F_p[1..3] at stage i})
        tokens = ViT.Block[i].MLP(tokens)
    F_vit = Reshape(tokens)
    # 3. Language-Guided Modulation
    captions = Qwen2.5_VL(I_rgb, I_ir)
    for field in {env, type, obj, therm}:
        F_t[field] = CLIP_Text(captions[field])
    F_t_sem = MLP_proj(concat(F_t[env], F_t[type], F_t[obj], F_t[therm]))
    pooled = MeanPool(F_t_sem)
    γ = MLP_γ(pooled); β = MLP_β(pooled)
    F_mod = (1+γ) ⊙ F_vit + β
    # 4. Detection Head
    outputs = DetectorHead(F_mod)
    return outputs

This structured sequence maintains multimodal feature alignment and injects contextual modulation late in the processing pipeline (Xiang et al., 5 Jan 2026).

5. Hyperparameters, Design Choices, and Stability

Key configuration values include:

Component	Parameterization	Values
ViT token dimension	$D$	768
S-Encoder channels	$C_1, C_2, C_3$	64, 128, 256 (for 1/8, 1/16, 1/32)
FF-Adapter: Sampling	$K$ (sampling points/query/scale)	4
FF-Adapter MLP (stage)	Hidden dimension	512
CLIP text embedding	$d$ (embedding dim), $L$ (token count)	$d=768$ , $L=77$
MLP for text fusion	Hidden/output	512, $(4d) \rightarrow d$
MLP for $\gamma$ , $\beta$	Hidden/output	512, $C$
SSIM constants	$k_1$ , $k_2$ , $L$ (dynamic range)	$k_1=0.01$ , $k_2=0.03$ , $L$ as in SSIM

The design favors training stability: the ViT backbone and the language models stay frozen, and only $\sim$ 12M adapter parameters are trained, which limits the risk of catastrophic forgetting and keeps adaptation efficient (Xiang et al., 5 Jan 2026).

6. Experimental Results and Comparative Analysis

Extensive evaluation across four RGB–IR object detection benchmarks demonstrates the efficacy of SLGNet:

LLVIP (Low-light, pedestrian): mAP 66.1, mAP $_{50}$ 98.3, 12.1M trainable parameters (87% fewer than full fine-tune).
FLIR (Driving, multiclass): mAP 45.1 (prev. SOTA 44.6), mAP $_{50}$ 85.8, 12.1M vs 244.6M parameters (DETR full-tune).
KAIST (Day/Night pedestrian, MR $^{-2}$ , lower is better): 19.88% (prev. best 23.74%). Day 21.01% (vs 23.95%), Night 20.56% (vs 19.42%).
DroneVehicle (UAV vehicle, oriented): mAP 80.7 (vs WaveMamba 79.8), largest gains in Freight-Car class (+0.9).

Ablations:

SA-Adapter alone yields +2.0 mAP (FLIR) and +1.3 mAP (DroneVehicle).
LGM adds +0.8 mAP (FLIR), +2.1 mAP (DroneVehicle).
Adapter-tuning vs. full-tuning: 87% fewer trainable parameters, +1.5 mAP (FLIR), +3.7 mAP (DroneVehicle), with improved convergence and stability.
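The parameter figures quoted in this section can be cross-checked with simple arithmetic. The snippet below only manipulates the numbers stated above (12.1M trainable, 244.6M for the full-tuning comparison on FLIR, and the 87% reduction); the helper names are hypothetical, and the interpretation in the note that follows is an inference from the arithmetic, not a statement from the source.

```python
def reduction(trainable_m, baseline_m):
    """Fraction of parameters saved relative to a baseline (both in millions)."""
    return 1.0 - trainable_m / baseline_m


def implied_baseline(trainable_m, reduction_fraction):
    """Baseline size (millions) implied by a stated reduction."""
    return trainable_m / (1.0 - reduction_fraction)


TRAINABLE_M = 12.1       # adapter-tunable parameters quoted above
FLIR_BASELINE_M = 244.6  # full-tuning figure quoted in the FLIR row

flir_reduction = reduction(TRAINABLE_M, FLIR_BASELINE_M)
baseline_for_87 = implied_baseline(TRAINABLE_M, 0.87)
print(f"{flir_reduction:.1%} {baseline_for_87:.1f}M")
```

Against the 244.6M figure the saving is about 95%, whereas an 87% saving corresponds to a baseline of roughly 93M parameters. The two numbers therefore appear to refer to different baselines, which is worth keeping in mind when comparing them.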

These results establish SLGNet as a state-of-the-art, scalable approach for multimodal perception under extreme lighting and thermal conditions, with robust generalization and efficiency (Xiang et al., 5 Jan 2026).

References (1)

Xiang et al. (2026). SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection.
