SLGNet: Language-Guided Multimodal Detector
- SLGNet is a parameter-efficient multimodal object detection framework that integrates hierarchical structural priors and language-guided modulation with a frozen Vision Transformer backbone.
- Its Structure-Aware Adapter fuses multi-scale RGB–IR features via sparse deformable attention, enhancing robustness to domain gaps and environmental variations.
- Language-Guided Modulation leverages scene captions to recalibrate visual features, achieving state-of-the-art performance on RGB–IR benchmarks with greatly reduced parameters.
SLGNet is a parameter-efficient multimodal object detection framework that integrates hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT) backbone. It is designed to detect objects robustly in RGB–Infrared (IR) scenarios, addressing domain gaps and environmental variations while keeping the number of trainable parameters small. Its two key components are the Structure-Aware Adapter (SA-Adapter), which supplies explicit multi-scale structural cues from both modalities to compensate for the limitations of the ViT backbone, and Language-Guided Modulation (LGM), which uses structured captions from a frozen vision–language model to recalibrate visual features dynamically according to scene context. SLGNet reports state-of-the-art results on multiple RGB–IR benchmarks while requiring far fewer trainable parameters than traditional full fine-tuning (Xiang et al., 5 Jan 2026).
1. Architectural Components and Data Flow
SLGNet utilizes a frozen ViT-Base backbone (e.g., DINOv2-pretrained), augmented by two modular additions:
- Vision Transformer Backbone: Processes RGB images into patch embeddings, running through 12 transformer stages at a fixed spatial stride of 1/16.
- Structure-Aware Adapter (SA-Adapter): Composed of an S-Encoder extracting multi-scale, edge- and contour-oriented features from both RGB and IR images, followed by the FF-Adapter, which injects these priors into the ViT token stream via a sparse deformable-attention mechanism.
- Language-Guided Modulation (LGM): Employs a vision–language model (Qwen2.5-VL) to generate structured scene captions in four semantic fields: Environment, Scene Type, Object Density, and Thermal Signature. These are encoded with CLIP-Text, fused with an MLP, and distilled into channel-wise scale ($\gamma$) and shift ($\beta$) vectors for affine feature recalibration.
Data Flow Summary:
| Modality/Input | Pathway | Output/Fusion |
|---|---|---|
| RGB | ViT Patch Embedding → ViT + FF-Adapter (injects $\{F_p^l\}$) | $F_{vit}$ |
| IR | S-Encoder (multi-scale structure extraction only) | Features $F_t^l$, edges $G_t^l$ |
| RGB + IR | Qwen2.5-VL (structured caption) → CLIP-Text → fusion → MLP | $\gamma$, $\beta$ |
| $F_{vit}$; $\gamma$, $\beta$ | Affine modulation: $F_{mod} = (1 + \gamma) \odot F_{vit} + \beta$ | Detection head |
This architecture allows SLGNet to synergize multi-scale spatial structure with semantic scene context, enabling resilience in challenging environments (Xiang et al., 5 Jan 2026).
2. Structure-Aware Adapter: Multi-Scale Fusion and Injection
The SA-Adapter comprises two mechanisms:
Structure Encoder (S-Encoder):
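For each scale $l \in \{1, 2, 3\}$ (corresponding to 1/8, 1/16, and 1/32 of the input resolution), stem features are extracted hierarchically from the RGB and IR images via sequential convolutions of increasing kernel size $k_l$: $F_v^l = \mathrm{Conv}_{k_l}(I_{rgb})$ and $F_t^l = \mathrm{Conv}_{k_l}(I_{ir})$.
A Sobel operator produces the edge maps $G_v^l = \mathrm{Sobel}(F_v^l)$ and $G_t^l = \mathrm{Sobel}(F_t^l)$. These are fused into a reference edge structure $G_{ref}^l = \max(G_v^l, G_t^l)$ (element-wise). An SSIM-like similarity is then computed between each modality and the reference, yielding the alignment weights $M_v^l = \sigma(\mathrm{SSIM}(G_v^l, G_{ref}^l))$ and $M_t^l = \sigma(\mathrm{SSIM}(G_t^l, G_{ref}^l))$ after sigmoid activation. The structural prior is $F_f^l = M_v^l \odot F_v^l + M_t^l \odot F_t^l$, which is projected to the ViT token dimension $D$ via $1 \times 1$ convolutions to give $F_p^l$.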
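As a rough illustration of the S-Encoder fusion at a single scale, the NumPy sketch below chains the steps described above: Sobel edge maps, an element-wise maximum as the reference, similarity weights passed through a sigmoid, and a weighted sum of the two modalities. It works on single-channel maps, and the pointwise `ssim_like` formula, its constant `c`, and all function names are assumptions made for illustration; the paper's exact similarity measure is not reproduced here.

```python
import numpy as np

def sobel_magnitude(x):
    """Gradient magnitude of a 2-D map using 3x3 Sobel kernels (edge-padded)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = x.shape
    p = np.pad(x, 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = p[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.sqrt(gx ** 2 + gy ** 2)

def ssim_like(g, g_ref, c=1e-4):
    """Assumed pointwise similarity: 1 where g == g_ref, near 0 where they differ strongly."""
    return (2.0 * g * g_ref + c) / (g ** 2 + g_ref ** 2 + c)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_scale(f_v, f_t):
    """Fuse RGB (f_v) and IR (f_t) feature maps of one scale into a structural prior."""
    g_v, g_t = sobel_magnitude(f_v), sobel_magnitude(f_t)
    g_ref = np.maximum(g_v, g_t)            # element-wise reference edge structure
    m_v = sigmoid(ssim_like(g_v, g_ref))    # alignment weight for RGB
    m_t = sigmoid(ssim_like(g_t, g_ref))    # alignment weight for IR
    return m_v * f_v + m_t * f_t, m_v, m_t

# Toy input: RGB contains a vertical step edge, IR is flat.
f_v = np.zeros((8, 8))
f_v[:, 4:] = 1.0
f_t = np.zeros((8, 8))
fused, m_v, m_t = fuse_scale(f_v, f_t)
```

In this toy case the RGB map carries the only edge, so its weight exceeds the IR weight along the edge columns, while both weights coincide in flat regions.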
Feature Fusion Adapter (FF-Adapter):
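At each ViT transformer stage $i$, the structural priors are injected into the token stream as a residual update computed by sparse deformable-style cross-level attention: $X_i \leftarrow X_i + \mathrm{FFAdapter}(X_i, \{F_p^l\}_{l=1}^{3})$, where $X_i$ denotes the tokens at stage $i$.
For each query token, the attention starts from the token's normalized reference location and aggregates a small number of sampling points per scale (4 in the reported configuration), whose positions are adapted by learnable offsets. The priors themselves are updated from stage to stage by an MLP. This mechanism preserves domain-invariant edge and contour information across modalities throughout the token hierarchy (Xiang et al., 5 Jan 2026).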
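The sampling step can be sketched for a single query token as follows. This is a simplified, single-head illustration in NumPy rather than the paper's implementation: offsets are expressed in pixel units of each scale, there is no value projection, the attention weights are normalized jointly over all scale and point pairs, and the names `deformable_attend`, `w_off`, and `w_attn` are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample feat (H, W, D) at continuous pixel coords, clamped to the map."""
    h, w, _ = feat.shape
    x = float(np.clip(x, 0.0, w - 1))
    y = float(np.clip(y, 0.0, h - 1))
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0, x0] + ax * feat[y0, x1]
    bottom = (1 - ax) * feat[y1, x0] + ax * feat[y1, x1]
    return (1 - ay) * top + ay * bottom

def deformable_attend(query, ref_xy, priors, w_off, w_attn):
    """Aggregate K sampled points per scale around a normalized reference location.

    query:  (D,) token embedding
    ref_xy: (x, y) in [0, 1]
    priors: list of L maps, each (H_l, W_l, D)
    w_off:  (L, K, 2, D) projection producing sampling offsets
    w_attn: (L, K, D) projection producing attention logits
    """
    offsets = w_off @ query                   # (L, K, 2)
    logits = w_attn @ query                   # (L, K)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax over all L*K points
    out = np.zeros_like(query, dtype=float)
    for l, feat in enumerate(priors):
        h, w, _ = feat.shape
        for k in range(w_attn.shape[1]):
            x = ref_xy[0] * (w - 1) + offsets[l, k, 0]
            y = ref_xy[1] * (h - 1) + offsets[l, k, 1]
            out += weights[l, k] * bilinear_sample(feat, x, y)
    return out

rng = np.random.default_rng(0)
D, K = 8, 4                                   # K = 4 sampling points per scale
priors = [rng.normal(size=(s, s, D)) for s in (16, 8, 4)]
w_off = rng.normal(scale=0.5, size=(3, K, 2, D))
w_attn = rng.normal(scale=0.1, size=(3, K, D))
token = rng.normal(size=D)
update = deformable_attend(token, (0.5, 0.5), priors, w_off, w_attn)
token = token + update                        # residual injection into the token stream
```

Because each query reads only a few sampled locations per scale instead of every position, the cost per token does not grow with the resolution of the prior maps, which is the usual motivation for sparse deformable attention.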
3. Language-Guided Modulation (LGM): Semantic Feature Calibration
The LGM module introduces environmental awareness by using structured natural-language context. An aligned RGB–IR pair is processed by a frozen vision–language model to generate a caption with four fields: Environment, Scene Type, Object Density, and Thermal Signature. Each field is embedded with the CLIP text encoder, giving $F_t^{env}$, $F_t^{type}$, $F_t^{obj}$, and $F_t^{therm}$.
These embeddings are concatenated and fused via a small MLP, pooled, and projected into channel-wise scale and shift vectors $\gamma$ and $\beta$, which modulate the final ViT feature map: $F_{mod} = (1 + \gamma) \odot F_{vit} + \beta$.
This approach enables the model to recalibrate representation statistics based on high-level scene semantics, improving robustness to complex or changing environmental conditions (Xiang et al., 5 Jan 2026).
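The affine recalibration can be illustrated with a short NumPy sketch. It assumes the four field embeddings are concatenated along the token axis before fusion, uses two-layer ReLU MLPs, and zero-initializes the output layers of the scale and shift heads; these choices, the tensor sizes, and the helper names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out, zero_out=False):
    """Two-layer MLP parameters; zero_out=True makes the MLP output exactly zero."""
    w1 = rng.normal(0.0, 0.02, (d_in, d_hidden))
    w2 = np.zeros((d_hidden, d_out)) if zero_out else rng.normal(0.0, 0.02, (d_hidden, d_out))
    return w1, np.zeros(d_hidden), w2, np.zeros(d_out)

def mlp(x, w1, b1, w2, b2):
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def language_guided_modulation(f_vit, field_embs, params):
    """f_vit: (H, W, D) feature map; field_embs: four (N, C) text-token embeddings."""
    sem = np.concatenate(field_embs, axis=0)      # (4N, C), fields stacked along tokens
    sem = mlp(sem, *params["fuse"])               # fuse the text features
    pooled = sem.mean(axis=0)                     # mean-pool over tokens
    gamma = mlp(pooled, *params["gamma"])         # channel-wise scale, shape (D,)
    beta = mlp(pooled, *params["beta"])           # channel-wise shift, shape (D,)
    return (1.0 + gamma) * f_vit + beta, gamma, beta

rng = np.random.default_rng(0)
D, C, N, HID = 8, 6, 5, 16                        # toy sizes, far smaller than the real model
params = {
    "fuse": init_mlp(rng, C, HID, HID),
    "gamma": init_mlp(rng, HID, HID, D, zero_out=True),
    "beta": init_mlp(rng, HID, HID, D, zero_out=True),
}
f_vit = rng.normal(size=(3, 3, D))
field_embs = [rng.normal(size=(N, C)) for _ in range(4)]   # env, type, obj, therm
f_mod, gamma, beta = language_guided_modulation(f_vit, field_embs, params)
```

With the heads zero-initialized, $\gamma = \beta = 0$ and the modulation is the identity, so under this assumed initialization the frozen backbone features pass through unchanged at the start of training. This is one practical consequence of writing the scale as $1 + \gamma$ rather than $\gamma$.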
4. Forward Computation Workflow
The forward pass of SLGNet is as follows:
    function SLGNet_Forward(I_rgb, I_ir):
        # 1. Structure-Aware Adapter (multi-scale priors)
        for l in {1, 2, 3}:
            F_v[l] = Conv_k_l_chain(I_rgb)                 # RGB stem features
            F_t[l] = Conv_k_l_chain(I_ir)                  # IR (thermal) stem features
            G_v = Sobel(F_v[l]); G_t = Sobel(F_t[l])       # edge maps
            G_ref = max(G_v, G_t)                          # element-wise reference
            M_v′ = SSIM_like(G_v, G_ref); M_t′ = SSIM_like(G_t, G_ref)
            M_v = sigmoid(M_v′); M_t = sigmoid(M_t′)       # alignment weights
            F_f[l] = M_v * F_v[l] + M_t * F_t[l]           # fused structural prior
            F_p[l] = Conv1×1_proj_to_D(F_f[l])             # project to token dimension D
        # 2. ViT Backbone + FF-Adapter
        tokens = ViT.PatchEmbed(I_rgb)
        for i in 1..12:
            tokens = ViT.Block[i].SelfAttention(tokens)
            tokens = tokens + FF_Adapter(tokens, F_p[1..3])   # inject priors at stage i
            tokens = ViT.Block[i].MLP(tokens)
        F_vit = Reshape(tokens)
        # 3. Language-Guided Modulation
        captions = Qwen2.5_VL(I_rgb, I_ir)                 # fields: env, type, obj, therm
        for each field s in {env, type, obj, therm}:
            F_t_s = CLIP_Text(captions[s])                 # text embedding of field s
        F_t_sem = MLP_proj(concat(F_t_env, F_t_type, F_t_obj, F_t_therm))
        pooled = MeanPool(F_t_sem)
        γ = MLP_γ(pooled); β = MLP_β(pooled)
        F_mod = (1 + γ) ⊙ F_vit + β
        # 4. Detection Head
        outputs = DetectorHead(F_mod)
        return outputs
This structured sequence maintains multimodal feature alignment and injects contextual modulation late in the processing pipeline (Xiang et al., 5 Jan 2026).
5. Hyperparameters, Design Choices, and Stability
Key configuration values include:
| Component | Parameterization | Values |
|---|---|---|
| ViT token dimension | $D$ | 768 |
| S-Encoder channels | Per scale (1/8, 1/16, 1/32) | 64, 128, 256 |
| FF-Adapter: sampling | Sampling points per query per scale | 4 |
| FF-Adapter MLP (stage) | Hidden dimension | 512 |
| CLIP text embedding | Embedding dimension, token count | |
| MLP for text fusion | Hidden dimension | 512 |
| MLP for $\gamma$, $\beta$ | Hidden dimension | 512 |
| SSIM constants | Stabilizing constants and dynamic range | As in standard SSIM |
The architecture is designed for stability: the backbone and the language models stay frozen, and only about 12M adapter parameters are trained, which limits the risk of catastrophic forgetting and keeps adaptation efficient (Xiang et al., 5 Jan 2026).
6. Experimental Results and Comparative Analysis
Extensive evaluation across four RGB–IR object detection benchmarks demonstrates the efficacy of SLGNet:
- LLVIP (low-light, pedestrian): mAP 66.1, mAP50 98.3, 12.1M trainable parameters (87% fewer than full fine-tuning).
- FLIR (driving, multiclass): mAP 45.1 (prev. SOTA 44.6), mAP50 85.8, 12.1M vs 244.6M parameters (DETR full-tune).
- KAIST (day/night pedestrian, miss rate MR, lower is better): overall 19.88% (prev. best 23.74%), day 21.01% (vs 23.95%), night 20.56% (vs 19.42%).
- DroneVehicle (UAV vehicle, oriented): mAP 80.7 (vs WaveMamba 79.8), largest gains in the Freight-Car class (+0.9).
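The parameter figures quoted above can be put on a common footing with a few lines of arithmetic. The sketch below only manipulates the numbers already listed; the helper names are hypothetical, and the implied full fine-tuning size is a back-of-the-envelope inference from the 87% figure, not a number reported in the source.

```python
def reduction_pct(trainable_m, full_m):
    """Percentage of parameters saved when training trainable_m instead of full_m (millions)."""
    return 100.0 * (1.0 - trainable_m / full_m)

def implied_full_count(trainable_m, reduction):
    """Full model size (millions) implied by a trainable count and a stated reduction in percent."""
    return trainable_m / (1.0 - reduction / 100.0)

# 12.1M trainable vs the 244.6M fully tuned DETR baseline quoted for FLIR.
flir_reduction = reduction_pct(12.1, 244.6)        # roughly 95%

# Full fine-tuning size implied by "87% fewer" with 12.1M trainable parameters.
implied_full = implied_full_count(12.1, 87.0)      # roughly 93M

# Margins over the previous best results quoted above.
flir_gain = round(45.1 - 44.6, 1)                  # mAP points
drone_gain = round(80.7 - 79.8, 1)                 # mAP points
kaist_drop = round(23.74 - 19.88, 2)               # miss-rate points (lower is better)
```

Read this way, the 87% reduction appears to be measured against fully fine-tuning SLGNet's own backbone, while the 244.6M figure refers to a different, larger baseline, which would explain why the two ratios differ.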
Ablations:
- SA-Adapter alone yields +2.0 mAP (FLIR) and +1.3 mAP (DroneVehicle).
- LGM adds +0.8 mAP (FLIR), +2.1 mAP (DroneVehicle).
- Adapter-tuning vs. full-tuning: 87% fewer trainable parameters, +1.5 mAP (FLIR), +3.7 mAP (DroneVehicle), with improved convergence and stability.
These results establish SLGNet as a state-of-the-art, scalable approach for multimodal perception under extreme lighting and thermal conditions, with robust generalization and efficiency (Xiang et al., 5 Jan 2026).