Query-Gated Deformable Fusion (QGDF)
- The paper introduces QGDF, which integrates learned spatial deformation with query-conditioned gating to adaptively blend feature representations.
- It employs deformable convolutional sampling and masked attention to dynamically fuse multi-view camera and LiDAR BEV features.
- Empirical evaluations demonstrate improved tracking accuracy and robust autonomous driving performance with minimal additional computational overhead.
Query-Gated Deformable Fusion (QGDF) is a unified architectural mechanism for adaptively integrating feature representations—across spatial locations and modalities—by combining learned spatial deformation with query- or location-conditioned soft gating. It was originally introduced for deformable object tracking in convolutional networks (Liu et al., 2018) and later extended to query-based multimodal fusion for end-to-end sensorimotor perception and prediction (Halinkovic et al., 28 Jan 2026). QGDF systematically addresses the limitations of rigid, fixed-grid feature aggregation and heuristic modality fusion by permitting fully differentiable, content-adaptive blending of features. The mechanism achieves strong empirical improvements in both deformable target tracking and autonomous driving benchmarks.
1. Mathematical Foundations of QGDF
a. Deformable Convolutional Fusion (Liu et al., 2018)
Given a standard CNN feature map $F \in \mathbb{R}^{H \times W \times C}$, QGDF learns a dense 2D offset field $\Delta = g_\theta(F) \in \mathbb{R}^{H \times W \times 2}$ via a shallow CNN $g_\theta$. Each feature at location $(x, y)$ is resampled from $F$ at the deformed position using bilinear interpolation:
$\tilde{F}(x, y) = \mathcal{B}\!\left(F;\; x + \Delta_x(x, y),\; y + \Delta_y(x, y)\right)$,
where $\mathcal{B}$ denotes standard separable bilinear interpolation.
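The resampling step can be sketched in NumPy. The shallow offset-predicting CNN is replaced here by a precomputed offset array, which is an illustrative assumption; only the bilinear resampling itself follows the formulation above.

```python
import numpy as np

def bilinear_sample(F, x, y):
    """Separable bilinear interpolation of feature map F (H, W, C) at real-valued (x, y)."""
    H, W, _ = F.shape
    x = np.clip(x, 0, W - 1)
    y = np.clip(y, 0, H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * ((1 - wx) * F[y0, x0] + wx * F[y0, x1])
            + wy * ((1 - wx) * F[y1, x0] + wx * F[y1, x1]))

def deform_resample(F, offsets):
    """Resample every location (x, y) of F at (x + dx, y + dy),
    given a dense 2-channel offset field of shape (H, W, 2)."""
    H, W, C = F.shape
    out = np.empty_like(F)
    for y in range(H):
        for x in range(W):
            dx, dy = offsets[y, x]
            out[y, x] = bilinear_sample(F, x + dx, y + dy)
    return out
```

With an all-zero offset field the operation reduces to the identity, which is a useful sanity check when wiring the module into a network.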
b. Query-Based Multimodal Fusion (Halinkovic et al., 28 Jan 2026)
In transformer-based multimodal architectures, QGDF operates on a set of object queries $Q_t$, fusing multi-view camera features $\{I^\ell\}$ and LiDAR BEV features $P$ at each decoder layer. The mechanism includes three key submodules:
- Masked Attention Aggregation: Feature-point-wise bilinear sampling on the camera feature pyramid, followed by masked softmax-weighted aggregation (per-view, per-query).
- Deformable BEV Sampling: Per-query local offsets are predicted to adapt the sampling location for LiDAR BEV features.
- Query-Conditioned Gating: Learned per-query, per-branch softmax weights gate the aggregated image and LiDAR representations before final projection.
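As a hedged sketch of the masked attention aggregation, the following NumPy snippet computes masked softmax weights over per-(view, level) samples for one query and aggregates them; all shapes, logits, and the validity mask are illustrative assumptions, not values from the paper.

```python
import numpy as np

def masked_softmax(logits, mask, axis=-1):
    """Softmax over `axis`, giving zero weight to entries whose mask is False.
    Rows with no valid entry receive all-zero weights."""
    z = logits + np.where(mask, 0.0, -1e9)   # suppress invalid entries
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z) * mask
    denom = e.sum(axis=axis, keepdims=True)
    return np.where(denom > 0, e / np.maximum(denom, 1e-12), 0.0)

# One query, 3 candidate (view, level) samples, feature dim 4; sample 2 falls
# outside its camera frustum and is masked out.
feats = np.arange(12, dtype=float).reshape(3, 4)   # sampled features S
logits = np.array([0.2, 1.0, 5.0])                 # per-sample weights from an FFN
valid = np.array([True, True, False])              # projection validity mask M
alpha = masked_softmax(logits, valid)
fused = alpha @ feats                              # weighted aggregation
```

Note that the masked entry gets exactly zero weight even though its raw logit is the largest, which is the behavior that suppresses out-of-view projections.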
2. Gating Mechanisms and Fusion Equation
a. Gating in CNN-Based Tracking
A soft gate $G \in [0, 1]^{H \times W}$ is predicted per spatial location as $G = \sigma(h_\phi(F))$, where $h_\phi$ is a two-layer MLP applied on $F$ and $\sigma$ is the logistic sigmoid. The fused representation at each location is
$F_{\mathrm{out}} = G \odot \tilde{F} + (1 - G) \odot F$,
with $\odot$ denoting element-wise multiplication.
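The per-location gated blend can be sketched in NumPy; the two-layer MLP is replaced by precomputed gate logits, an assumption made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(F, F_deformed, gate_logits):
    """Blend original and deformed features with a per-location soft gate:
    F_out = G * F_deformed + (1 - G) * F, with G = sigmoid(gate_logits)."""
    G = sigmoid(gate_logits)[..., None]   # (H, W) -> (H, W, 1), broadcast over channels
    return G * F_deformed + (1.0 - G) * F

H, W, C = 2, 2, 3
F = np.zeros((H, W, C))            # original features
F_def = np.ones((H, W, C))         # deformably resampled features
out = gated_fusion(F, F_def, np.zeros((H, W)))   # zero logits -> gate = 0.5
```

Zero logits yield an even blend; strongly positive or negative logits let the gate commit almost entirely to one branch, which is what makes the fusion content-adaptive.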
b. Query-Conditioned Gating for Multi-Modal Transformers
For queries $Q_t$ in the transformer decoder, gating logits are predicted from the query together with both aggregated branch summaries:
$g_t = \mathrm{FFN}_{\mathrm{gate}}([\bar{Q}^I_t; \bar{Q}^L_t; Q_t]), \quad \gamma_t = \mathrm{softmax}(g_t)$,
with the softmax taken over the two modality branches. Gated modality-specific features are then
$\hat{Q}^I_t = \gamma^{(I)}_t \, \bar{Q}^I_t, \quad \hat{Q}^L_t = \gamma^{(L)}_t \, \bar{Q}^L_t$,
which are concatenated and projected by $\mathrm{FFN}_{\mathrm{fuse}}$ before the residual update of the query.
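A minimal NumPy sketch of the per-query two-branch softmax gate follows; the shapes ($N_q$ queries, embedding dim $E$) and the substitution of precomputed gate logits for the gating FFN are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def query_gated_fusion(q_img, q_lidar, gate_logits):
    """Per-query softmax gate over the two modality branches.
    q_img, q_lidar: (N_q, E) branch summaries; gate_logits: (N_q, 2)."""
    gamma = softmax(gate_logits, axis=-1)        # (N_q, 2), rows sum to 1
    q_img_g = gamma[:, 0:1] * q_img              # image branch scaled by its gate
    q_lidar_g = gamma[:, 1:2] * q_lidar          # LiDAR branch scaled by its gate
    return np.concatenate([q_img_g, q_lidar_g], axis=-1)  # input to the fuse FFN

N_q, E = 4, 8
rng = np.random.default_rng(0)
fused = query_gated_fusion(rng.normal(size=(N_q, E)),
                           rng.normal(size=(N_q, E)),
                           np.zeros((N_q, 2)))   # equal gates: gamma = (0.5, 0.5)
```

Because the gate is normalized per query, each query can independently lean on cameras or LiDAR depending on its content.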
3. Architectural Integration and Training Protocols
a. GDT Tracker (Liu et al., 2018)
- Backbone: VGG-M conv1–3.
- QGDF Block: Three-way split. Path A passes the original features $F$ unchanged, Path B applies deformable sampling to produce $\tilde{F}$, and Path C predicts the soft gate $G$.
- Fusion: As above, followed by classifier head (three FC layers for foreground/background discrimination) and bounding-box regression.
- Training:
b. Li-ViP3D++ (Halinkovic et al., 28 Jan 2026)
- QGDF Location: At each decoder layer before cross-attention.
- Parameters: Embedding dimension $E$, multi-view camera inputs, FPN levels $\ell = 1, \dots, L$, and BEV channels $C_L$ (notation as in the algorithm of Section 4).
- Differentiability: All components (sampling, masking, gating) are differentiable; integrated into joint classification, detection, and forecasting loss via gradient backpropagation.
4. Detailed Algorithmic Procedure
```
Input:
  Q_t     # (B, N_q, E) queries
  r       # (B, N_q, 3) normalized reference points
  {I^ℓ}   # multi-view FPN features, ℓ = 1 … L
  P       # (B, C_L, H_p, W_p) LiDAR BEV features

# --- Masked attention aggregation over camera views ---
S   ← Sample({I^ℓ}, r)               # bilinear sampling on the feature pyramid
M   ← ValidityMask({I^ℓ}, r)         # which reference points project into each view
ω^I ← FFN_I(Q_t)
α^I ← masked_softmax(ω^I, M)
F^I ← Σ_{n,ℓ} α^I[b,n,ℓ,q] * S[b,n,ℓ,q,:]
barQ^I_t ← LayerNorm(W^I_proj * F^I)

# --- Deformable BEV sampling ---
P'  ← LayerNorm(Conv1x1(P))
ΔP  ← FFN_P(Q_t)                     # per-query raw offsets
g_0 ← map_to_grid(r[..., :2])
g   ← clip(g_0 + s*tanh(ΔP), -1, 1)  # bounded deformed sampling grid
S^L ← GridSample(P', g)
F^L ← (1/N_p) * Σ_p S^L[b,q,p,:]     # average over the N_p BEV sampling points
barQ^L_t ← LayerNorm(W^L_proj * F^L)

# --- Query-conditioned gating and fusion ---
g_t ← FFN_gate(concat([barQ^I_t, barQ^L_t, Q_t]))
γ_t ← softmax(g_t, dim=2)
hatQ^I_t ← γ_t[...,0].unsqueeze(-1) * barQ^I_t
hatQ^L_t ← γ_t[...,1].unsqueeze(-1) * barQ^L_t
tildeQ_t ← FFN_fuse(concat([hatQ^I_t, hatQ^L_t]))
P_t ← PE(inverse_sigmoid(r))
Q_{t+1} ← tildeQ_t + Q_t + P_t
```
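The bounded grid-deformation step (`g ← clip(g_0 + s*tanh(ΔP), -1, 1)`) can be illustrated in NumPy. Here `map_to_grid` is assumed to map normalized $[0, 1]$ coordinates to the $[-1, 1]$ grid-sampling range, and the scale `s` is an illustrative choice; both are assumptions, not values from the paper.

```python
import numpy as np

def deform_grid(ref_xy, raw_offsets, scale=0.1):
    """Shift normalized reference points by tanh-bounded offsets and clip to
    the grid-sampling range [-1, 1], mirroring g = clip(g0 + s*tanh(dP), -1, 1)."""
    g0 = 2.0 * ref_xy - 1.0                       # assumed map_to_grid: [0,1] -> [-1,1]
    return np.clip(g0 + scale * np.tanh(raw_offsets), -1.0, 1.0)

ref = np.array([[0.5, 0.5], [1.0, 0.0]])          # (N_q, 2) normalized ref points
raw = np.array([[100.0, -100.0], [100.0, 0.0]])   # unbounded FFN outputs
g = deform_grid(ref, raw)
```

The `tanh` bound keeps each learned offset within `±scale` of the reference location, so arbitrarily large FFN outputs cannot move a sample outside a local neighborhood, and the final clip keeps the grid valid for the sampler.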
5. Empirical Results and Component Ablations
a. Deformable Tracking (Liu et al., 2018)
- OTB-2013 AUC: Baseline $0.701$, +deformable $0.702$, +gate $0.711$.
- Deformation-dominated subset: AUC improves after adding deformable convolution and is maintained when gating is added.
- Deform-SOT: GDT with QGDF outperforms part-based approaches on all evaluated challenges.
- VOT-2016/2017: Maintains high accuracy, strong robustness, and top EAO across configurations.
b. Autonomous Driving (Halinkovic et al., 28 Jan 2026)
- nuScenes:
- Component Analysis:
- Removing masked attention increases FP ratio and drops EPA by 8 points.
- Disabling offsets in BEV sampling loses 5 mAP points.
- Removing gating yields intermediate values but no configuration matches full QGDF.
Interpretation: Each submodule—content-aware camera aggregation, adaptive BEV deformation, and per-query gating—contributes additively to reducing false positives and improving prediction performance.
6. Implementation and Computational Considerations
- GDT (Liu et al., 2018): Efficient runtime via a single bilinear sampler for the deformations and a small MLP for gate prediction; tracking speed reported in FPS on a GTX 1080 Ti.
- Li-ViP3D++ (Halinkovic et al., 28 Jan 2026): Fully end-to-end; all operations (bilinear sampling, masking, FFNs) are implemented with differentiable primitives in standard deep-learning frameworks, with per-frame runtime below that of the prior non-QGDF variant.
- Overhead: The QGDF module introduces minimal computational overhead relative to its gains in accuracy and robustness.
7. Significance and Distinctiveness
QGDF represents a principled departure from static or ad-hoc fusion mechanisms. It enables dynamic, instance- and context-adaptive feature blending, leveraging both spatial deformation (to handle local appearance/misalignment) and gating (to control modal contributions). The approach ensures full differentiability for end-to-end learning in both object tracking and multimodal sensorimotor prediction, leading to measurable gains in robustness, false positive reduction, and alignment with ground-truth semantics across challenging, deformable, and multi-sensor domains (Liu et al., 2018, Halinkovic et al., 28 Jan 2026).