
Query-Gated Deformable Fusion (QGDF)

Updated 4 February 2026
  • The paper introduces QGDF, which integrates learned spatial deformation with query-conditioned gating to adaptively blend feature representations.
  • It employs deformable convolutional sampling and masked attention to dynamically fuse multi-view camera and LiDAR BEV features.
  • Empirical evaluations demonstrate improved tracking accuracy and robust autonomous driving performance with minimal additional computational overhead.

Query-Gated Deformable Fusion (QGDF) is a unified architectural mechanism for adaptively integrating feature representations—across spatial locations and modalities—by combining learned spatial deformation with query- or location-conditioned soft gating. It was originally introduced for deformable object tracking in convolutional networks (Liu et al., 2018) and later extended to query-based multimodal fusion for end-to-end sensorimotor perception and prediction (Halinkovic et al., 28 Jan 2026). QGDF systematically addresses the limitations of rigid, fixed-grid feature aggregation and heuristic modality fusion by permitting fully differentiable, content-adaptive blending of features. The mechanism achieves strong empirical improvements in both deformable target tracking and autonomous driving benchmarks.

1. Mathematical Foundations of QGDF

Given a standard CNN feature map $X \in \mathbb{R}^{H \times W \times C}$, QGDF learns a dense 2D offset field $\Delta P \in \mathbb{R}^{H \times W \times 2}$ via a shallow CNN $f_\theta$:

$$\Delta P = f_\theta(X), \qquad \Delta p(i,j) = f_\theta(X)_{ij} \in \mathbb{R}^2.$$

Each feature at $p_0 = (i, j)$ is resampled from $X$ at position $p_0 + \Delta p(p_0)$ using bilinear interpolation:

$$X'(p_0) = F_\mathrm{sample}\big(X,\, p_0 + \Delta p(p_0)\big),$$

with

$$F_\mathrm{sample}(X, d) = \sum_{x=1}^{W} \sum_{y=1}^{H} G([x, y], d) \cdot X(x, y),$$

where $G$ denotes the standard separable bilinear interpolation kernel.
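Because the bilinear kernel gives non-zero weight only to the four integer neighbours of the sampling point $d$, the sum collapses to at most four terms. A minimal NumPy sketch of the sampling operator (the function name `bilinear_sample` is illustrative, not from the papers):

```python
import numpy as np

def bilinear_sample(X, d):
    """Sample feature map X of shape (H, W, C) at a fractional location d = (x, y).

    Implements F_sample(X, d): only the four integer neighbours of d receive
    non-zero bilinear weight; out-of-range neighbours contribute zero.
    """
    H, W, C = X.shape
    x, y = d
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(C)
    for xi in (x0, x0 + 1):
        for yi in (y0, y0 + 1):
            if 0 <= xi < W and 0 <= yi < H:
                w = max(0.0, 1.0 - abs(x - xi)) * max(0.0, 1.0 - abs(y - yi))
                out += w * X[yi, xi]
    return out
```

Sampling at an integer location recovers the original feature; sampling at a cell centre averages the four surrounding features, which is what makes the deformed offsets differentiable.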

In transformer-based multimodal architectures, QGDF operates on a set of queries $Q_t \in \mathbb{R}^{B \times N_q \times E}$, fusing multi-view camera features $\{I^\ell\}$ and LiDAR BEV features $P$ at each decoder layer. The mechanism includes three key submodules:

  1. Masked Attention Aggregation: Feature-point-wise bilinear sampling on the camera feature pyramid, followed by masked softmax-weighted aggregation (per-view, per-query).
  2. Deformable BEV Sampling: Per-query local offsets $\Delta P$ are predicted to adapt the sampling locations for LiDAR BEV features.
  3. Query-Conditioned Gating: Learned per-query, per-branch softmax weights $\gamma_t$ gate the aggregated image and LiDAR representations before the final projection.
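The masked aggregation in submodule 1 amounts to a softmax over per-view logits in which invalid views (those a query's reference point does not project into) receive exactly zero weight. A minimal NumPy sketch, assuming a boolean validity mask over the last axis (`masked_softmax` is an illustrative helper name):

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax over the last axis restricted to entries where mask is True.

    Masked entries get weight exactly 0; remaining weights sum to 1.
    Mirrors the per-query, per-view weighting in masked attention aggregation.
    """
    z = np.where(mask, logits, -np.inf)       # suppress invalid views
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.where(mask, np.exp(z), 0.0)
    return e / e.sum(axis=-1, keepdims=True)
```

This guarantees that camera views a query cannot see contribute nothing to the fused image feature, which the ablations in Section 5 suggest is important for the false-positive ratio.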

2. Gating Mechanisms and Fusion Equation

a. Gating in CNN-Based Tracking

A soft gate $g \in \mathbb{R}^{H \times W}$ is predicted per spatial location, $g = \sigma(F_\mathrm{gate}(X))$, where $F_\mathrm{gate}$ is a two-layer MLP applied to $X$. The fused representation at each location is

$$Y = g \odot X' + (1 - g) \odot X,$$

with $\odot$ denoting element-wise multiplication.
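The fusion is a per-location convex blend between the original and deformably resampled features. A minimal NumPy sketch under the assumption that the gate is a scalar per spatial location (the helper name `gated_fuse` is illustrative):

```python
import numpy as np

def gated_fuse(X, X_deformed, gate_logits):
    """Y = g ⊙ X' + (1 - g) ⊙ X with g = sigmoid(gate_logits).

    X, X_deformed: (H, W, C) original and deformably resampled feature maps.
    gate_logits:   (H, W) per-location scalars predicted by F_gate.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))    # sigmoid
    return g[..., None] * X_deformed + (1.0 - g)[..., None] * X
```

Saturating the gate toward 1 recovers the purely deformed features, toward 0 the rigid-grid baseline, so the network can fall back to standard sampling wherever deformation is unhelpful.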

b. Query-Conditioned Gating for Multi-Modal Transformers

For queries in transformer decoder layers, gating is performed as:

$$g_t = \mathrm{FFN}_\mathrm{gate}\big([\bar Q_t^I, \bar Q_t^L, Q_t]\big) \in \mathbb{R}^{B \times N_q \times 2},$$

$$\gamma_{t,b,q,k} = \frac{\exp(g_{t,b,q,k})}{\sum_{k'=0}^{1} \exp(g_{t,b,q,k'})}.$$

Gated modality-specific features are then:

$$\hat Q_t^I = \gamma_{t,\ldots,0} \cdot \bar Q_t^I, \qquad \hat Q_t^L = \gamma_{t,\ldots,1} \cdot \bar Q_t^L,$$

$$\tilde Q_t = \mathrm{FFN}_\mathrm{fuse}\big([\hat Q_t^I, \hat Q_t^L]\big).$$
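The two-way softmax makes the camera and LiDAR weights compete per query, so each query expresses a soft modality preference. A minimal NumPy sketch of the gating step (the helper name `query_gate` is illustrative; the final FFN_fuse projection is omitted):

```python
import numpy as np

def query_gate(g_t, barQ_I, barQ_L):
    """Per-query two-way softmax gate over modality branches.

    g_t:    (B, N_q, 2) gate logits from FFN_gate
    barQ_I: (B, N_q, E) aggregated camera features
    barQ_L: (B, N_q, E) aggregated LiDAR BEV features
    Returns the gated branch features (hatQ_I, hatQ_L).
    """
    e = np.exp(g_t - g_t.max(axis=-1, keepdims=True))
    gamma = e / e.sum(axis=-1, keepdims=True)     # (B, N_q, 2), rows sum to 1
    hatQ_I = gamma[..., 0:1] * barQ_I             # broadcast over E
    hatQ_L = gamma[..., 1:2] * barQ_L
    return hatQ_I, hatQ_L
```

With equal logits each branch is scaled by exactly 0.5; as one logit grows, that modality dominates the fused representation.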

3. Architectural Integration and Training Protocols

  • Backbone: VGG-M conv1–3 (output $X$).
  • QGDF Block: Three-way split: Path A passes $X$ through, Path B applies deformable sampling, Path C predicts the gate $g$.
  • Fusion: As above, followed by classifier head (three FC layers for foreground/background discrimination) and bounding-box regression.
  • Training:
    • Three phases: baseline, add deformable branch, add gating branch and end-to-end fusion.
    • Offline pretraining: 200k SGD iterations on OTB/VOT with stratified IoU sampling.
    • Online: Backbone frozen, gate+FC fine-tuned per video, online updates via hard negative mining.
  • QGDF Location: At each decoder layer before cross-attention.
  • Parameters: Embedding dimension $E = 256$, camera views $N_\mathrm{cam} = 6$, FPN levels $L = 3$, BEV channels $C_L = 256$.
  • Differentiability: All components (sampling, masking, gating) are differentiable; integrated into joint classification, detection, and forecasting loss via gradient backpropagation.

4. Detailed Algorithmic Procedure

Input:
    Q_t      # (B, N_q, E) queries
    r        # (B, N_q, 3) normalized ref points
    {I^ℓ}    # multi-view FPN features, ℓ=1...L
    P        # (B, C_L, H_p, W_p) LiDAR BEV

S ← Sample({I^ℓ}, r)                      # bilinear samples per view and FPN level
M ← ValidityMask({I^ℓ}, r)                # 1 where r projects inside a view
ω^I ← FFN_I(Q_t)
α^I ← masked_softmax(ω^I, M)
F^I ← Σ_{n,ℓ} α^I[b,n,ℓ,q] * S[b,n,ℓ,q,:]
barQ^I_t ← LayerNorm(W^I_proj * F^I)

P' ← LayerNorm(Conv1x1(P))
ΔP ← FFN_P(Q_t)
g_0 ← map_to_grid(r[..., :2])
g ← clip(g_0 + s*tanh(ΔP), -1, 1)
S^L ← GridSample(P', g)
F^L ← (1/N_p) * Σ_p S^L[b,q,p,:]          # mean over the N_p sampled BEV points
barQ^L_t ← LayerNorm(W^L_proj * F^L)

g_t ← FFN_gate(concat([barQ^I_t, barQ^L_t, Q_t]))
γ_t ← softmax(g_t, dim=2)
hatQ^I_t ← γ_t[...,0].unsqueeze(-1) * barQ^I_t
hatQ^L_t ← γ_t[...,1].unsqueeze(-1) * barQ^L_t
tildeQ_t ← FFN_fuse(concat([hatQ^I_t, hatQ^L_t]))

P_t ← PE(inverse_sigmoid(r))
Q_{t+1} ← tildeQ_t + Q_t + P_t
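Two steps of the procedure are easy to get subtly wrong in practice: bounding the predicted BEV offsets so every sample stays inside the grid, and inverting the sigmoid of the normalized reference points for the positional term. A minimal NumPy sketch of both (the offset scale `s` and the helper names are illustrative assumptions, not values from the paper):

```python
import numpy as np

def deform_grid(ref_xy, delta, s=0.1):
    """Deformable BEV sampling grid: g = clip(g_0 + s*tanh(ΔP), -1, 1).

    ref_xy: (N_q, P, 2) base grid locations in [-1, 1] (from map_to_grid(r))
    delta:  (N_q, P, 2) raw per-query offsets predicted by FFN_P
    s:      offset scale; tanh bounds the shift, clip keeps samples in-grid
    """
    return np.clip(ref_xy + s * np.tanh(delta), -1.0, 1.0)

def inverse_sigmoid(x, eps=1e-5):
    """Logit of a normalized reference point, used for the positional term P_t."""
    x = np.clip(x, eps, 1.0 - eps)
    return np.log(x / (1.0 - x))
```

Because `tanh` saturates, even arbitrarily large raw offsets move a sample by at most `s` in normalized coordinates, and the final clip handles reference points already near the BEV boundary.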

5. Empirical Results and Component Ablations

  • OTB-2013 AUC: Baseline $0.701$, +deformable $0.702$, +gate $0.711$.
  • Deformation subset: AUC $0.697 \to 0.712$ after adding deformable convolution, maintained with gating.
  • Deform-SOT: GDT with QGDF outperforms part-based approaches on all evaluated challenges.
  • VOT-2016/2017: Maintains high accuracy, strong robustness, and top EAO across configurations.
  • nuScenes:
    • EPA: $0.335$ (Li-ViP3D++ w/ QGDF) vs. $0.250$ (prior baseline).
    • mAP: $0.502$ vs. $0.472$.
    • FP ratio: $0.147$ vs. $0.221$.
  • Component Analysis:
    • Removing masked attention increases FP ratio and drops EPA by 8 points.
    • Disabling offsets in BEV sampling loses $\sim 5$ mAP points.
    • Removing gating yields intermediate values but no configuration matches full QGDF.

Interpretation: Each submodule—content-aware camera aggregation, adaptive BEV deformation, and per-query gating—contributes additively to reducing false positives and improving prediction performance.

6. Implementation and Computational Considerations

  • GDT (Liu et al., 2018): Efficient runtime via a single bilinear sampler for deformations and a small MLP for gate prediction; overall tracking speed $\approx 1.33$ FPS (GTX 1080 Ti).
  • Li-ViP3D++ (Halinkovic et al., 28 Jan 2026): End-to-end; all operations implemented with differentiable primitives (bilinear sampling, FFNs) in standard deep learning frameworks; runtime $139.82\ \mathrm{ms}$ per frame, less than the prior non-QGDF variant.
  • Overhead: QGDF module introduces minimal computational burden compared to benefits in accuracy and robustness.

7. Significance and Distinctiveness

QGDF represents a principled departure from static or ad-hoc fusion mechanisms. It enables dynamic, instance- and context-adaptive feature blending, leveraging both spatial deformation (to handle local appearance/misalignment) and gating (to control modal contributions). The approach ensures full differentiability for end-to-end learning in both object tracking and multimodal sensorimotor prediction, leading to measurable gains in robustness, false positive reduction, and alignment with ground-truth semantics across challenging, deformable, and multi-sensor domains (Liu et al., 2018, Halinkovic et al., 28 Jan 2026).
