Quantized DETR (Q-DETR) Overview
- The paper introduces a quantization method for DETR that uses information-theoretic and Fisher-based criteria to preserve accuracy with low-bit representations.
- It leverages distribution rectification distillation and foreground-aware query matching to align quantized queries with the full-precision teacher model.
- The approach employs Fisher-aware mixed precision and a two-stage training process with QAT and Fisher-trace regularization, significantly reducing model size and computation.
Quantized DETR (Q-DETR) denotes a family of methods that enable efficient, low-bit quantization of the Detection Transformer (DETR) architecture and its variants, targeting significant reductions in computation and memory requirements while minimizing performance degradation. Core challenges center on information loss in quantized attention and regression modules, as well as disparate quantization sensitivities across category labels. Modern Q-DETR approaches employ information-theoretic and Fisher information-based criteria to adaptively allocate precision and stabilize key representations, thereby achieving state-of-the-art accuracy under tight quantization constraints (Xu et al., 2023, Yang et al., 2024).
1. Quantization Challenges in DETR-based Architectures
The DETR framework introduces end-to-end object detection using transformers, with a sequence of object queries interacting via cross-attention with image-encoded features. Standard quantization-aware training (QAT) of DETR typically yields substantial accuracy drops, especially at ≤4-bit precision. This degradation is most severe in the decoder's attention layers due to their pivotal role in maintaining the fidelity of object query distributions. Empirically, quantized queries exhibit distorted distributions—misaligning attention maps and leading to substantial mean average precision (mAP) reductions. The impact is highly non-uniform, with certain object categories (e.g., persons, animals) demonstrating heightened sensitivity to quantization due to sharper local loss landscapes and increased risk of overfitting post-QAT (Xu et al., 2023, Yang et al., 2024).
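As a toy illustration (not the papers' quantizer), the information loss described above can be observed directly: symmetric 4-bit quantization collapses a continuous query distribution onto at most 15 distinct levels, distorting the statistics attention relies on. All names below are illustrative.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Symmetric uniform quantizer: scale by max |x|, round to an integer grid."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit signed
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 256))           # stand-in for decoder object queries

q4 = quantize_symmetric(queries, bits=4)
levels = np.unique(q4).size                     # at most 2*7 + 1 = 15 distinct values
err = np.abs(queries - q4).mean()               # mean absolute rounding error
```

At 4 bits the quantized tensor occupies only a handful of levels, which is exactly the regime where the decoder's query distributions degrade.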
2. Information Bottleneck Distillation and Distribution Rectification
A central innovation is the distribution rectification distillation (DRD) scheme grounded in the Information Bottleneck (IB) principle. Treating the real-valued DETR as a teacher and the quantized model as a student, the distillation objective is formalized as:

$$\max_{\theta_s}\; I(x;\, E_s) + I(q_s;\, q_t, y^{gt}),$$

where $I(\cdot\,;\cdot)$ denotes mutual information, $x$ is the input image, $E_s$ the student encoder output, $q_s$ and $q_t$ the student/teacher queries, and $y^{gt}$ the ground-truth detection set. Since $I(q_s;\, q_t, y^{gt}) = H(q_s) - H(q_s \mid q_t, y^{gt})$, the critical query-alignment term is recast as a bi-level problem. At the inner level, the query self-entropy $H(q_s)$ is maximized via Gaussian normalization:

$$\tilde{q}_s = \gamma\,\frac{q_s - \mu(q_s)}{\sigma(q_s)} + \beta,$$

with $\gamma$, $\beta$ trainable and $\mu(\cdot)$, $\sigma(\cdot)$ the query mean and standard deviation. At the upper level, a foreground-aware query matching (FQM) mechanism filters and aligns student and teacher queries based on high generalized IoU (GIoU) overlap with ground-truth boxes. The conditional-entropy term $H(q_s \mid q_t, y^{gt})$ is minimized via matching of co-attended features, i.e., an alignment penalty over the matched pairs:

$$\mathcal{L}_{\mathrm{DRD}} = \frac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}} \big\|\, \tilde{q}_s^{(i)} - q_t^{(i)} \,\big\|_2^2,$$

where $\mathcal{M}$ is the set of FQM-selected query pairs.
The final training objective is the sum of the standard DETR loss and the DRD penalty (Xu et al., 2023).
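A minimal NumPy sketch of the DRD components described above: Gaussian normalization of student queries (inner-level entropy maximization), GIoU-based foreground matching against ground-truth boxes (FQM), and an L2 alignment penalty standing in for the conditional-entropy term. Function names, shapes, and the exact penalty form are assumptions, not the paper's implementation.

```python
import numpy as np

def gaussian_normalize(q, gamma=1.0, beta=0.0, eps=1e-6):
    """Inner level: rectify the student query distribution toward a Gaussian
    (entropy-maximizing for fixed variance); gamma/beta are trainable in practice."""
    mu = q.mean(axis=-1, keepdims=True)
    sd = q.std(axis=-1, keepdims=True)
    return gamma * (q - mu) / (sd + eps) + beta

def giou(a, b):
    """Generalized IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    hull = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (hull - union) / hull

def foreground_mask(pred_boxes, gt_boxes, tau=0.5):
    """Upper level (FQM): keep queries whose box strongly overlaps some ground truth."""
    return np.array([max(giou(p, g) for g in gt_boxes) > tau for p in pred_boxes])

def drd_loss(q_s, q_t, mask):
    """Assumed L2 alignment penalty over matched (foreground) query pairs."""
    qs = gaussian_normalize(q_s[mask])
    return float(((qs - q_t[mask]) ** 2).mean())

# usage: two predicted boxes, one ground-truth box
pred = np.array([[0., 0., 1., 1.], [5., 5., 6., 6.]])
gt   = np.array([[0., 0., 1., 1.]])
mask = foreground_mask(pred, gt)        # only the first query is foreground
```

Filtering on GIoU before distilling keeps the penalty focused on queries that actually attend to objects, which is the intent of FQM.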
3. Fisher-aware Mixed Precision and Critical-category Objectives
An alternative paradigm formulates quantization as a sensitivity-assignment problem leveraging Fisher information. The Hessian of the loss with respect to the network parameters is approximated by the diagonal Fisher information matrix $F$. Per-layer sensitivity is quantified via the trace entries $\mathrm{Tr}(F_l)$, accumulated as squared gradient norms over a calibration set. A mixed-precision allocation is then determined by solving the integer linear program:

$$\min_{\{b_l\}} \;\sum_l \mathrm{Tr}(F_l)\,\big\| W_l - Q_{b_l}(W_l) \big\|_2^2 \quad \text{s.t.} \quad \sum_l b_l\,|W_l| \le B,$$

where $b_l$ is the per-layer bit-width, $W_l$ the float weights, $Q_{b_l}(W_l)$ the quantized weights, $|W_l|$ the parameter count of layer $l$, and $B$ the total bit budget. To address fine-grained application needs, a composite sensitivity metric

$$\Omega_l = \mathrm{Tr}\!\big(F_l(\mathcal{L}_{\mathrm{DETR}})\big) + \lambda\,\mathrm{Tr}\!\big(F_l(\mathcal{L}_{\mathrm{crit}})\big)$$

weights the standard DETR loss $\mathcal{L}_{\mathrm{DETR}}$ against a critical-category collapse loss $\mathcal{L}_{\mathrm{crit}}$, in which all non-critical class logits are merged so that the optimization protects task-relevant categories (Yang et al., 2024).
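The allocation step can be sketched as follows: per-layer diagonal Fisher traces are accumulated as squared gradients over a calibration set, and bit-widths are then assigned under a weight-bit budget. For brevity this sketch solves the allocation greedily (repeatedly lowering the bit-width of the least-sensitive layer) rather than with an exact ILP solver; all names are illustrative.

```python
import numpy as np

def fisher_traces(per_layer_grads):
    """Tr(F_l) ~ sum over calibration batches of squared per-layer gradients.
    per_layer_grads[l] is a list of gradient arrays (one per batch) for layer l."""
    return [sum(float((g ** 2).sum()) for g in grads)
            for grads in per_layer_grads]

def allocate_bits(traces, sizes, budget_bits, choices=(8, 4, 2)):
    """Greedy stand-in for the ILP: start every layer at the highest precision,
    then drop the least Fisher-sensitive layer one step at a time until the
    total weight-bit budget is met."""
    bits = [choices[0]] * len(traces)
    order = np.argsort(traces)                      # least sensitive first
    step = {choices[i]: choices[i + 1] for i in range(len(choices) - 1)}
    total = lambda: sum(b * s for b, s in zip(bits, sizes))
    while total() > budget_bits:
        for l in order:
            if bits[l] in step:                     # this layer can still drop
                bits[l] = step[bits[l]]
                break
        else:
            break                                   # budget infeasible at lowest precision
    return bits
```

Sensitive layers (large `traces[l]`) keep high precision for as long as the budget allows, which mirrors the ILP's weighting of quantization error by the Fisher trace.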
4. Training Regimes: Quantization-aware Training and Fisher-trace Regularization
A two-stage training protocol is employed. Initially, per-layer precisions are allocated using the Fisher-aware scheme (on either the overall or the critical loss). Subsequently, quantization-aware training (QAT) fine-tunes the entire network, applying the quantizer in each forward pass and propagating gradients through it with a straight-through estimator. To further suppress overfitting and enhance generalization on critical classes, Fisher-trace regularization is introduced during QAT:

$$\mathcal{L}_{\mathrm{QAT}} = \mathcal{L}_{\mathrm{DETR}} + \lambda\,\mathrm{Tr}(F).$$

Here, the trace is computed diagonally (as a sum of squared gradients), and $\lambda$ is annealed from a small initial value to its final value over 50 epochs, first permitting loss convergence and then penalizing sharp minima by flattening the Fisher landscape. Empirical results demonstrate that this schedule curbs late-stage overfitting without sacrificing overall mAP (Yang et al., 2024).
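The annealed penalty can be sketched as a simple schedule: λ ramps up over the 50-epoch window, and the regularized objective adds λ times the diagonal Fisher trace. The linear ramp and the endpoint values below are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

def lambda_schedule(epoch, total_epochs=50, lam_min=1e-6, lam_max=1e-3):
    """Linear anneal from lam_min to lam_max over total_epochs (illustrative endpoints)."""
    t = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return lam_min + t * (lam_max - lam_min)

def regularized_loss(task_loss, grads, epoch):
    """QAT objective: task loss + lambda * diagonal Fisher trace (sum of squared grads)."""
    trace = sum(float((g ** 2).sum()) for g in grads)
    return task_loss + lambda_schedule(epoch) * trace
```

Early in training λ is negligible, so the task loss dominates; by the end of the window the trace penalty is three orders of magnitude stronger, discouraging sharp minima.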
5. Implementation and Engineering Considerations
Implementation employs symmetric per-layer uniform quantization, with each linear and attention module quantized as a group and biases/LayerNorm left in floating point. In Q-DETR (as specified in (Xu et al., 2023)), the LSQ quantizer with channel-wise scaling is favored, and attention activations are kept at 8 bits for stability. Both DETR and its improved variants (Deformable DETR, DAB-DETR, SMCA-DETR) are supported. Training schedules typically involve hundreds of epochs, with final Q-DETR models being 4–16× smaller and 6–16× cheaper computationally than their full-precision counterparts. Only weights are mixed-precision quantized; activations can optionally be group-quantized but are not the focus of the core methodology. No clamping is performed in weight scaling; the quantizer scale is the per-layer maximum absolute value.
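The size reductions quoted above follow directly from the bit assignment. A quick weight-only storage calculator (illustrative layer sizes):

```python
def model_size_mb(layer_params, layer_bits):
    """Weight-only storage in MB for a per-layer bit assignment."""
    total_bits = sum(n * b for n, b in zip(layer_params, layer_bits))
    return total_bits / 8 / 1e6          # bits -> bytes -> MB

# hypothetical two-layer model with 40M weights total
fp32 = model_size_mb([10_000_000, 30_000_000], [32, 32])   # full precision
q4   = model_size_mb([10_000_000, 30_000_000], [4, 4])     # uniform 4-bit weights
```

Uniform 4-bit weights give an 8× reduction (32/4); mixing in 2-bit layers pushes the ratio toward the 16× end of the reported range.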
6. Empirical Performance and Ablations
Consistent empirical findings across benchmarks (COCO, VOC) show competitive or superior accuracy at extreme quantization. For instance, with DETR-R50 on COCO, the 4-4-8 Q-DETR achieves 39.4% AP (versus 42.0% for the real-valued model) with a 6.6× speedup and an 8× smaller model (Xu et al., 2023). Fisher-aware critical-category mixed precision improves person mAP in Deformable DETR on COCO Panoptic from 28.9% (uniform 4-bit) to 43.1%. QAT combined with Fisher-trace regularization further raises person mAP from 35.56% to 35.75% while maintaining overall mAP at ≈37%. Ablation studies reveal that the combination of distribution alignment and foreground-aware query matching is essential for best accuracy, and that a carefully annealed Fisher-trace penalty is required to prevent overfitting on critical classes (Yang et al., 2024).
| Model | Quantization | Model Size (MB) | Ops (G) | AP | Speedup |
|---|---|---|---|---|---|
| DETR-R50 (FP32) | 32-32-32 | 159.3 | 85.5 | 42.0 | 1.0× |
| Q-DETR (4-4-8) | 4-4-8 | 19.9 | 13.0 | 39.4 | 6.6× |
| Q-DETR (3-3-8) | 3-3-8 | 15.0 | 7.6 | — | 11.2× |
In the hardest COCO Panoptic classes, Fisher-aware mixed-precision quantization and Fisher-trace QAT raise critical-class mAP by +10–14% absolute over uniform-precision quantization (Yang et al., 2024).
7. Best Practices and Research Significance
For practitioners, including a critical-class loss in the sensitivity metric is emphasized for tasks with label imbalance or application-specific objectives. Diagonal Fisher-based sensitivity estimation is advocated for scalability to large transformer models. A two-stage pipeline—first solving the Fisher ILP for bit-width selection, then running QAT with Fisher-trace regularization—is found to yield >90% of floating-point mAP in most scenarios while drastically reducing model size. Quantizing the final classification/regression heads only after QAT is discouraged; they should remain in floating point during PTQ and be trained under QAT.
These approaches collectively advance quantized object detection, enabling DETR-based models to operate on resource-constrained devices without severe deterioration of detection accuracy, especially for critical categories (Xu et al., 2023, Yang et al., 2024).