Q-DiT: PTQ for Diffusion Transformers
- Q-DiT is a post-training quantization framework tailored for Diffusion Transformers to achieve efficient low-precision inference.
- It employs group-wise evolutionary search and dynamic activation quantization to address pronounced channel and temporal variance.
- Q-DiT achieves superior 6-/4-bit quantization performance, reducing model size while largely preserving image generation quality.
Q-DiT refers to "Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers"—a post-training quantization (PTQ) framework tailored specifically for Diffusion Transformers (DiTs), targeting efficient deployment under severe memory and computational constraints. Its design directly addresses architectural properties unique to DiTs, notably pronounced channel-wise and temporal variance in both weights and activations. Q-DiT introduces automatic quantization granularity assignment and dynamic, sample-wise activation quantization, establishing new benchmarks in 6-/4-bit quantization for state-of-the-art generative models (Chen et al., 2024).
1. Quantization Challenges in Diffusion Transformers
Quantizing DiTs is significantly more challenging than UNet-based diffusion architectures due to two forms of variance:
- Spatial (input-channel) variance: In DiTs, variance across input channels in weight matrices is often an order of magnitude larger than across output channels. This manifests as certain input channels—predominantly in transformer QKV projections—dominating the dynamic range, which renders standard per-tensor or per-output-channel quantization suboptimal. If the quantization scale is set to accommodate the large-magnitude channels, the remaining channels suffer precision loss due to coarse discretization.
- Temporal variance in activations: During sampling, activations within diffusion models exhibit strong dynamic-range shifts as a function of the denoising timestep t. Outlier spikes and magnitude distributions evolve markedly between early and late sampling steps. Any static or averaged quantization scale cannot capture this variance, resulting in either overflow/clipping or under-utilized quantization levels for many timesteps.
The combined effect is that conventional PTQ frameworks, tuned for uniform or per-tensor scale allocation, exhibit severe performance degradation (catastrophic FID increase, visual artifacts) when directly applied to DiTs.
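The channel-variance problem above can be demonstrated numerically. The sketch below (illustrative only; the outlier magnitude and shapes are assumptions, not figures from the paper) quantizes a toy weight matrix with one dominant input channel, first with a single per-tensor scale and then with per-group scales along the input-channel axis, and compares the mean reconstruction error:

```python
import numpy as np

# Toy weight matrix: one "outlier" input channel is 50x larger than the
# rest, mimicking the channel variance reported for DiT QKV projections.
rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(8, 64))
W[:, 3] *= 50.0  # dominant input channel

def quantize_per_tensor(w, bits=4):
    # Symmetric uniform quantization with one scale for the whole tensor.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def quantize_per_group(w, group=16, bits=4):
    # One scale per contiguous group of input channels (columns).
    out = np.empty_like(w)
    for s in range(0, w.shape[1], group):
        out[:, s:s + group] = quantize_per_tensor(w[:, s:s + group], bits)
    return out

err_tensor = np.abs(W - quantize_per_tensor(W)).mean()
err_group = np.abs(W - quantize_per_group(W)).mean()
```

With the per-tensor scale stretched to cover the outlier channel, most small-magnitude weights round to zero; the group-wise scheme isolates the outlier inside its own group and quantizes the remaining groups with much finer steps.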
2. Automatic Quantization Granularity Allocation
Q-DiT introduces a fine-grained, group-wise quantization method, where the per-layer quantization granularity is tuned to match each transformer's channel structure:
- Partitioning along input channels: For a given weight matrix W, input channels are partitioned into groups of size g. Within each group, min-max statistics are computed to derive a scale and zero-point for uniform quantization.
- Group size selection as a hyperparameter: The group size g is not fixed globally. Instead, it is selected per layer from a set of candidate group sizes, balancing fidelity against compute/memory cost.
- Evolutionary search: Q-DiT employs an FID-guided search over group-size vectors across all layers, jointly optimizing final model quality (FID) under a given BitOps constraint. The population is evolved through crossover and mutation, prioritizing candidates yielding the best FID subject to the global compute/memory target.
This approach enables Q-DiT to avoid both over-aggressive quantization (which increases error for high-dynamic-range channels) and unnecessary granularity (which would negate memory/computation savings).
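The search loop can be sketched as follows. This is a toy illustration of an FID-guided evolutionary search under a BitOps-style budget, not the authors' implementation: `evaluate_fid` and `bitops` stand in for the real (expensive) calibration-set evaluation and cost model, and the candidate group sizes are assumed values.

```python
import random

random.seed(0)
CANDIDATES = [32, 64, 128]  # assumed candidate group sizes
N_LAYERS = 4                # toy model depth

def bitops(config):
    # Toy cost model: smaller groups mean more scales, hence higher cost.
    return sum(1.0 / g for g in config)

def evaluate_fid(config):
    # Placeholder: the real search quantizes the model with `config` and
    # measures FID on generated samples. Here: finer groups -> lower FID.
    return sum(config) / 100.0

def search(budget, generations=20, pop_size=8):
    pop = [[random.choice(CANDIDATES) for _ in range(N_LAYERS)]
           for _ in range(pop_size)]
    pop[0] = [max(CANDIDATES)] * N_LAYERS  # guaranteed-feasible seed
    for _ in range(generations):
        feasible = [c for c in pop if bitops(c) <= budget]
        feasible.sort(key=evaluate_fid)
        parents = feasible[:pop_size // 2]
        if len(parents) < 2:
            parents = (parents + pop)[:2]
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_LAYERS)   # single-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(N_LAYERS)        # point mutation
            child[i] = random.choice(CANDIDATES)
            children.append(child)
        pop = parents + children                  # elitism keeps the best
    feasible = [c for c in pop if bitops(c) <= budget]
    return min(feasible, key=evaluate_fid) if feasible else None

best = search(budget=0.1)
```

The elitist population update preserves the best feasible candidates across generations, so the search monotonically improves FID while never exceeding the compute/memory target.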
3. Sample-wise Dynamic Activation Quantization
To tackle step-dependent activation variance, Q-DiT uses adaptive, sample-wise activation quantization at inference:
- Dynamic min-max scaling: For each group in every activation tensor, at each denoising step and for each sample, the scale (max absolute value) is recomputed just before quantization. This ensures that quantization adapts to the specific activation distribution encountered at each step and for each example.
- Group-wise quantization symmetry: Activation groups are aligned to the grouping used in weight quantization, minimizing mismatch in activation–weight pairing errors.
- Minimal compute overhead: The dynamic scale computation is fused into preceding matrix multiplication or normalization operations, with negligible per-inference cost relative to the overall transformer computation.
This method ensures that dynamic range and quantization precision are always appropriate to the instantaneous activation statistics, directly reducing overflows and under-representation for time-evolving diffusion layers.
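A minimal sketch of the sample-wise dynamic scheme described above, assuming symmetric quantization with a per-group max-abs scale recomputed on every forward pass (function name and defaults are illustrative, not from the paper):

```python
import numpy as np

def quantize_activation_dynamic(x, group_size=64, bits=8):
    """Per-call (sample-wise) symmetric fake quantization: the scale of
    each group is its current max-abs value, recomputed at every
    denoising step so it tracks the instantaneous activation statistics.
    Sketch only, not the authors' implementation."""
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(x)
    for s in range(0, x.shape[-1], group_size):
        g = x[..., s:s + group_size]
        alpha = np.abs(g).max() or 1.0  # dynamic min-max scale
        q = np.clip(np.round(g / alpha * qmax), -qmax, qmax)
        out[..., s:s + group_size] = q / qmax * alpha  # dequantize
    return out
```

Because alpha is recomputed per call, no timestep- or sample-dependent scales need to be calibrated or stored, matching the "fully dynamic" property noted in the implementation details.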
4. End-to-End Quantization Workflow
Q-DiT structures quantization through the following pipeline:
- Calibration and group-search phase
- A small calibration set (e.g., 512 images) is assembled.
- The baseline model is quantized using candidate group-size configurations. FID is measured on generated samples, and the best configuration (the optimal per-layer group-size vector) is selected.
- Weight quantization
- Each layer’s weights are quantized in groups of its selected size g, with per-group scales and zero-points computed using min/max statistics and GPTQ-style second-order rounding for error minimization.
- Activation quantization (inference phase)
- For each sample, denoising step, and activation group, the scale (max absolute value) is calculated on-the-fly.
- Activations are quantized and dequantized immediately prior to subsequent layer computations.
- Sampling/generation
Q-DiT provides the following high-level inference logic:
```python
for sample in batch:
    for t in denoising_steps:
        for layer in transformer_layers:
            for group in layer.groups:
                # Dynamic quantization scale per group
                alpha = compute_max_abs(activation[group])
                quant_act = quantize(activation[group], alpha)
                dequant_act = dequantize(quant_act, alpha)
            # Proceed to next layer using quantized activations
```
5. Experimental Results and Empirical Gains
Q-DiT has been benchmarked predominantly on DiT-XL/2 (roughly 675M parameters) for image generation tasks (ImageNet 256×256 and 512×512):
| Bit-width | Method | Model Size | FID (↓) | sFID (↓) | IS (↑) | Precision (↑) |
|---|---|---|---|---|---|---|
| 16/16 | FP16 | 1349 MB | 12.40 | 19.11 | 116.68 | 0.6605 |
| 6/8 | GPTQ | 690 MB | 14.12 | 21.00 | 108.5 | 0.62 |
| 6/8 | Q-DiT | 683 MB | 12.10 | 19.02 | 115.2 | 0.66 |
| 4/8 | GPTQ | 351 MB | 25.48 | 25.57 | 73.46 | 0.539 |
| 4/8 | Q-DiT | 347 MB | 15.76 | 19.84 | 98.78 | 0.640 |
At W6A8, Q-DiT reaches FID 12.10, slightly below the FP16 baseline's 12.40 (ΔFID ≈ −0.3), whereas GPTQ degrades by roughly 1.7 FID; at W4A8 it outperforms the GPTQ baseline by nearly 10 FID points. Similar trends hold at higher resolutions and in more challenging settings, with Q-DiT achieving 2×–4× model-size reductions at minimal fidelity loss (Chen et al., 2024).
Ablations reveal that group-wise quantization alone is insufficient—dynamic activation quantization provides an additional significant quality advantage.
6. Implementation Details and Practical Recommendations
- Architecture compatibility: Q-DiT is implemented for the ViT-style DiT-XL/2 with patch size 2, hidden dimension 1152, and 28 transformer blocks; all MLP and attention weights are quantized with layer-wise, group-optimized scales.
- Calibration size: Only 512 diverse calibration images are necessary; no end-to-end task-specific fine-tuning is required.
- Activation quantization: Done fully dynamically; there is no need to precompute or store timestep-dependent or sample-dependent scales.
- Efficiency: Memory is reduced by 2× (W8A8) to 4× (W4A8) relative to FP16, and integer matrix multiply throughput is fully leveraged for inference acceleration.
- Extensions: The method is straightforward to adapt to other diffusion architectures with temporally varying statistics or to video DiTs (with only slight adjustments).
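The memory figures above can be sanity-checked with a back-of-the-envelope storage estimate. The accounting below (quantized weights plus one 16-bit scale and zero-point per group) is an illustrative assumption, not the paper's exact bookkeeping, and the 675M parameter count is approximate:

```python
def quantized_size_mb(n_params, weight_bits, group_size=None, scale_bits=16):
    """Rough model-size estimate: packed quantized weights plus one scale
    and one zero-point per group (assumed stored at scale_bits each)."""
    total_bytes = n_params * weight_bits / 8
    if group_size:
        total_bytes += (n_params / group_size) * 2 * scale_bits / 8
    return total_bytes / 2 ** 20

fp16_mb = quantized_size_mb(675e6, 16)               # no group overhead
w4_mb = quantized_size_mb(675e6, 4, group_size=128)  # W4 with group scales
```

Both estimates land close to the table's 1349 MB and 347 MB figures, and the ratio confirms the stated ~4× reduction for W4A8 (per-group metadata eats slightly into the ideal 4:1 compression).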
7. Comparative Position and Limitations
Relative to other DiT quantization frameworks (including PTQ4DiT, TQ-DiT, ViDiT-Q, TaQ-DiT, HQ-DiT, LRQ-DiT):
- Q-DiT is distinguished by its evolutionary, FID-minimizing group assignment (no fixed heuristic), and its pure sample-wise dynamic quantization for activations (versus per-timestep or per-chunk calibration).
- Unlike methods using log-domain or adaptive rotations (e.g., TLQ, ARS), Q-DiT relies purely on group-wise uniform scales, yielding hardware simplicity.
- Limitations include: exclusive focus on per-group uniform quantization (no log or non-uniform quantization), and no fine-tuned mixed-precision or per-layer exception handling.
Q-DiT does not yet report extension to pure video DiTs with temporally or spatially structured attention, but the authors suggest only minor adaptation would be required.
Key attributes: group-wise scale assignment optimized by image-level FID, dynamic quantization scale on a per-sample and per-group basis, full integration into the standard DDIM/DDPM transformer-based diffusion workflow, and demonstrated best-in-class quantized DiT performance at low precision (Chen et al., 2024).