Optimal precision allocation for deterministic quantization

Determine the optimal per-layer precision allocation for mixed-precision integer-arithmetic quantization in transformer inference (for example, assigning INT16 to attention mechanisms and INT8 to feed-forward layers), and characterize how the required precision interacts with context length. The goal is to minimize calibration degradation while preserving deterministic inference guarantees.
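The search space can be illustrated with a toy brute-force allocator: pick INT8 or INT16 per layer type to minimize a degradation score under a memory budget. Everything here is a hypothetical sketch — the layer types, byte costs, and degradation numbers are placeholders (e.g., as one might estimate from a calibration set), not values from the paper.

```python
# Hypothetical sketch: exhaustive search over per-layer-type precision
# choices under a memory budget. All cost numbers are made-up placeholders.
from itertools import product

LAYER_TYPES = ["attention", "feed_forward"]
BYTES = {"INT8": 1, "INT16": 2}  # bytes per weight at each precision
# Placeholder degradation estimates (lower is better), e.g. measured
# against a float reference on a calibration set.
DEGRADATION = {
    ("attention", "INT8"): 0.9, ("attention", "INT16"): 0.2,
    ("feed_forward", "INT8"): 0.3, ("feed_forward", "INT16"): 0.1,
}

def best_allocation(budget_bytes_per_param: float):
    """Return the (allocation, cost) pair with minimal total degradation
    whose average bytes-per-parameter fits within the budget."""
    best = None
    for choice in product(["INT8", "INT16"], repeat=len(LAYER_TYPES)):
        alloc = dict(zip(LAYER_TYPES, choice))
        mem = sum(BYTES[p] for p in alloc.values()) / len(alloc)
        cost = sum(DEGRADATION[(lt, p)] for lt, p in alloc.items())
        if mem <= budget_bytes_per_param and (best is None or cost < best[1]):
            best = (alloc, cost)
    return best
```

With these placeholder costs and a budget of 1.5 bytes per parameter, the search recovers the allocation the paper suggests (INT16 attention, INT8 feed-forward); a real solution would need a principled degradation model rather than a lookup table, which is exactly what the open problem asks for.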

Background

The paper implements deterministic transformer inference using integer arithmetic with INT8 weights and Q16 fixed-point activations, achieving bitwise-identical outputs across platforms. However, INT8 quantization introduces calibration degradation, especially in attention computations, where small dot-product differences can change token selection. The authors propose mixed-precision schemes (e.g., INT16 for attention, INT8 for feed-forward layers) as a path to narrowing this gap while maintaining determinism.
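A minimal sketch of the arithmetic style involved, assuming "Q16" denotes fixed point with 16 fractional bits and symmetric INT8 weight quantization (assumptions on my part; the paper's exact formats may differ). Because every step is exact integer arithmetic, the result is bitwise identical on any platform:

```python
# Sketch of deterministic integer inference arithmetic (illustrative only).
Q = 16  # assumed: Q16 = 16 fractional bits

def to_q16(x: float) -> int:
    """Quantize a float activation to Q16 fixed point."""
    return round(x * (1 << Q))

def from_q16(x: int) -> float:
    """Convert Q16 fixed point back to float (for inspection only)."""
    return x / (1 << Q)

def quantize_int8(w: float, scale: float) -> int:
    """Symmetric INT8 weight quantization: w ~= q * scale, q in [-127, 127]."""
    return max(-127, min(127, round(w / scale)))

def int_dot(acts_q16: list[int], weights_i8: list[int],
            w_scale_q16: int) -> int:
    """Deterministic dot product: integer multiply-accumulate, then a single
    rescale by the weight scale (itself stored in Q16) and a right shift.
    Python ints are arbitrary precision; on hardware the accumulator would
    be a 64-bit integer."""
    acc = 0
    for a, w in zip(acts_q16, weights_i8):
        acc += a * w  # exact integer MAC
    return (acc * w_scale_q16) >> Q  # result back in Q16
```

For example, an activation of 1.0 against a weight quantized as 64 at scale 0.01 (i.e., ~0.64) yields a Q16 result close to 0.64; the small residual error comes from rounding the scale into Q16, which is precisely the kind of degradation the mixed-precision question targets.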

They explicitly identify as open the questions of how to distribute precision across layers and how the optimal allocation depends on context length. Solving these would directly benefit production deployments by enabling better quality-speed-memory trade-offs under the determinism constraint.
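The context-length interaction can be made concrete with a synthetic experiment (my illustration, not the paper's method): quantize a vector of attention scores and measure how often the argmax — a stand-in for token selection — flips relative to the unquantized scores. With more candidate positions, near-ties become more likely, so coarser quantization flips the winner more often:

```python
# Illustrative experiment: argmax-flip rate of quantized attention scores
# as a function of context length and bit width. Synthetic uniform scores.
import random

def quantize(x: float, bits: int, scale: float) -> float:
    """Symmetric uniform quantization to a signed `bits`-bit grid."""
    qmax = (1 << (bits - 1)) - 1
    return max(-qmax, min(qmax, round(x / scale))) * scale

def argmax_flip_rate(ctx_len: int, bits: int, trials: int = 500,
                     seed: int = 0) -> float:
    """Fraction of trials in which quantization changes the argmax
    over `ctx_len` synthetic scores drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    scale = 2.0 / ((1 << (bits - 1)) - 1)  # scores assumed in [-1, 1]
    flips = 0
    for _ in range(trials):
        scores = [rng.uniform(-1.0, 1.0) for _ in range(ctx_len)]
        ref = max(range(ctx_len), key=scores.__getitem__)
        qscores = [quantize(s, bits, scale) for s in scores]
        if max(range(ctx_len), key=qscores.__getitem__) != ref:
            flips += 1
    return flips / trials
```

On this toy model the flip rate grows with context length at fixed bit width and drops sharply from 8 to 16 bits, which matches the intuition that longer contexts demand more precision in attention; characterizing that trade-off rigorously for real models is the open problem.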

References

The optimal precision allocation per layer type and the interaction between precision and context length are open optimization problems with direct impact on production deployment.

On the Foundations of Trustworthy Artificial Intelligence (2603.24904 - Dunham, 26 Mar 2026) in Section 12. Open Problems — Higher-precision deterministic quantization