Integer-Arithmetic-Only Inference
- Integer-arithmetic-only inference is a quantization approach that maps real values to integers using scale and zero-point, ensuring efficient computation.
- It replaces floating-point operations with integer matrix multiplications, bit-shifts, and clamping, yielding significant speedups and memory reductions.
- This method supports applications in deep neural networks, tree ensembles, spiking neural networks, and logical reasoning on specialized hardware.
Integer-arithmetic-only inference refers to the complete elimination of floating-point operations during the deployment (inference) phase of machine learning models, logical inference engines, and probabilistic programs, restricting all computation to integer or fixed-point arithmetic. This paradigm is motivated by the need for efficient, deterministic, and portable inference on hardware platforms that either lack floating-point units (FPUs) or exhibit significant gains in speed, energy usage, or simplicity when restricted to integer operations. Integer-only inference methods are now foundational in quantized deep learning, tree-based models, spiking neural networks, logical/ontological reasoning, and probabilistic inference systems.
1. Fundamental Principles and Quantization
The core of integer-only inference is quantization, typically the affine mapping of a real value $r$ to an integer $q$ via a scale $S$ and a zero-point $Z$, $q = \operatorname{round}(r/S) + Z$,
with appropriate clamping to fit in int8 or int32 machine representations. During inference, model parameters (weights, biases, thresholds, feature values) and activations are stored and manipulated as integers, using only additions, multiplications, bit-shifts (for scaling or approximating division), comparisons, and clamping.
Quantization choices include:
- Symmetric vs. Asymmetric quantization (zero-point at 0 or nonzero)
- Per-tensor vs. per-channel scaling
- Static (calibration-based) vs. dynamic (data-dependent) scale/zero-point selection
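The affine mapping and a static min/max calibration can be sketched as follows (a minimal illustration, assuming NumPy and int8 targets; the function names are placeholders, not any particular library's API):

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine map real -> int8: q = clamp(round(x / scale) + zero_point)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Inverse map: x ~= scale * (q - zero_point)."""
    return scale * (q.astype(np.int32) - zero_point)

def calibrate(x, qmin=-128, qmax=127):
    """Static asymmetric calibration from observed min/max."""
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)  # range must contain 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

x = np.array([-0.4, 0.0, 0.7, 1.2], dtype=np.float32)
s, z = calibrate(x)
# Round-trip error is bounded by one quantization step:
assert np.max(np.abs(dequantize(quantize(x, s, z), s, z) - x)) <= s
```

Symmetric quantization is the special case $Z = 0$ with a range symmetric about zero; per-channel scaling simply applies a separate $(S, Z)$ per output channel.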
All significant deep learning integer inference techniques—such as those in MobileNets or ResNets—adopt this quantization approach for both weights and activations, yielding up to 8× reduction in model storage and 2–4× end-to-end speedups on integer-optimized CPUs and MCUs, with typical accuracy decreases of at most a few percentage points for classification (Jacob et al., 2017; Zhao et al., 2020).
2. Integer-only Inference in Deep Neural Networks
Convolutional and Feedforward Architectures
Integer-only CNN/MLP inference includes:
- Integer GEMM/Convolution: All matrix multiplications use int8×int8→int32 accumulations, fused with bias addition in int32, and then requantized to int8.
- Activation Functions: Standard ReLU is replaced by a bounded ReLU, $\mathrm{ReLU}_\alpha(x) = \min(\max(x, 0), \alpha)$, with the clipping bound $\alpha$ set using an empirical $n$-σ rule over float32 activation statistics to trade off quantization error against clipping. Post-convolutional values are thresholded accordingly (Zhao et al., 2020).
- Nonlinear Functions: For models with complex nonlinearities (GELU, Softmax, LayerNorm), integer-only polynomial or rational approximations, and integer-friendly Newton-Raphson loops, are deployed in int32, followed by requantization (Kim et al., 2021).
For each layer, scale/zero-point pairs propagate forward, dictating the rescaling of integer outputs and preventing stability issues. Residual and concatenation paths require synchronization of scales to support addition or branching (Zhao et al., 2020). BatchNorm is folded into adjacent weights prior to quantization.
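The int8 GEMM with int32 accumulation and shift-based requantization described above can be sketched as follows (an illustrative NumPy model of the integer pipeline, not a production kernel; the fixed-point `multiplier`/`shift` pair stands in for the precomputed scale ratio $S_x S_w / S_{out}$):

```python
import numpy as np

def requantize(acc32, multiplier, shift):
    """Approximate acc * (S_x * S_w / S_out) with an integer multiply and a
    rounding right shift -- no floating-point ops at inference time."""
    prod = acc32.astype(np.int64) * multiplier          # widen to avoid overflow
    rounded = (prod + (1 << (shift - 1))) >> shift      # round-to-nearest
    return np.clip(rounded, -128, 127).astype(np.int8)

def int8_linear(x_q, w_q, bias_q, multiplier, shift):
    """int8 x int8 -> int32 accumulation, fused int32 bias add, requantize to int8."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32) + bias_q
    return requantize(acc, multiplier, shift)

out = int8_linear(np.array([[10, 20]], np.int8),
                  np.array([[1, 2], [3, 4]], np.int8),
                  np.array([5, 5], np.int32), multiplier=1, shift=2)
assert out.tolist() == [[19, 26]]   # (x @ W + b) rescaled by 1/2^2, rounded
```

Real deployments fold the float scale ratio into `multiplier`/`shift` offline, so the inference path itself touches only integers.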
Efficient Integer Transformations
Specialized convolution algorithms—e.g., integer-only Winograd transforms—enable further reductions in computation and bit-width by leveraging the algebraic structure of convolution to minimize multiplications. Integer-based per-position down/up scaling keeps dynamic ranges within int8/int9, with negligible top-5 accuracy loss (Meng et al., 2019).
Transformers and Integer-Only Nonlinearities
For architectures like BERT or ViT, all linear projections operate over INT8, while nonlinear sublayers use dedicated integer-only schemes:
- Integer-GELU: Quadratic polynomial approximation of the GELU nonlinearity, fitted to the quantized domain.
- Integer-Softmax: Polynomial/rational approximations to $\exp(x)$ over subregions, with only bit-shifts for power-of-two scaling.
- Integer-LayerNorm: Integer mean/variance estimation, followed by Newton-like iterations for the integer square root.

Overall, this permits seamless pipeline execution on integer-only tensor-core or SIMD units, with empirical accuracy matching or even surpassing FP32 baselines and throughput gains of $2.4\times$ and above on T4 GPUs (Kim et al., 2021; Li et al., 2022).
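The integer square-root step used in integer-only LayerNorm can be sketched with a Newton iteration over plain integers (a generic floor-sqrt routine of the kind such schemes rely on, not a reproduction of any specific paper's kernel):

```python
def int_sqrt(n: int) -> int:
    """floor(sqrt(n)) via Newton's method, using only integer
    add, shift, and divide -- suitable for the std-dev step of
    an integer-only LayerNorm."""
    if n < 2:
        return n
    x = 1 << ((n.bit_length() + 1) // 2)   # initial guess >= sqrt(n)
    while True:
        y = (x + n // x) >> 1              # Newton update, integer-only
        if y >= x:
            return x                       # converged: x = floor(sqrt(n))
        x = y

assert int_sqrt(15) == 3 and int_sqrt(16) == 4
```

Because the initial guess overestimates the root, the iteration decreases monotonically and terminates at the exact integer floor.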
3. Integer-only Inference for Structured and Tree Models
Decision Trees and Random Forests
Integer-only inference for tree-based models comprises:
- Feature Comparisons: FLInt and equivalent techniques reinterpret IEEE-754 float32 thresholds as int32 via bitcast and sign-flip (XOR with $0x80000000$), ensuring that integer comparison of the transformed bit patterns agrees with the original floating-point comparison, with proofs of monotonic order preservation (Hakert et al., 2022).
- Leaf Value Accumulation: Class posteriors (for ensembles) are stored as fixed-point integers, and all accumulation/division is performed in uint32, avoiding all floating-point decoding and arithmetic (Bart et al., 21 May 2025).
- Code Generation: InTreeger compiles models to pure C with integer compare/branching and leaf scoring, supporting ARM, x86, and RISC-V targets. Significant cycle-count and energy savings are measured (up to 2.1× faster, 60% energy reduction) (Bart et al., 21 May 2025, Hakert et al., 2022).
This approach guarantees indistinguishable classification from floating-point baselines for practical ensemble sizes and resolves major embedded hardware constraints.
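The order-preserving bitcast can be illustrated in a few lines (a sketch of the general FLInt-style trick; note that full order preservation flips all bits for negative values and only the sign bit for non-negative ones):

```python
import struct

def float_as_ordered_int(x: float) -> int:
    """Reinterpret a float32 as a signed int so that integer comparison
    preserves the floating-point ordering."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))   # bitcast float32 -> uint32
    if bits & 0x80000000:
        key = bits ^ 0xFFFFFFFF          # negative: flip all bits
    else:
        key = bits | 0x80000000          # non-negative: set the sign bit
    return key - (1 << 31)               # shift into signed int32 range

vals = [-3.5, -0.25, 0.0, 0.125, 2.0]
keys = [float_as_ordered_int(v) for v in vals]
assert keys == sorted(keys)              # integer order matches float order
```

A tree compiled this way stores transformed int32 thresholds and performs every node comparison as a plain integer compare, so no FPU is touched at inference time.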
4. Integer Arithmetic in Non-Neural and Hybrid Models
Spiking Neural Networks (SNNs)
Integer-only inference is achieved by:
- Quantizing membrane potentials and synaptic weights into int8/12/16.
- Implementing all dynamic updates (leak, reset, compare) as integer shift/add/sub/mul, e.g., leak via right shift, threshold compare as an integer $\geq$ test, reset as a subtraction.
- Realizing weight quantization by shifts; all synaptic updates then furnish hardware-efficient sparse MAC operations.
Evaluation demonstrates minimal accuracy drop for 8/12-bit weights, 60%+ memory savings, and large energy improvements (e.g., int8 mul/add operations are 18–30× less energy-consuming than FP32) (Gomez et al., 8 Sep 2025).
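A single integer-only leaky integrate-and-fire update following the recipe above can be sketched as (an illustrative model with hypothetical parameter values, not any specific chip's update rule):

```python
def lif_step(v: int, in_current: int, leak_shift: int, threshold: int):
    """One integer-only LIF neuron update: leak via arithmetic right shift,
    integrate by integer add, spike on an integer >= compare,
    reset by subtracting the threshold. All state stays in int range."""
    v = v - (v >> leak_shift) + in_current   # leak (v * (1 - 2^-k)) + integrate
    spike = v >= threshold                   # integer threshold compare
    if spike:
        v -= threshold                       # reset by subtraction
    return v, spike

# Drive the neuron with a constant integer input current:
v, spikes = 0, []
for t in range(6):
    v, s = lif_step(v, in_current=40, leak_shift=3, threshold=100)
    spikes.append(s)
```

Here `leak_shift=3` models a leak factor of $1 - 2^{-3}$ per step, so the membrane decay costs one shift and one subtract instead of a floating-point multiply.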
Integer Logical and Probabilistic Inference
Integer arithmetic is exploited in logic-based systems (Bound Datalog) and PPLs:
- Bound Datalog: Integer-only rules and atoms allow efficient, polynomial-time fact entailment under type-consistency and strict arithmetic restrictions, enabling tractable knowledge reasoning with ontological extensions (Berent et al., 2022).
- Probabilistic Programs: Encoding integers as binary vectors and arithmetic as Boolean circuits enables knowledge compilation (e.g., to d-DNNF/SDD) and exact weighted model counting, providing tractable, scalable inference on complex integer-valued models (Cao et al., 2023).
This generalizes integer-only inference to domains outside neural ML, emphasizing broad relevance.
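The binary-vector encoding of integer arithmetic can be made concrete with a ripple-carry adder (a sketch of the circuit construction; in a real knowledge-compilation pipeline the bits would be symbolic Boolean variables rather than concrete booleans):

```python
def ripple_carry_add(a_bits, b_bits):
    """Integer addition as a Boolean circuit (ripple-carry adder).
    Bits are LSB-first; each step uses only XOR/AND/OR gates."""
    out, carry = [], False
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                     # sum bit
        carry = (a and b) or (carry and (a ^ b))      # carry-out
    out.append(carry)                                 # final carry bit
    return out

def to_bits(n, width):
    return [bool((n >> i) & 1) for i in range(width)]

def from_bits(bits):
    return sum(1 << i for i, b in enumerate(bits) if b)

assert from_bits(ripple_carry_add(to_bits(13, 8), to_bits(29, 8))) == 42
```

Compiling such circuits (together with the model's logical structure) to d-DNNF or SDD form is what makes exact weighted model counting over integer-valued variables tractable.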
5. Hardware, Implementation, and Empirical Results
Integer-only inference leverages specialized hardware features:
- CPU SIMD/DSP: Modern CPUs possess wide integer vector units; ARM NEON or RISC-V standard extensions offer efficient int8×int8→int32 computations. For fixed-point rescaling, bit-shift replaces division/multiplication by floating-point constants.
- Integer Tensor Cores: Hardware such as NVIDIA Turing provides dedicated INT8 matrix units, enabling integer kernels to outperform mixed-precision variants.
- Microcontrollers (MCUs): Integer-only models deploy natively on devices lacking FPUs, e.g., ARM Cortex-M or RISC-V MCUs, with cycle-count and code-size advantages (Bart et al., 21 May 2025).
- Memory/Latency: Across DNNs, SNNs, and tree ensembles, model memory decreases severalfold, and inference time is typically halved to quartered compared to FP32 (Jacob et al., 2017; Zhao et al., 2020; Kim et al., 2021; Gomez et al., 8 Sep 2025).
Representative results (from various works) for DNNs/transformers:
| Model | Accuracy Loss | Memory↓ | Inference Speedup |
|---|---|---|---|
| ResNet-18 (ImageNet) | <1.5 pp | ~8× | 2–4× |
| BERT-base (GLUE) | ≈0 pp | — | ≥2.4× |
| InTreeger RF (Shuttle) | 0 pp | — | up to 2.1× |
Accuracy drop is frequently less than 1.5 percentage points with quantization-aware fine-tuning.
6. Limitations and Open Directions
Integer-arithmetic-only inference is subject to known limitations:
- Bit-width sensitivity: Reducing activations or weights below 6–8 bits degrades accuracy, especially in deeper models or with complex non-linearities (Zhao et al., 2020, Gomez et al., 8 Sep 2025).
- Quantization Error and Outliers: Rare but large input activations or weights can force large scaling, reducing resolution for most data. Per-channel or per-group calibration and dynamic quantization can mitigate but incur code/complexity cost.
- Nonlinear Operations: Some nonlinearities lack simple integer approximations or require cascades of lookups/polynomials, introducing complexity and implementation overhead (Kim et al., 2021).
- Logic and PPL Domains: Some arithmetic fragments remain undecidable in logical systems if unconstrained (e.g., variable-variable multiplication), forcing tight syntactic restrictions for guaranteed completion (Berent et al., 2022).
- Tree-based Ensembles: For extremely small probabilities or very large ensembles, integer aliasing may be possible, though practical impact is negligible for practically sized ensembles (Bart et al., 21 May 2025).
Directions for future work include hardware co-design for primitive integer operations in specialized accelerators, dynamic or learned bit allocation, hybrid integer/fixed-point logic for challenging nonlinearities, and extending knowledge compilation/inference pipelines for richer integer arithmetics.
7. Practical Guidelines and Use Cases
Applying integer-only inference requires:
- Careful quantization-aware training or post-training calibration, especially for activations.
- Use of integer-kernels/numerical recipes for all dense, convolutional, and nonlinear layers.
- Ensuring accumulator width suffices to prevent overflow (typically INT32, up to 26–32 effective bits).
- Hardware deployment matching: match model scale and bit-width to target device integer datapath and on-board memory.
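The accumulator-width check above follows from a simple worst-case bound, which can be sketched as (an illustrative sizing rule under the assumption of symmetric signed operands):

```python
import math

def accumulator_bits(k, a_bits=8, w_bits=8, signed=True):
    """Worst-case bit-width needed to accumulate k products of
    a_bits x w_bits integer operands without overflow."""
    a_max = (1 << (a_bits - 1)) if signed else (1 << a_bits) - 1
    w_max = (1 << (w_bits - 1)) if signed else (1 << w_bits) - 1
    # k products of magnitude <= a_max * w_max, plus a sign bit if signed:
    return math.ceil(math.log2(k * a_max * w_max)) + (1 if signed else 0)

# Even a 4096-long int8 dot product fits comfortably in an int32 accumulator:
assert accumulator_bits(4096) <= 32
```

This is why int8×int8 accumulation into int32 is safe for all practical layer widths, consistent with the 26–32 effective bits noted above.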
Principal use cases include:
- On-device inference for embedded/IoT with strict power/memory/determinism requirements,
- Efficient deployment of neural and tree models on CPUs/GPUs with high-throughput integer units,
- Efficient robust inference under quantization constraints,
- Integration in logical reasoning engines and probabilistic programming to support scalable, discrete probabilistic inference (Bart et al., 21 May 2025, Gomez et al., 8 Sep 2025, Lin et al., 2021, Cao et al., 2023).
Integer-arithmetic-only inference defines a rigorous, hardware-portable and empirically robust framework for efficient AI deployments across modern computing platforms.