Residue Number System Winograd
- Residue Number System (RNS) Winograd is a reformulation of the classical Winograd convolution, embedding modular arithmetic and CRT reconstruction to enable efficient low-precision integer computations.
- It translates conventional Winograd transforms into modular operations using 8–16 bit arithmetic, supporting larger transform tiles without sacrificing model accuracy.
- Empirical evaluations show significant throughput improvements on modern hardware by reducing multiplication counts and mitigating floating-point instability.
Residue Number System (RNS) Winograd convolution is a reformulation of the classic Winograd minimal-filtering convolution algorithm for efficient, exact, low-precision integer computation, leveraging the parallel, carry-free properties of the Residue Number System. Motivated by the limitations of floating-point Winograd implementations for low-bitwidth convolutional neural network (CNN) inference, RNS-Winograd enables large transform tiles (e.g., 10×10 to 16×16) and near-minimal arithmetic complexity with 8–16 bit integer operations and no sacrifice in prediction accuracy. By embedding all Winograd steps into modular arithmetic over several small, pairwise-coprime moduli, and reconstructing full-precision results via the Chinese Remainder Theorem (CRT), RNS-Winograd achieves substantial throughput improvements on modern low-precision hardware (Liu et al., 2020).
1. The Classic Winograd Minimal-Filtering Algorithm
Winograd minimal filtering, denoted $F(m \times m, r \times r)$, computes the convolution of an $r \times r$ filter with an $n \times n$ input patch to yield an $m \times m$ output tile. Let $n = m + r - 1$. The Winograd convolution in matrix terms is:

$$Y = A^\top \left[ (G g G^\top) \odot (B^\top d B) \right] A$$

where $G$ (size $n \times r$) and $B$ (size $n \times n$) are the filter and data transform matrices, $A$ (size $n \times m$) the inverse transform matrix, and $\odot$ denotes element-wise multiplication.
Operationally:
- Filters are transformed: $U = G g G^\top$.
- Inputs are transformed: $V = B^\top d B$.
- Element-wise multiplication: $Z = U \odot V$.
- The result is inverse transformed: $Y = A^\top Z A$.
Winograd reduces arithmetic cost. Direct convolution requires $m^2 r^2$ multiplications per output tile, Winograd just $(m + r - 1)^2$, yielding a reduction factor of $m^2 r^2 / (m + r - 1)^2$. For example, $F(2 \times 2, 3 \times 3)$ gives a $36/16 = 2.25\times$ reduction.
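As a concrete check, the $F(2\times2, 3\times3)$ case above can be verified against direct convolution. A minimal NumPy sketch, using the standard Lavin–Gray transform constants (the constants and function names are illustrative, not taken from the source):

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transform matrices (Lavin-Gray constants).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """4x4 input patch d, 3x3 filter g -> 2x2 output tile."""
    U = G @ g @ G.T              # filter transform (precomputable per filter)
    V = BT @ d @ BT.T            # data transform
    return AT @ (U * V) @ AT.T   # Hadamard product, then inverse transform

def direct_tile(d, g):
    """Reference: direct valid cross-correlation."""
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                     for i in range(2)])

rng = np.random.default_rng(0)
d = rng.integers(-8, 9, (4, 4)).astype(float)
g = rng.integers(-8, 9, (3, 3)).astype(float)
assert np.allclose(winograd_2x2_3x3(d, g), direct_tile(d, g))
```

Note that the filter transform uses 16 multiplications here versus 36 for the direct tile, matching the $2.25\times$ reduction.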
2. Residue Number System (RNS) and CRT Reconstruction
An RNS is defined by pairwise-coprime moduli $m_1, m_2, \ldots, m_k$ with total dynamic range $M = \prod_{i=1}^{k} m_i$. Each integer $x \in [0, M)$ is represented by its residues $(x_1, \ldots, x_k)$, where $x_i = x \bmod m_i$.
Arithmetic in RNS is performed independently in each channel (modulus):
- Addition/multiplication: $(x + y)_i = (x_i + y_i) \bmod m_i$, $(x \cdot y)_i = (x_i \cdot y_i) \bmod m_i$.
CRT reconstruction retrieves $x$ from its residues: for each $i$, let $M_i = M / m_i$ and $N_i = M_i^{-1} \bmod m_i$. Then,

$$x = \left( \sum_{i=1}^{k} x_i N_i M_i \right) \bmod M.$$
This enables fully parallel, carry-free, low-precision computation suitable for quantized neural network inference.
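A minimal sketch of RNS encoding, carry-free channel-wise multiplication, and CRT reconstruction (the base {253, 255, 256} is an illustrative pairwise-coprime set of 8-bit moduli, not prescribed by the source):

```python
from math import prod

MODULI = [253, 255, 256]   # pairwise coprime: 11*23, 3*5*17, 2^8
M = prod(MODULI)           # total dynamic range, about 2^24

def to_rns(x):
    """Encode an integer as its residues in each channel."""
    return [x % m for m in MODULI]

def crt(residues):
    """Reconstruct x in [0, M) from its residues via the CRT formula."""
    total = 0
    for x_i, m_i in zip(residues, MODULI):
        M_i = M // m_i
        N_i = pow(M_i, -1, m_i)   # inverse exists since moduli are coprime
        total += x_i * N_i * M_i
    return total % M

x, y = 123456, 7890
# carry-free channel-wise multiply, then a single reconstruction
prod_res = [(a * b) % m for a, b, m in zip(to_rns(x), to_rns(y), MODULI)]
assert crt(to_rns(x)) == x
assert crt(prod_res) == (x * y) % M
```

Each channel only ever touches values below its (8-bit) modulus, which is what makes the per-channel arithmetic suitable for narrow integer hardware.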
3. Mapping Winograd Transforms to RNS
To embed Winograd into RNS, all transform matrices ($A$, $B$, $G$) must be expressed in integer or modular form. Rational entries are cleared via the least common multiple $c$ of their denominators (e.g., $\tilde{G} = c\,G$ with integer entries). Each modulus $m_i$ must be coprime to $c$ to permit modular inversion:

$$G_{(i)} = \left( c^{-1} \bmod m_i \right) \tilde{G} \bmod m_i$$

- Analogously for $A$ and $B$.
This process ensures all per-modulus transforms can be conducted with 8- or 16-bit integer arithmetic within each channel, so the full Winograd procedure is mapped to efficient, low-precision modular operations.
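For $F(2\times2, 3\times3)$, only $G$ has rational entries (halves), so $c = 2$ clears them and the squared factor $c^2 = 4$ can be divided out modularly at the end. A sketch under these assumptions, checking a single residue channel against direct convolution (the modulus 251 is illustrative):

```python
import numpy as np

m = 251   # odd modulus, coprime to the scale c = 2 (and hence to c^2 = 4)

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.int64)
Gt = np.array([[2, 0, 0], [1, 1, 1],
               [1, -1, 1], [0, 0, 2]], dtype=np.int64)   # 2*G: all-integer
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.int64)

def winograd_mod(d, g, m):
    """One residue channel: the 2x2 output tile mod m, computed exactly."""
    U = (Gt @ g @ Gt.T) % m              # equals 4*(G g G^T) mod m
    V = (BT @ d @ BT.T) % m              # B^T is already integer
    Y4 = (AT @ (U * V % m) @ AT.T) % m   # equals 4*Y mod m
    return (Y4 * pow(4, -1, m)) % m      # divide out the scale modularly

rng = np.random.default_rng(1)
d = rng.integers(-128, 128, (4, 4), dtype=np.int64)
g = rng.integers(-128, 128, (3, 3), dtype=np.int64)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.array_equal(winograd_mod(d, g, m), direct % m)
```

Because the true output is an integer, dividing by the scale via a modular inverse is exact, not an approximation.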
4. End-to-End RNS–Winograd Convolution Process
Given quantized INT8 filters $g$, activations $d$, and RNS base $\{m_1, \ldots, m_k\}$, the algorithm proceeds channel-wise as follows:
- Forward Filter Transform (per residue channel $i$): $U_{(i)} = \left( G_{(i)}\, g\, G_{(i)}^\top \right) \bmod m_i$. This can be precomputed once per filter.
- Forward Data Transform: $V_{(i)} = \left( B_{(i)}^\top\, d\, B_{(i)} \right) \bmod m_i$
- Pointwise Multiplication: $Z_{(i)} = \left( U_{(i)} \odot V_{(i)} \right) \bmod m_i$
- Inverse Output Transform: $Y_{(i)} = \left( A_{(i)}^\top\, Z_{(i)}\, A_{(i)} \right) \bmod m_i$
- CRT Reconstruction: For each output element, recover $y = \left( \sum_{i=1}^{k} Y_{(i)} N_i M_i \right) \bmod M$, with $M_i = M / m_i$ and $N_i = M_i^{-1} \bmod m_i$.
The forward data transform is highly amortized, as each transformed input patch is reused across all output channels (filters). Depthwise summations and pointwise multiplies can leverage optimized INT8/INT16 GEMM kernels at each modulus (Liu et al., 2020).
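The steps above can be combined into a compact end-to-end sketch for a single $F(2\times2, 3\times3)$ tile over three odd 8-bit moduli. The base {251, 253, 255} (pairwise coprime and coprime to the scale factor 4) and all names are illustrative assumptions, not constants from the source:

```python
import numpy as np
from math import prod

MODULI = [251, 253, 255]   # odd, pairwise coprime, each fits in 8 bits
M = prod(MODULI)           # ~2^24 total dynamic range

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.int64)
Gt = np.array([[2, 0, 0], [1, 1, 1],
               [1, -1, 1], [0, 0, 2]], dtype=np.int64)   # 2*G, integer
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.int64)

def rns_winograd_tile(d, g):
    """INT8 4x4 patch d, 3x3 filter g -> exact signed 2x2 output tile."""
    acc = np.zeros((2, 2), dtype=np.int64)
    for m in MODULI:
        U = (Gt @ (g % m) @ Gt.T) % m          # filter transform, channel m
        V = (BT @ (d % m) @ BT.T) % m          # data transform, channel m
        Y4 = (AT @ (U * V % m) @ AT.T) % m     # 4*Y mod m
        Y_m = (Y4 * pow(4, -1, m)) % m         # residue of the true output
        M_i = M // m
        acc += Y_m * (pow(M_i, -1, m) * M_i)   # CRT accumulation
    out = acc % M
    return np.where(out > M // 2, out - M, out)  # recenter to signed range

rng = np.random.default_rng(2)
d = rng.integers(-128, 128, (4, 4), dtype=np.int64)
g = rng.integers(-128, 128, (3, 3), dtype=np.int64)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.array_equal(rns_winograd_tile(d, g), direct)
```

The recentering step works because the worst-case INT8 output magnitude ($9 \cdot 127 \cdot 128 \approx 2^{17}$) is well within half the dynamic range $M/2 \approx 2^{23}$.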
5. Complexity Reduction and Arithmetic Analysis
For $m \times m$ output tiles, $r \times r$ filters, tile size $n = m + r - 1$, and $k$ RNS channels, when transform and CRT costs are amortized, the dominant cost is $k\,n^2$ multiplies per output tile, compared to $m^2 r^2$ for direct convolution. The theoretical speedup is:

$$\text{speedup} = \frac{m^2 r^2}{k\,(m + r - 1)^2}$$
Representative reduction factors, computed from the speedup formula (assuming $k = 3$ residue channels for 8-bit moduli and $k = 2$ for 16-bit), are shown in the following table:

| Filter $r$ | Tile $m$ | Reduction (8-bit, $k=3$) | Reduction (16-bit, $k=2$) |
|---|---|---|---|
| $3$ | $4$ | $1.33\times$ | $2.00\times$ |
| $3$ | $6$ | $1.69\times$ | $2.53\times$ |
| $3$ | $12$ | $2.20\times$ | $3.31\times$ |
| $5$ | $10$ | $4.25\times$ | $6.38\times$ |
| $5$ | $12$ | $4.69\times$ | $7.03\times$ |
In practice, the moduli bit-width is selected in the range $8$–$16$ to balance transform cost, reconstruction overhead, and GEMM efficiency.
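These reduction factors follow directly from the speedup formula; a small sketch (the channel counts $k=3$ at 8-bit and $k=2$ at 16-bit are assumptions, as above):

```python
def reduction(m, r, k):
    """Multiplies per output tile: direct m^2 r^2 vs RNS-Winograd k(m+r-1)^2."""
    return (m * m * r * r) / (k * (m + r - 1) ** 2)

for r, m in [(3, 4), (3, 6), (3, 12), (5, 10), (5, 12)]:
    print(f"r={r}, m={m:>2}: "
          f"8-bit {reduction(m, r, 3):.2f}x, 16-bit {reduction(m, r, 2):.2f}x")
```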
6. Empirical Performance and Accuracy
Experiments confirm that RNS–Winograd delivers significant acceleration without impairing model accuracy:
- On an Arm Cortex-A73 CPU, 8-bit RNS–Winograd with large tiles and $3 \times 3$ filters yields a substantial speedup over an INT8 im2col+GEMM baseline for VGG16, with no loss in ImageNet Top-1 accuracy (71.4%).
- 16-bit RNS–Winograd likewise outperforms an INT16 baseline.
- For $5 \times 5$ filters (Inception-v3), 8-bit RNS–Winograd achieves an even larger speedup, consistent with the greater arithmetic reduction for larger kernels.
- Transform and CRT/mixed-radix reconstruction overheads remain a modest fraction of total runtime (Liu et al., 2020).
- Arithmetic reduction for large tiles and filters reaches several-fold over direct convolution.
Empirical validation across VGG16, ResNet50, and Inception (v1/v3) shows no observable top-1 accuracy loss.
7. Implementation Guidelines and Efficacy
RNS–Winograd eliminates numerical instability inherent in large-tile Winograd with FP32 by confining all computation to integer modular domains. The approach is well-suited for modern accelerators supporting wide low-precision GEMMs and parallel SIMD execution. Implementation steps are:
- Select an appropriate Winograd tile $F(m \times m, r \times r)$ and RNS moduli $\{m_i\}$, ensuring the product $M = \prod_i m_i$ covers the required dynamic range and each $m_i$ is coprime with the scale factor $c$ that clears the transform denominators.
- Precompute the transform matrices $G_{(i)}$, $B_{(i)}$, $A_{(i)}$ modulo each $m_i$.
- For each layer: pre-transform filters, process input patches, multiply in the Winograd domain channel-wise, and reconstruct results by CRT.
- Aggregate into batched GEMMs for throughput.
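The moduli-selection step can be sanity-checked mechanically. A helper sketch (the base {251, 253, 255}, the scale 4, and the accumulation bound are illustrative assumptions):

```python
from math import gcd, prod

def validate_rns_base(moduli, scale, max_abs):
    """Check pairwise coprimality, coprimality with the transform scale,
    and that the dynamic range covers signed results up to max_abs."""
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            assert gcd(a, b) == 1, f"moduli {a}, {b} share a factor"
        assert gcd(a, scale) == 1, f"modulus {a} not invertible vs scale {scale}"
    M = prod(moduli)
    assert M > 2 * max_abs, "dynamic range too small for signed outputs"
    return M

# e.g. INT8 data/filters, 3x3 kernel, results accumulated over 32 input channels:
worst_case = 32 * 9 * 127 * 128
M = validate_rns_base([251, 253, 255], scale=4, max_abs=worst_case)
```

Accumulating over more input channels before reconstruction requires a proportionally larger $M$, which is one reason deeper layers may need wider moduli or an extra channel.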
RNS–Winograd thus recovers the arithmetic and wall-clock gains of classic minimal filtering, but mapped to robust, tractable, low-precision integer arithmetic (Liu et al., 2020). This approach removes common sources of Winograd instability, provides a parallelism-friendly substrate, and requires only small increases in transformation overhead relative to the potential acceleration. The method is empirically validated at scale for modern CNNs without accuracy degradation.