
Residue Number System Winograd

Updated 13 February 2026
  • Residue Number System (RNS) Winograd is a reformulation of the classical Winograd convolution, embedding modular arithmetic and CRT reconstruction to enable efficient low-precision integer computations.
  • It translates conventional Winograd transforms into modular operations using 8–16 bit arithmetic, supporting larger transform tiles without sacrificing model accuracy.
  • Empirical evaluations show significant throughput improvements on modern hardware by reducing multiplication counts and mitigating floating-point instability.

Residue Number System (RNS) Winograd convolution is a reformulation of the classic Winograd minimal-filtering convolution algorithm for efficient, exact, low-precision integer computation, leveraging the parallel, carry-free properties of the Residue Number System. Motivated by the limitations of floating-point Winograd implementations for low-bitwidth convolutional neural network (CNN) inference, RNS-Winograd enables large transform tiles (e.g., 10×10 to 16×16) and near-minimal arithmetic complexity with 8–16 bit integer operations and no sacrifice in prediction accuracy. By embedding all Winograd steps into modular arithmetic over several small, pairwise-coprime moduli, and reconstructing full-precision results via the Chinese Remainder Theorem (CRT), RNS-Winograd achieves substantial throughput improvements on modern low-precision hardware (Liu et al., 2020).

1. The Classic Winograd Minimal-Filtering Algorithm

Winograd minimal filtering, denoted $F(M\times M, R\times R)$, computes the convolution of an $R\times R$ filter $g$ with an input patch $d$ to yield an $M\times M$ output tile $Y$. Let $N = M + R - 1$. In matrix terms, the Winograd convolution is:

$$Y = A^\top \left[ (G g G^\top) \circ (B^\top d B) \right] A$$

where $G$ (size $N\times R$) and $B^\top$ (size $N\times N$) are the filter and data transform matrices, $A^\top$ (size $M\times N$) is the inverse transform matrix, and $\circ$ denotes element-wise multiplication.

Operationally:

  • Filters are transformed: $\hat{W} = G g G^\top$.
  • Inputs are transformed: $\hat{D} = B^\top d B$.
  • Element-wise multiplication: $\hat{S} = \hat{W} \circ \hat{D}$.
  • The result is inverse transformed: $Y = A^\top \hat{S} A$.

Winograd reduces arithmetic cost: direct convolution requires $M^2 R^2$ multiplications per output tile, whereas Winograd needs only $N^2$, a reduction factor of $M^2R^2/(M+R-1)^2$. For example, $F(2\times2, 3\times3)$ gives a $2.25\times$ reduction.
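The four operational steps above can be sketched for $F(2\times2, 3\times3)$ using the standard transform matrices for this tile size (shown here as an illustrative NumPy sketch; the check against direct convolution is our own, not from the paper):

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """Y = A^T [(G g G^T) o (B^T d B)] A for a 4x4 patch d and 3x3 filter g."""
    W_hat = G @ g @ G.T            # filter transform, 4x4
    D_hat = BT @ d @ BT.T          # data transform, 4x4
    return AT @ (W_hat * D_hat) @ AT.T   # element-wise product, then inverse

def direct_conv(d, g):
    """Direct 'valid' cross-correlation: the reference result."""
    return np.array([[np.sum(d[p:p+3, q:q+3] * g) for q in range(2)]
                     for p in range(2)])

rng = np.random.default_rng(0)
d = rng.integers(-8, 8, size=(4, 4)).astype(float)
g = rng.integers(-8, 8, size=(3, 3)).astype(float)
assert np.allclose(winograd_f2x2_3x3(d, g), direct_conv(d, g))
```

Note that $G$ contains halves; this is exactly the rational content that the RNS formulation later clears with a denominator multiplier.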

2. Residue Number System (RNS) and CRT Reconstruction

An RNS is defined by $k$ pairwise-coprime moduli $\{m_1, \dots, m_k\}$ with total dynamic range $M = \prod_i m_i$. Each integer $x \in [0, M-1]$ is represented by its residues $\{x_1, \dots, x_k\}$, where $x_i \equiv x \pmod{m_i}$.

Arithmetic in RNS is performed independently in each channel (modulus):

  • Addition/subtraction: $(x \pm y)_i \equiv (x_i \pm y_i) \pmod{m_i}$; multiplication: $(x \cdot y)_i \equiv (x_i \cdot y_i) \pmod{m_i}$.

CRT reconstruction retrieves $x$ from its residues: for each $m_i$, let $M_i = M/m_i$ and $y_i \equiv M_i^{-1} \pmod{m_i}$. Then,

$$x \equiv \sum_{i=1}^k x_i M_i y_i \pmod{M}$$

This enables fully parallel, carry-free, low-precision computation suitable for quantized neural network inference.
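The residue decomposition and CRT formula above can be sketched directly (the moduli 251, 241, 239 are an illustrative 8-bit base chosen here, not prescribed by the paper):

```python
# Three pairwise-coprime moduli, each fitting in 8 bits.
moduli = [251, 241, 239]
M = 251 * 241 * 239          # total dynamic range, ~1.4e7

def to_rns(x):
    """Represent x by its residues in each channel."""
    return [x % m for m in moduli]

def crt(residues):
    """Recover x from residues via x = sum_i x_i * M_i * y_i (mod M)."""
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        yi = pow(Mi, -1, m)   # modular inverse M_i^{-1} mod m_i (Python 3.8+)
        x += r * Mi * yi
    return x % M

# Carry-free channel-wise multiplication, then reconstruction.
a, b = 123456, 7890
prod = [(ra * rb) % m for ra, rb, m in zip(to_rns(a), to_rns(b), moduli)]
assert crt(prod) == (a * b) % M
```

Each channel's arithmetic never exceeds the modulus squared, which is what makes narrow integer units sufficient.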

3. Mapping Winograd Transforms to RNS

To embed Winograd into RNS, all transform matrices ($A^\top$, $B^\top$, $G$) must be expressed in integer or modular form. Rational entries are cleared via the least common multiple $\alpha$ of all denominators (e.g., $G = (1/\alpha) G'$ with integer $G'$). Each modulus $m_i$ must be coprime to $\alpha$ to permit modular inversion:

  • $G_{(m_i)} = (\alpha^{-1} \bmod m_i) \cdot (G' \bmod m_i)$
  • and analogously for $B^\top_{(m_i)}$ and $A^\top_{(m_i)}$.

This ensures that every per-modulus transform can be carried out with 8- or 16-bit integer arithmetic within its channel, mapping the full Winograd procedure onto efficient, low-precision modular operations.
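Denominator clearing can be sketched for the $F(2\times2,3\times3)$ filter transform, where $\alpha = 2$ (the modulus 251 is an illustrative choice):

```python
import numpy as np

m = 251                     # odd prime, coprime to alpha
alpha = 2                   # LCM of denominators in G for F(2x2, 3x3)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
G_int = np.rint(alpha * G).astype(np.int64)   # integer G' = alpha * G
alpha_inv = pow(alpha, -1, m)                  # 2^{-1} mod 251 = 126
G_m = (alpha_inv * G_int) % m                  # per-modulus transform G_(m)

# Every 1/2 entry of G is now represented by 126, and -1/2 by 251 - 126.
assert G_m[1, 0] == alpha_inv
assert G_m[2, 1] == m - alpha_inv
```

Negative matrix entries simply map to their nonnegative residues, so the per-channel matrices contain only small unsigned integers.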

4. End-to-End RNS–Winograd Convolution Process

Given quantized INT8 filters and activations and an RNS base $\{m_i\}$, the algorithm proceeds channel-wise as follows:

  1. Forward Filter Transform (per residue channel $i$):

$$\hat{W}^{(i)} = G_{(m_i)}\, g\, G_{(m_i)}^\top \pmod{m_i}$$

This can be precomputed once per filter.

  2. Forward Data Transform:

$$\hat{D}^{(i)} = B^\top_{(m_i)}\, d\, B_{(m_i)} \pmod{m_i}$$

  3. Pointwise Multiplication:

$$\hat{S}^{(i)} = \hat{W}^{(i)} \circ \hat{D}^{(i)} \pmod{m_i}$$

  4. Inverse Output Transform:

$$Y^{(i)} = A^\top_{(m_i)}\, \hat{S}^{(i)}\, A_{(m_i)} \pmod{m_i}$$

  5. CRT Reconstruction: for each output element $y[p,q]$, recover

$$y[p,q] = \mathrm{CRT}\big(\{ Y^{(i)}[p,q] \}\big)$$

The forward data transform is highly amortized, as each transformed input patch is reused across all output channels. Pointwise multiplies and depthwise summations can leverage optimized INT8/INT16 GEMM kernels at each modulus (Liu et al., 2020).
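The five steps can be sketched end-to-end for a single $F(2\times2,3\times3)$ tile. This is a minimal illustration under our own assumptions (the base {251, 241, 239}, INT8-range inputs, and the standard $F(2,3)$ matrices with $\alpha = 2$); a final recentering maps the CRT output back to signed integers:

```python
import numpy as np

moduli = [251, 241, 239]
M = 251 * 241 * 239
alpha = 2

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]],
              dtype=np.int64)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.int64)
G2 = np.array([[2, 0, 0], [1, 1, 1], [1, -1, 1], [0, 0, 2]],
              dtype=np.int64)          # integer G' = 2 * G

def rns_winograd(d, g):
    channels = []
    for m in moduli:
        Gm = (pow(alpha, -1, m) * G2) % m
        W = (Gm @ g @ Gm.T) % m        # 1. filter transform
        D = (BT @ d @ BT.T) % m        # 2. data transform
        S = (W * D) % m                # 3. pointwise product
        channels.append((AT @ S @ AT.T) % m)   # 4. inverse transform
    Y = np.zeros((2, 2), dtype=np.int64)       # 5. CRT reconstruction
    for Ym, m in zip(channels, moduli):
        Mi = M // m
        Y = (Y + Ym * Mi * pow(Mi, -1, m)) % M
    return np.where(Y > M // 2, Y - M, Y)      # recenter to signed range

rng = np.random.default_rng(1)
d = rng.integers(-128, 128, size=(4, 4)).astype(np.int64)
g = rng.integers(-128, 128, size=(3, 3)).astype(np.int64)
direct = np.array([[np.sum(d[p:p+3, q:q+3] * g) for q in range(2)]
                   for p in range(2)])
assert np.array_equal(rns_winograd(d, g), direct)
```

Because every intermediate stays below each modulus squared, the per-channel work fits narrow integer arithmetic, yet the reconstructed result is bit-exact against direct convolution.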

5. Complexity Reduction and Arithmetic Analysis

For $M\times M$ output tiles, $R\times R$ filters, tile size $N = M+R-1$, and $n$ RNS channels, when transform and CRT costs are amortized, the dominant cost is $nN^2$ multiplies per output tile, compared with $M^2R^2$ for direct convolution. The theoretical speedup is:

$$\mathrm{Speedup} \approx \frac{M^2R^2}{nN^2}$$

Representative reduction factors are shown in the following table:

Filter $R$   Tile $M$   $n=3$ (8-bit)   $n=2$ (16-bit)
3            4          1.33×           2.00×
3            6          1.69×           2.53×
3            12         2.20×           3.31×
5            10         4.25×           6.38×
5            12         4.69×           7.03×

In practice, $M$ is selected in the range 8–16 to balance transform cost, reconstruction overhead, and GEMM efficiency.
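The speedup formula above reproduces the table entries directly:

```python
def speedup(M, R, n):
    """Theoretical reduction M^2 R^2 / (n N^2), N = M + R - 1,
    assuming transform and CRT costs are fully amortized."""
    N = M + R - 1
    return (M * M * R * R) / (n * N * N)

# Reproduce two table rows: (R=3, M=4, n=3) and (R=5, M=12, n=2).
assert round(speedup(4, 3, 3), 2) == 1.33
assert round(speedup(12, 5, 2), 2) == 7.03
```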

6. Empirical Performance and Accuracy

Experiments confirm that RNS–Winograd delivers significant acceleration without impairing model accuracy:

  • On an Arm Cortex-A73 CPU, 8-bit RNS–Winograd ($M=14$, $3\times3$ filters, $n=3$) yields a $2.02\times$ speedup over an INT8 im2col+GEMM baseline for VGG16, with no loss in ImageNet Top-1 accuracy (71.4%).
  • 16-bit RNS–Winograd ($n=2$) achieves $\sim 2.2\times$ speedup over an INT16 baseline.
  • For $5\times5$ filters (Inception-v3), $F(10\times 10, 5\times 5)$ in 8-bit RNS achieves up to a $2.31\times$ speedup.
  • Transform and CRT/mixed-radix overheads are modest: forward transforms ($\sim 7.9\%$), output transforms ($\sim 9.2\%$), CRT ($\sim 1.1\%$) (Liu et al., 2020).
  • Arithmetic reduction for large tiles ($M=12$, $R=5$, $n=3$) reaches $4.69\times$.

Empirical validation across VGG16, ResNet50, and Inception (v1/v3) shows no observable top-1 accuracy loss.

7. Implementation Guidelines and Efficacy

RNS–Winograd eliminates numerical instability inherent in large-tile Winograd with FP32 by confining all computation to integer modular domains. The approach is well-suited for modern accelerators supporting wide low-precision GEMMs and parallel SIMD execution. Implementation steps are:

  1. Select an appropriate Winograd tile size $M$ and RNS moduli $\{m_i\}$, ensuring their product covers the required dynamic range and each $m_i$ is coprime to $\alpha$ (the cleared transform denominator).
  2. Precompute the transform matrices $A^\top_{(m_i)}$, $B^\top_{(m_i)}$, $G_{(m_i)}$ mod $m_i$.
  3. For each layer: pre-transform filters, transform input patches, multiply channel-wise in the Winograd domain, and reconstruct results by CRT.
  4. Aggregate into batched GEMMs for throughput.
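The moduli-selection step can be sanity-checked as sketched below (the base and dynamic-range bound are illustrative assumptions, not values prescribed by the paper):

```python
from math import gcd, prod

def valid_rns_base(moduli, alpha, range_needed):
    """Check pairwise coprimality, coprimality with alpha,
    and that the product of moduli covers the required dynamic range."""
    pairwise = all(gcd(a, b) == 1
                   for i, a in enumerate(moduli) for b in moduli[i + 1:])
    coprime_alpha = all(gcd(m, alpha) == 1 for m in moduli)
    return pairwise and coprime_alpha and prod(moduli) >= range_needed

# Example: three 8-bit moduli, alpha = 2, a generous 2^23 accumulation bound.
assert valid_rns_base([251, 241, 239], alpha=2, range_needed=1 << 23)
```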

RNS–Winograd thus recovers the arithmetic and wall-clock gains of classic minimal filtering, but mapped to robust, tractable, low-precision integer arithmetic (Liu et al., 2020). This approach removes common sources of Winograd instability, provides a parallelism-friendly substrate, and requires only small increases in transformation overhead relative to the potential acceleration. The method is empirically validated at scale for modern CNNs without accuracy degradation.
