Residue Number System Winograd
- Residue Number System (RNS) Winograd is a reformulation of the classical Winograd convolution, embedding modular arithmetic and CRT reconstruction to enable efficient low-precision integer computations.
- It translates conventional Winograd transforms into modular operations using 8–16 bit arithmetic, supporting larger transform tiles without sacrificing model accuracy.
- Empirical evaluations show significant throughput improvements on modern hardware by reducing multiplication counts and mitigating floating-point instability.
Residue Number System (RNS) Winograd convolution is a reformulation of the classic Winograd minimal-filtering convolution algorithm for efficient, exact, low-precision integer computation, leveraging the parallel, carry-free properties of the Residue Number System. Motivated by the limitations of floating-point Winograd implementations for low-bitwidth convolutional neural network (CNN) inference, RNS-Winograd enables large transform tiles (e.g., 10×10 to 16×16) and near-minimal arithmetic complexity with 8–16 bit integer operations and no sacrifice in prediction accuracy. By embedding all Winograd steps into modular arithmetic over several small, pairwise-coprime moduli, and reconstructing full-precision results via the Chinese Remainder Theorem (CRT), RNS-Winograd achieves substantial throughput improvements on modern low-precision hardware (Liu et al., 2020).
1. The Classic Winograd Minimal-Filtering Algorithm
Winograd minimal filtering, denoted $F(m \times m, r \times r)$, computes the convolution of an $r \times r$ filter with an $n \times n$ input patch to yield an $m \times m$ output tile. Let $n = m + r - 1$. The Winograd convolution in matrix terms is:

$$Y = A^\top \left[ (G g G^\top) \odot (B^\top d B) \right] A$$

where $G$ (size $n \times r$) and $B$ (size $n \times n$) are the filter and data transform matrices, $A$ (size $n \times m$) the inverse transform matrix, and $\odot$ denotes element-wise multiplication.
Operationally:
- Filters are transformed: $U = G g G^\top$.
- Inputs are transformed: $V = B^\top d B$.
- Element-wise multiplication: $Z = U \odot V$.
- The result is inverse transformed: $Y = A^\top Z A$.
Winograd reduces arithmetic cost. Direct convolution requires $m^2 r^2$ multiplications per output tile, Winograd just $(m + r - 1)^2$, yielding a reduction factor of $m^2 r^2 / (m + r - 1)^2$. For example, $F(2 \times 2, 3 \times 3)$ gives a $36/16 = 2.25\times$ reduction.
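As a concrete check, the $F(2\times2, 3\times3)$ case above can be verified against direct convolution. A minimal NumPy sketch, using the standard Lavin–Gray transform constants (the constants and function names are illustrative, not taken from the source):

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transform matrices (Lavin-Gray constants).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """4x4 input patch d, 3x3 filter g -> 2x2 output tile."""
    U = G @ g @ G.T              # filter transform (precomputable per filter)
    V = BT @ d @ BT.T            # data transform
    return AT @ (U * V) @ AT.T   # Hadamard product, then inverse transform

def direct_tile(d, g):
    """Reference: direct valid cross-correlation."""
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                     for i in range(2)])

rng = np.random.default_rng(0)
d = rng.integers(-8, 9, (4, 4)).astype(float)
g = rng.integers(-8, 9, (3, 3)).astype(float)
assert np.allclose(winograd_2x2_3x3(d, g), direct_tile(d, g))
```

Note that the filter transform uses 16 multiplications here versus 36 for the direct tile, matching the $2.25\times$ reduction.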
2. Residue Number System (RNS) and CRT Reconstruction
An RNS is defined by pairwise-coprime moduli $m_1, m_2, \ldots, m_k$ with total dynamic range $M = \prod_{i=1}^{k} m_i$. Each integer $x \in [0, M)$ is represented by its residues $(x_1, \ldots, x_k)$, where $x_i = x \bmod m_i$.
Arithmetic in RNS is performed independently in each channel (modulus):
- Addition/multiplication: $(x + y)_i = (x_i + y_i) \bmod m_i$, $(x \cdot y)_i = (x_i \cdot y_i) \bmod m_i$.
CRT reconstruction retrieves $x$ from its residues: for each $i$, let $M_i = M / m_i$ and $N_i = M_i^{-1} \bmod m_i$. Then,

$$x = \left( \sum_{i=1}^{k} x_i N_i M_i \right) \bmod M.$$
This enables fully parallel, carry-free, low-precision computation suitable for quantized neural network inference.
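A minimal sketch of RNS encoding, carry-free channel-wise multiplication, and CRT reconstruction (the base {253, 255, 256} is an illustrative pairwise-coprime set of 8-bit moduli, not prescribed by the source):

```python
from math import prod

MODULI = [253, 255, 256]   # pairwise coprime: 11*23, 3*5*17, 2^8
M = prod(MODULI)           # total dynamic range, about 2^24

def to_rns(x):
    """Encode an integer as its residues in each channel."""
    return [x % m for m in MODULI]

def crt(residues):
    """Reconstruct x in [0, M) from its residues via the CRT formula."""
    total = 0
    for x_i, m_i in zip(residues, MODULI):
        M_i = M // m_i
        N_i = pow(M_i, -1, m_i)   # inverse exists since moduli are coprime
        total += x_i * N_i * M_i
    return total % M

x, y = 123456, 7890
# carry-free channel-wise multiply, then a single reconstruction
prod_res = [(a * b) % m for a, b, m in zip(to_rns(x), to_rns(y), MODULI)]
assert crt(to_rns(x)) == x
assert crt(prod_res) == (x * y) % M
```

Each channel only ever touches values below its (8-bit) modulus, which is what makes the per-channel arithmetic suitable for narrow integer hardware.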
3. Mapping Winograd Transforms to RNS
To embed Winograd into RNS, all transform matrices ($A$, $B$, $G$) must be expressed in integer or modular form. Rational entries are cleared via the least common multiple $c$ of their denominators (e.g., $\tilde{G} = c\,G$ with integer entries). Each modulus $m_i$ must be coprime to $c$ to permit modular inversion:

$$G_{(i)} = \left( c^{-1} \bmod m_i \right) \tilde{G} \bmod m_i$$

- Analogously for $A$ and $B$.
This process ensures all per-modulus transforms can be conducted with 8- or 16-bit integer arithmetic within each channel, so the full Winograd procedure is mapped to efficient, low-precision modular operations.
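For $F(2\times2, 3\times3)$, only $G$ has rational entries (halves), so $c = 2$ clears them and the squared factor $c^2 = 4$ can be divided out modularly at the end. A sketch under these assumptions, checking a single residue channel against direct convolution (the modulus 251 is illustrative):

```python
import numpy as np

m = 251   # odd modulus, coprime to the scale c = 2 (and hence to c^2 = 4)

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.int64)
Gt = np.array([[2, 0, 0], [1, 1, 1],
               [1, -1, 1], [0, 0, 2]], dtype=np.int64)   # 2*G: all-integer
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.int64)

def winograd_mod(d, g, m):
    """One residue channel: the 2x2 output tile mod m, computed exactly."""
    U = (Gt @ g @ Gt.T) % m              # equals 4*(G g G^T) mod m
    V = (BT @ d @ BT.T) % m              # B^T is already integer
    Y4 = (AT @ (U * V % m) @ AT.T) % m   # equals 4*Y mod m
    return (Y4 * pow(4, -1, m)) % m      # divide out the scale modularly

rng = np.random.default_rng(1)
d = rng.integers(-128, 128, (4, 4), dtype=np.int64)
g = rng.integers(-128, 128, (3, 3), dtype=np.int64)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.array_equal(winograd_mod(d, g, m), direct % m)
```

Because the true output is an integer, dividing by the scale via a modular inverse is exact, not an approximation.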
4. End-to-End RNS–Winograd Convolution Process
Given quantized INT8 filters $g$, activations $d$, and RNS base $\{m_1, \ldots, m_k\}$, the algorithm proceeds channel-wise as follows:
- Forward Filter Transform (per residue channel $i$): $U_{(i)} = \left( G_{(i)}\, g\, G_{(i)}^\top \right) \bmod m_i$. This can be precomputed once per filter.
- Forward Data Transform: $V_{(i)} = \left( B_{(i)}^\top\, d\, B_{(i)} \right) \bmod m_i$
- Pointwise Multiplication: $Z_{(i)} = \left( U_{(i)} \odot V_{(i)} \right) \bmod m_i$
- Inverse Output Transform: $Y_{(i)} = \left( A_{(i)}^\top\, Z_{(i)}\, A_{(i)} \right) \bmod m_i$
- CRT Reconstruction: For each output element, recover $y = \left( \sum_{i=1}^{k} Y_{(i)} N_i M_i \right) \bmod M$, with $M_i = M / m_i$ and $N_i = M_i^{-1} \bmod m_i$.
The forward data transform is highly amortized, as each transformed input patch is reused across all output channels (filters). Depthwise summations and pointwise multiplies can leverage optimized INT8/INT16 GEMM kernels at each modulus (Liu et al., 2020).
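The steps above can be combined into a compact end-to-end sketch for a single $F(2\times2, 3\times3)$ tile over three odd 8-bit moduli. The base {251, 253, 255} (pairwise coprime and coprime to the scale factor 4) and all names are illustrative assumptions, not constants from the source:

```python
import numpy as np
from math import prod

MODULI = [251, 253, 255]   # odd, pairwise coprime, each fits in 8 bits
M = prod(MODULI)           # ~2^24 total dynamic range

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.int64)
Gt = np.array([[2, 0, 0], [1, 1, 1],
               [1, -1, 1], [0, 0, 2]], dtype=np.int64)   # 2*G, integer
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.int64)

def rns_winograd_tile(d, g):
    """INT8 4x4 patch d, 3x3 filter g -> exact signed 2x2 output tile."""
    acc = np.zeros((2, 2), dtype=np.int64)
    for m in MODULI:
        U = (Gt @ (g % m) @ Gt.T) % m          # filter transform, channel m
        V = (BT @ (d % m) @ BT.T) % m          # data transform, channel m
        Y4 = (AT @ (U * V % m) @ AT.T) % m     # 4*Y mod m
        Y_m = (Y4 * pow(4, -1, m)) % m         # residue of the true output
        M_i = M // m
        acc += Y_m * (pow(M_i, -1, m) * M_i)   # CRT accumulation
    out = acc % M
    return np.where(out > M // 2, out - M, out)  # recenter to signed range

rng = np.random.default_rng(2)
d = rng.integers(-128, 128, (4, 4), dtype=np.int64)
g = rng.integers(-128, 128, (3, 3), dtype=np.int64)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.array_equal(rns_winograd_tile(d, g), direct)
```

The recentering step works because the worst-case INT8 output magnitude ($9 \cdot 127 \cdot 128 \approx 2^{17}$) is well within half the dynamic range $M/2 \approx 2^{23}$.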
5. Complexity Reduction and Arithmetic Analysis
For $m \times m$ output tiles, $r \times r$ filters, tile size $n = m + r - 1$, and $k$ RNS channels, when transform and CRT costs are amortized, the dominant cost is $k\,n^2$ multiplies per output tile, compared to $m^2 r^2$ for direct convolution. The theoretical speedup is:

$$\text{speedup} = \frac{m^2 r^2}{k\,(m + r - 1)^2}$$
Representative reduction factors, computed from the speedup formula (assuming $k = 3$ residue channels for 8-bit moduli and $k = 2$ for 16-bit), are shown in the following table:

| Filter $r$ | Tile $m$ | Reduction (8-bit, $k=3$) | Reduction (16-bit, $k=2$) |
|---|---|---|---|
| $3$ | $4$ | $1.33\times$ | $2.00\times$ |
| $3$ | $6$ | $1.69\times$ | $2.53\times$ |
| $3$ | $12$ | $2.20\times$ | $3.31\times$ |
| $5$ | $10$ | $4.25\times$ | $6.38\times$ |
| $5$ | $12$ | $4.69\times$ | $7.03\times$ |
In practice, the moduli bit-width is selected in the range $8$–$16$ to balance transform cost, reconstruction overhead, and GEMM efficiency.
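These reduction factors follow directly from the speedup formula; a small sketch (the channel counts $k=3$ at 8-bit and $k=2$ at 16-bit are assumptions, as above):

```python
def reduction(m, r, k):
    """Multiplies per output tile: direct m^2 r^2 vs RNS-Winograd k(m+r-1)^2."""
    return (m * m * r * r) / (k * (m + r - 1) ** 2)

for r, m in [(3, 4), (3, 6), (3, 12), (5, 10), (5, 12)]:
    print(f"r={r}, m={m:>2}: "
          f"8-bit {reduction(m, r, 3):.2f}x, 16-bit {reduction(m, r, 2):.2f}x")
```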
6. Empirical Performance and Accuracy
Experiments confirm that RNS–Winograd delivers significant acceleration without impairing model accuracy:
- On an Arm Cortex-A73 CPU, 8-bit RNS–Winograd with large tiles and $3 \times 3$ filters yields a substantial speedup over an INT8 im2col+GEMM baseline for VGG16, with no loss in ImageNet Top-1 accuracy (71.4%).
- 16-bit RNS–Winograd likewise outperforms an INT16 baseline.
- For $5 \times 5$ filters (Inception-v3), 8-bit RNS–Winograd achieves an even larger speedup, consistent with the greater arithmetic reduction for larger kernels.
- Transform and CRT/mixed-radix reconstruction overheads remain a modest fraction of total runtime (Liu et al., 2020).
- Arithmetic reduction for large tiles and filters reaches several-fold over direct convolution.
Empirical validation across VGG16, ResNet50, and Inception (v1/v3) shows no observable top-1 accuracy loss.
7. Implementation Guidelines and Efficacy
RNS–Winograd eliminates numerical instability inherent in large-tile Winograd with FP32 by confining all computation to integer modular domains. The approach is well-suited for modern accelerators supporting wide low-precision GEMMs and parallel SIMD execution. Implementation steps are:
- Select an appropriate Winograd tile $F(m \times m, r \times r)$ and RNS moduli $\{m_i\}$, ensuring the product $M = \prod_i m_i$ covers the required dynamic range and each $m_i$ is coprime with the scale factor $c$ that clears the transform denominators.
- Precompute the transform matrices $G_{(i)}$, $B_{(i)}$, $A_{(i)}$ modulo each $m_i$.
- For each layer: pre-transform filters, process input patches, multiply in the Winograd domain channel-wise, and reconstruct results by CRT.
- Aggregate into batched GEMMs for throughput.
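The moduli-selection step can be sanity-checked mechanically. A helper sketch (the base {251, 253, 255}, the scale 4, and the accumulation bound are illustrative assumptions):

```python
from math import gcd, prod

def validate_rns_base(moduli, scale, max_abs):
    """Check pairwise coprimality, coprimality with the transform scale,
    and that the dynamic range covers signed results up to max_abs."""
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            assert gcd(a, b) == 1, f"moduli {a}, {b} share a factor"
        assert gcd(a, scale) == 1, f"modulus {a} not invertible vs scale {scale}"
    M = prod(moduli)
    assert M > 2 * max_abs, "dynamic range too small for signed outputs"
    return M

# e.g. INT8 data/filters, 3x3 kernel, results accumulated over 32 input channels:
worst_case = 32 * 9 * 127 * 128
M = validate_rns_base([251, 253, 255], scale=4, max_abs=worst_case)
```

Accumulating over more input channels before reconstruction requires a proportionally larger $M$, which is one reason deeper layers may need wider moduli or an extra channel.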
RNS–Winograd thus recovers the arithmetic and wall-clock gains of classic minimal filtering, but mapped to robust, tractable, low-precision integer arithmetic (Liu et al., 2020). This approach removes common sources of Winograd instability, provides a parallelism-friendly substrate, and requires only small increases in transformation overhead relative to the potential acceleration. The method is empirically validated at scale for modern CNNs without accuracy degradation.