GOAP Algorithm for Sparse Neural Networks
- The GOAP algorithm is a hardware-oriented method that precomputes nonzero-weight mappings to efficiently target sparse convolutional and fully connected layers.
- By eliminating zero-weight operations and dynamic fetches, GOAP achieves deterministic latency, improved throughput, and predictable resource utilization.
- Integrated within the SAOCDS architecture, GOAP facilitates energy-efficient streaming for spiking neural network accelerators in edge applications.
The Gated One-to-All Product (GOAP) algorithm is a hardware-oriented computational technique for accelerating sparse convolutional and fully connected layers, especially in neuromorphic and event-driven inference networks. By leveraging precomputed mappings between nonzero kernel weights and all potential output indices, GOAP enables direct accumulation only where both input feature map (IFM) and weight are nonzero, eliminating extraneous operations and dynamic fetches. GOAP is implemented in the Sparsity-Aware Output-Channel Dataflow Streaming (SAOCDS) architecture, which is utilized for streaming spiking neural network (SNN) accelerators in edge applications such as automatic modulation classification (AMC) (Yang et al., 6 Jan 2026).
1. Algorithmic Definition and Operation
The GOAP algorithm replaces the standard sliding-window (SW) computation in convolutional or fully connected layers with a dataflow that operates only over nonzero kernel weights and their associated input indices. Instead of sliding every K×K kernel over each output pixel position and performing a full inner product, GOAP identifies the set of nonzero weights W[n] (indexed by n) and, for each, determines all output locations to which that weight contributes: the so-called "enable map" EM(W[n]). The per-weight computational flow is:
- Weight-Driven Mapping: For each nonzero weight W[n], precompute the enable map EM(W[n]), the set of output positions whose receptive field covers W[n].
- Sparse Accumulation: At inference, for each W[n] and each output position oi in EM(W[n]), fetch the relevant IFM value I[ic][oi + W[n].CI] from the input buffer and, if the value is nonzero (for SNNs, typically binary spikes), perform the accumulation V[oc][oi] ← V[oc][oi] + W[n].D, where V[oc][oi] is the membrane potential for output channel oc at position oi.
All index decoding, IFM access, and kernel traversals are resolved statically, requiring no conditional control at runtime.
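The weight-driven flow above can be sketched in a few lines of Python. The names (enable_map, goap_conv) are illustrative rather than the paper's, and a single-channel, stride-1, no-padding convolution is assumed:

```python
def enable_map(kr, kc, out_h, out_w):
    """All output pixels to which a weight at kernel offset (kr, kc)
    contributes; for stride-1 valid convolution that is every output."""
    return [(r, c) for r in range(out_h) for c in range(out_w)]

def goap_conv(ifm, kernel, K):
    """Convolution driven only by the kernel's nonzero weights."""
    in_h, in_w = len(ifm), len(ifm[0])
    out_h, out_w = in_h - K + 1, in_w - K + 1
    V = [[0] * out_w for _ in range(out_h)]           # membrane potentials
    nonzeros = [(kr, kc, kernel[kr][kc])              # precomputed offline
                for kr in range(K) for kc in range(K)
                if kernel[kr][kc] != 0]
    for kr, kc, w in nonzeros:                        # one-to-all product
        for r, c in enable_map(kr, kc, out_h, out_w):
            if ifm[r + kr][c + kc]:                   # gate: skip zero inputs
                V[r][c] += w                          # multiply-free for spikes
    return V
```

For binary spike inputs, the gate turns each multiply-accumulate into a conditional add, and zero weights never enter the loop at all.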
2. Data Structures and Precomputation
GOAP exploits the static nature of kernel sparsity for fixed-weight networks (e.g., post-training or for SNN inference). Each nonzero weight W[n] is stored as a record with three fields (named as in the schedule pseudocode of section 3):
- W[n].D: quantized (possibly binary) nonzero value
- W[n].RI: packed input/output-channel index (oc·IC + ic)
- W[n].CI: spatial (kernel offset) index
The enable map EM(W[n]) for each nonzero weight is computed offline by enumerating every output pixel index for which the convolutional coverage condition is met. All kernel weight positions and metadata are thus embedded in ROM or on-chip SRAM. This permits:
- Elimination of dynamic zero-weight fetches: zero weights are never stored, so they incur neither a fetch nor a computation.
- Schedule regularity: all iterated accumulations, including "empty" iterations (an input channel with no nonzeros) and "extra" iterations (an output channel with no nonzeros), are counted and scheduled in advance. The control path is a static loop of precomputed total length (REPS in the pseudocode of section 3).
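The offline packing step can be sketched as follows. The (D, RI, CI) layout and the RI = oc·IC + ic arithmetic follow the pseudocode in section 3; the function name pack_weights and the oc-major traversal order are our own assumptions:

```python
def pack_weights(W, OC, IC, K):
    """Flatten a dense kernel tensor W[oc][ic][kr][kc] into an
    output-channel-ordered sparse list of (D, RI, CI) records,
    with RI = oc*IC + ic and CI = kr*K + kc."""
    records = []
    for oc in range(OC):                  # oc-major order matches the
        for ic in range(IC):              # output-channel streaming dataflow
            for kr in range(K):
                for kc in range(K):
                    d = W[oc][ic][kr][kc]
                    if d != 0:            # zero weights are dropped entirely
                        records.append((d, oc * IC + ic, kr * K + kc))
    return records
```

At runtime the channel pair is recovered exactly as in the pseudocode: oc = RI // IC and ic = RI % IC.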
3. Static Dataflow and Runtime Execution
The streaming datapath is strictly output-channel ordered. For each layer:
- All processing elements (PEs) are statically assigned output channels.
- Three pipeline stages are performed per PE: neuron state update, gated accumulation for each nonzero weight, and spike thresholding.
- Inter-layer handoff uses lightweight per-channel FIFOs, and the absence of conditional branches in runtime control ensures every clock cycle performs a predetermined task (accumulation, state update, or IFM advance).
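The three per-PE stages can be sketched per output channel as below. The leak model (multiplicative decay) and reset-to-zero on firing are illustrative assumptions; the paper's exact neuron model is not reproduced here:

```python
def pe_channel_step(V, contributions, decay, threshold):
    """One output channel, one timestep.
    V: membrane potentials for this channel's output positions.
    contributions: list of (oi, weight, spike_in) gated updates."""
    V = [v * decay for v in V]                        # stage 1: state update (leak)
    for oi, w, spike_in in contributions:             # stage 2: gated accumulation
        if spike_in:                                  # only nonzero inputs fire
            V[oi] += w
    spikes = [1 if v >= threshold else 0 for v in V]  # stage 3: thresholding
    V = [0 if s else v for v, s in zip(V, spikes)]    # reset fired neurons
    return V, spikes
```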
A schedule pseudocode reflecting GOAP’s one-to-all product principle is as follows:
```
For t in 0…T–1 do
    IC_read ← 0; pre_oc ← OC; oc ← 0; nnz ← 0
    For reps in 0…REPS–1 do
        nnz_oc  ← floor(W[nnz].RI / IC)
        next_oc ← floor(W[nnz+1].RI / IC)
        if IC_read < IC:
            read IFM channel ic; IC_read++
        if nnz_oc ≠ oc:
            output spike[oc]; decay & store V[oc]
            oc++; pre_oc ← oc
            continue
        ic ← W[nnz].RI mod IC
        if ic < IC_read:
            if oc ≠ pre_oc:
                load & decay V[oc]
            for oi in EM(W[nnz]):
                if I[ic][oi + W[nnz].CI] == 1:
                    V[oc][oi] += W[nnz].D
            if next_oc ≠ oc:
                output spike[oc]; store V[oc]
                oc++; pre_oc ← oc
            nnz++
```
4. Quantitative Complexity and Efficiency
A complexity comparison against sliding-window convolutional execution quantifies the GOAP benefit:
| Computation | Sliding Window (SW) | GOAP (typical binary SNN) |
|---|---|---|
| Weight Fetches | 96 | 12 |
| Input Fetches | 24 | 48 |
| Accumulations | 48 | 24 |
GOAP thus reduces both weight fetches and total accumulations in proportion to the kernel's spatial sparsity. As kernel sparsity increases (up to 90%), the accumulation count falls linearly: at 50% sparsity, only ~50% of the original accumulations are required (Yang et al., 6 Jan 2026).
5. Architectural Integration in SAOCDS
In the SAOCDS accelerator, GOAP is embedded within the Matrix-Vector Threshold Unit (MVTU) of each layer, paired with small per-layer input buffers and weight storage in sparse coordinate format. Control logic for the accumulation pipeline is minimal: iteration tracking, per-output-channel FIFOs, and static index counters replace global routers and dynamic data-fetch scheduling. This design yields fully pipelined, control-free execution across all layers, maximizing throughput and minimizing energy per sample. For example, on FPGA, a 23.5 MS/s throughput is achieved on RadioML-2016 (2× the FINN baseline), with up to 80% energy reduction at high sparsity and negligible classification accuracy loss (Yang et al., 6 Jan 2026).
6. Comparison with Alternative Acceleration Methods
Traditional systolic arrays utilize mesh-connected homogeneous PEs with global routing and dynamic scheduling; while they exploit sparsity via runtime metadata, this imposes additional control latency, interconnect congestion, and typically necessitates explicit memory accesses for each operation. In contrast, standard streaming architectures (e.g., FINN streaming) instantiate each neural layer with local weight storage and point-to-point FIFOs, allowing for high throughput but failing to exploit spatial sparsity—zero weights still incur redundant fetch and MAC activity.
GOAP within SAOCDS uniquely provides:
- Precise spatial sparsity utilization: Only nonzero IFM-weight interactions trigger MAC operations.
- Control-free, high-throughput pipeline: Fully static schedule and no branching ensure that each cycle performs work with no stalling.
- Low hardware overhead: Only a modest increase in LUT utilization (+11% vs. FINN) for significant reductions in BRAM and global control hardware at comparable or improved accuracy (Yang et al., 6 Jan 2026).
7. Impact and Applicability
The GOAP algorithm and its embedding in the SAOCDS architecture enable practical SNN and other sparse network deployments for edge signal processing and real-time classification. The precomputed one-to-all mapping permits linear scaling of sparsity benefit with minimal penalty to area and control complexity. Power and throughput improvements make this approach viable for stringent edge and embedded settings. Empirically, a 2× throughput boost, up to 5× energy reduction, and only minor hardware resource increases are achieved for SNN-based AMC tasks, all while essentially preserving classification accuracy (Yang et al., 6 Jan 2026).