
GOAP Algorithm for Sparse Neural Networks

Updated 13 January 2026
  • The GOAP algorithm is a hardware-oriented method that precomputes nonzero-weight mappings to efficiently target sparse convolutional and fully connected layers.
  • By eliminating zero-weight operations and dynamic fetches, GOAP achieves deterministic latency, improved throughput, and predictable resource utilization.
  • Integrated within the SAOCDS architecture, GOAP facilitates energy-efficient streaming for spiking neural network accelerators in edge applications.

The Gated One-to-All Product (GOAP) algorithm is a hardware-oriented computational technique for accelerating sparse convolutional and fully connected layers, especially in neuromorphic and event-driven inference networks. By leveraging precomputed mappings between nonzero kernel weights and all potential output indices, GOAP enables direct accumulation only where both input feature map (IFM) and weight are nonzero, eliminating extraneous operations and dynamic fetches. GOAP is implemented in the Sparsity-Aware Output-Channel Dataflow Streaming (SAOCDS) architecture, which is utilized for streaming spiking neural network (SNN) accelerators in edge applications such as automatic modulation classification (AMC) (Yang et al., 6 Jan 2026).

1. Algorithmic Definition and Operation

The GOAP algorithm replaces the standard sliding-window (SW) computation in convolutional or fully connected layers with a dataflow that operates only over nonzero kernel weights and their associated input indices. Instead of sliding every K×K kernel over each output pixel position and performing a full inner product, GOAP identifies the set of nonzero weights (indexed by nz) and, for each, determines all output locations to which that weight contributes: the so-called "enable map" EM(w). The per-weight computational flow is:

  1. Weight-Driven Mapping: For each nonzero weight w, precompute EM(w) ⊂ {all output pixel indices oi}, the set of output positions whose receptive field covers w.
  2. Sparse Accumulation: At inference, for each w and each oi ∈ EM(w), fetch I[ic][oi + w.CI] (the relevant IFM value) from the input buffer and, if the value is nonzero (for SNNs, typically a binary spike), perform the accumulation:

V[oc][oi] ← V[oc][oi] + w.D   iff   I[ic][oi + w.CI] = 1

where V[oc][oi] is the membrane potential for output channel oc at position oi.

All index decoding, IFM access, and kernel traversals are resolved statically, requiring no conditional control at runtime.
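
The per-weight flow above can be sketched in a few lines of Python. This is an illustrative sketch only: the dict-based record layout and the function name are assumptions, mirroring the weight value W[nz].D, the channel indices, the spatial offset CI, and the precomputed enable map EM.

```python
import numpy as np

def goap_layer(ifm, weights, V):
    """Gated one-to-all accumulation over nonzero weights only.

    ifm     : dict mapping input channel ic -> 1-D binary spike array
    weights : list of nonzero-weight records {"D", "ic", "oc", "CI", "EM"}
              (illustrative layout; "EM" is the precomputed enable map)
    V       : array [oc][oi] of membrane potentials, updated in place
    """
    for w in weights:                              # nonzero weights only
        for oi in w["EM"]:                         # every output this weight touches
            if ifm[w["ic"]][oi + w["CI"]] == 1:    # gate on a spiking input
                V[w["oc"]][oi] += w["D"]           # sparse accumulation
    return V
```

With binary spike inputs, the multiply degenerates to a gated add, matching the accumulation rule above.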

2. Data Structures and Precomputation

GOAP exploits the static nature of kernel sparsity for fixed-weight networks (e.g., post-training or for SNN inference). Each weight is associated with:

  • W[nz].D: quantized (possibly binary) nonzero value
  • W[nz].RI: packed index encoding the input/output channel pair
  • W[nz].CI: spatial (kernel offset) index

The enable map EM(w) for each w is computed offline, covering every output pixel index oi for which the convolutional coverage condition is met. All kernel weight positions and metadata are thus embedded in ROM or on-chip SRAM. This permits:

  • Elimination of dynamic zero-weight fetches: if w is zero, there is neither a fetch nor a computation.
  • Schedule regularity: all iterated accumulations, "empty" iterations (input channels with no nonzeros), and "extra" iterations (output channels with no nonzeros) are counted and scheduled in advance. The control path is a static loop of total length REPS = NNZ + #empty_I + #extra_I.
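
A sketch of the offline precomputation for an illustrative stride-1, "valid" 1-D convolution follows; the function and record layout are assumptions, with field names following W[nz].D, W[nz].RI, and W[nz].CI above.

```python
import numpy as np

def precompute_goap_metadata(kernel, IC, OW, K):
    """Offline pass: extract nonzero-weight records and enable maps.

    kernel : dense trained weights indexed [oc][ic][k] (1-D spatial case)
    For a stride-1 'valid' 1-D convolution, every spatial offset k is
    covered by every output index, so the enable map is all oi in [0, OW).
    """
    records = []
    OC = kernel.shape[0]
    for oc in range(OC):
        for ic in range(IC):
            for k in range(K):
                d = kernel[oc][ic][k]
                if d != 0:                       # zero weights are dropped entirely
                    records.append({
                        "D": d,                  # quantized nonzero value
                        "RI": oc * IC + ic,      # packed channel index, as in W[nz].RI
                        "CI": k,                 # spatial offset, as in W[nz].CI
                        "EM": list(range(OW)),   # enable map: all covered outputs
                    })
    return records
```

The records can then be burned into ROM or on-chip SRAM in sparse coordinate order, so that runtime traversal needs no conditional control.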

3. Static Dataflow and Runtime Execution

The streaming datapath is strictly output-channel ordered. For each layer:

  • All processing elements (PEs) are statically assigned output channels.
  • Three pipeline stages are performed per PE: neuron state update, gated accumulation for each nonzero weight, and spike thresholding.
  • Inter-layer handoff uses lightweight per-channel FIFOs, and the absence of conditional branches in runtime control ensures every clock cycle performs a predetermined task (accumulation, state update, or IFM advance).

A schedule pseudocode reflecting GOAP’s one-to-all product principle is as follows:

For t in 0…T−1 do
  IC_read ← 0; pre_oc ← OC; oc ← 0
  nnz ← 0
  For reps in 0…REPS−1 do
    nnz_oc  ← floor(W[nnz].RI / IC)
    next_oc ← floor(W[nnz+1].RI / IC)
    if IC_read < IC:
      read IFM channel ic; IC_read++
    if nnz_oc ≠ oc:
      output spike[oc]; decay & store V[oc]
      oc++; pre_oc ← oc
      continue
    ic ← W[nnz].RI mod IC
    if ic < IC_read:
      if oc ≠ pre_oc:
        load & decay V[oc]
      for oi in EM(W[nnz]):
        if I[ic][oi + W[nnz].CI] == 1:
          V[oc][oi] += W[nnz].D
      if next_oc ≠ oc:
        output spike[oc]; store V[oc]
        oc++; pre_oc ← oc
      nnz++

This static loop mechanism ensures that memory access and compute units operate at full utilization without dynamic branching, yielding deterministic latency and predictable resource needs (Yang et al., 6 Jan 2026).
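
Since GOAP only reorders the sliding-window computation, its output must match a dense evaluation. A minimal 1-D sanity check (stride-1 "valid" convolution with binary inputs; all names here are assumed for illustration):

```python
import numpy as np

def dense_conv1d_valid(x, k):
    """Reference dense 1-D 'valid' convolution (correlation form)."""
    K, OW = len(k), len(x) - len(k) + 1
    return np.array([sum(k[j] * x[oi + j] for j in range(K)) for oi in range(OW)])

def goap_conv1d_valid(x, k):
    """Same result in GOAP order: loop over nonzero weights only, each
    scattering into every output position in its enable map."""
    K, OW = len(k), len(x) - len(k) + 1
    V = np.zeros(OW)
    for ci in range(K):
        if k[ci] == 0:
            continue                      # zero weight: no fetch, no work
        for oi in range(OW):              # enable map of a stride-1 valid conv
            if x[oi + ci] == 1:           # gate on a binary spike input
                V[oi] += k[ci]
    return V
```

Both traversals perform the same nonzero multiply-accumulates; GOAP simply makes the weight loop outermost so that zero weights never enter the schedule.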

4. Quantitative Complexity and Efficiency

A complexity comparison against sliding-window convolutional execution quantifies the GOAP benefit:

Computation      Sliding Window (SW)   GOAP (Typical SNN, Bin)
Weight Fetches   96                    12
Input Fetches    24                    48
Accumulations    48                    24

GOAP thus reduces both weight fetches and total accumulations in proportion to the kernel’s spatial sparsity. As kernel sparsity increases (up to 90%), the accumulation count decreases linearly: at 50% sparsity, only ~50% of the original accumulations are required (Yang et al., 6 Jan 2026).
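
The linear relationship can be illustrated with a toy operation-count model; this is a simplified accounting of my own, not the paper's exact figures.

```python
def op_counts(K, IC, OC, OW, sparsity):
    """Toy per-layer operation counts for a 1-D convolution.

    sparsity : fraction of kernel weights that are zero.
    SW fetches and accumulates every weight for every output position;
    GOAP fetches each nonzero weight once and accumulates it only over
    its enable map (spike gating would remove still more work).
    """
    total_w = K * IC * OC
    nnz = round(total_w * (1 - sparsity))          # surviving nonzero weights
    sw = {"weight_fetches": total_w * OW,
          "accumulations":  total_w * OW}          # includes zero-weight MACs
    goap = {"weight_fetches": nnz,                 # one fetch per nonzero weight
            "accumulations":  nnz * OW}            # upper bound, before gating
    return sw, goap
```

The accumulation ratio goap/sw equals 1 − sparsity, which is the linear scaling described above.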

5. Architectural Integration in SAOCDS

In the SAOCDS accelerator, GOAP is embedded within the Matrix-Vector Threshold Unit (MVTU) of each layer, paired with small per-layer input buffers and weight storage in sparse coordinate format. Control logic for the accumulation pipeline is minimized: iteration tracking, per-output-channel FIFOs, and static index counters replace global routers and dynamic data-fetch scheduling. This design yields fully pipelined, control-free execution across all layers, maximizing throughput and minimizing energy per sample. For example, on FPGA, a throughput of 23.5 MS/s is achieved on RadioML-2016 (2× the FINN baseline), with up to 80% energy reduction at high sparsity and negligible classification accuracy loss (Yang et al., 6 Jan 2026).

6. Comparison with Alternative Acceleration Methods

Traditional systolic arrays utilize mesh-connected homogeneous PEs with global routing and dynamic scheduling; while they exploit sparsity via runtime metadata, this imposes additional control latency, interconnect congestion, and typically necessitates explicit memory accesses for each operation. In contrast, standard streaming architectures (e.g., FINN streaming) instantiate each neural layer with local weight storage and point-to-point FIFOs, allowing for high throughput but failing to exploit spatial sparsity—zero weights still incur redundant fetch and MAC activity.

GOAP within SAOCDS uniquely provides:

  • Precise spatial sparsity utilization: Only nonzero IFM-weight interactions trigger MAC operations.
  • Control-free, high-throughput pipeline: Fully static schedule and no branching ensure that each cycle performs work with no stalling.
  • Low hardware overhead: Only a modest increase in LUT utilization (+11% vs. FINN) for significant reductions in BRAM and global control hardware at comparable or improved accuracy (Yang et al., 6 Jan 2026).

7. Impact and Applicability

The GOAP algorithm and its embedding in the SAOCDS architecture enable practical SNN and other sparse network deployments for edge signal processing and real-time classification. The precomputed one-to-all mapping permits linear scaling of sparsity benefit with minimal penalty to area and control complexity. Power and throughput improvements make this approach viable for stringent edge and embedded settings. Empirically, a 2× throughput boost, up to 5× energy reduction, and only minor hardware resource increases are achieved for SNN-based AMC tasks, all while essentially preserving classification accuracy (Yang et al., 6 Jan 2026).

References (1)
