
DS-CIM1 Variant: High Accuracy CIM

Updated 17 January 2026
  • The DS-CIM1 variant is a compute-in-memory architecture employing digital stochastic logic and sample region remapping to achieve collision-free OR accumulation.
  • It attains state-of-the-art INT8 inference accuracy of 94.45% on CIFAR-10 with ResNet-18 while maintaining a low RMSE of 0.74%.
  • The design optimizes area, power, and throughput for edge AI using a 128×32 SRAM array, OR16-based MAC units, and shared 2D-partitioned PRNGs.

A High-Accuracy DS-CIM1 Variant is a compute-in-memory (CIM) macro architecture designed to implement matrix-vector multiplications (MVM) for neural network inference with high accuracy, hardware efficiency, and compact area using a digital stochastic logic framework. This architecture emphasizes exact OR-accumulation in stochastic computing by leveraging sample region remapping, enabling mutually exclusive activation of accumulation circuits and eliminating saturation errors commonly associated with traditional OR-based approaches. DS-CIM1 achieves state-of-the-art INT8 inference accuracy (94.45% on CIFAR-10 with ResNet-18) and a root-mean-squared error (RMSE) of 0.74%, using only small bit-stream lengths. The design targets edge AI and low-power hardware accelerators (Shao et al., 10 Jan 2026).

1. Architectural Overview

DS-CIM1 is organized around a 128 × 32 SRAM array to store 8-bit weights for dense parallel MVM operations. The input activations, also in 8-bit precision, are converted to unipolar stochastic bit-streams using 128 stochastic number generators (SNGs) per row. Computation is performed by eight OR16-based unipolar OR-MAC (multiply-accumulate) units per column, with each OR16 block accumulating the products of 16 distinct rows. The OR-MAC16 units are self-contained; each integrates 16 AND gates (for bit-wise multiplication) followed by a shared OR-tree, and drives a digital accumulator. The architecture supports signed MACs via a novel “shifted” data encoding scheme in which 2’s complement inputs $x[i]$ and $w[i]$ are mapped to the unsigned forms $x'[i] = x[i] + 2^7$ and $w'[i] = w[i] + 2^7$, enabling all logic to remain unipolar and unsigned (Shao et al., 10 Jan 2026).
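The combinational core of one OR-MAC16 unit can be sketched in a few lines of Python. This is a behavioral model, not the paper's RTL; the function and argument names are illustrative:

```python
def or_mac16_cycle(act_bits, weight_bits):
    """One cycle of an OR16 MAC unit: 16 AND gates feeding a shared OR tree."""
    assert len(act_bits) == len(weight_bits) == 16
    products = [a & w for a, w in zip(act_bits, weight_bits)]  # bit-wise multiply
    return int(any(products))                                  # OR-tree output

# The digital accumulator simply counts this output over the L-cycle stream.
assert or_mac16_cycle([1] + [0] * 15, [1] + [0] * 15) == 1
assert or_mac16_cycle([0] * 16, [1] * 16) == 0
```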

The full signed dot-product is recovered through a multi-term decomposition:

$$\sum_i x[i]\,w[i] \;=\; \sum_i x'[i]\,w'[i] \;-\; 2^7\sum_i x'[i] \;-\; 2^7\sum_i w'[i] \;+\; 2^{14}N,$$

where $N$ is the vector length. The main product term is computed stochastically in hardware, the activation-dependent correction via small runtime SIMD adders, and the weight-dependent correction and the constant are precomputed and accessed from a look-up table.
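The decomposition can be checked numerically. The sketch below, with illustrative variable names, verifies the identity on random signed INT8 vectors:

```python
import random

random.seed(0)
N = 128
x = [random.randrange(-128, 128) for _ in range(N)]  # signed INT8 activations
w = [random.randrange(-128, 128) for _ in range(N)]  # signed INT8 weights

# Shifted (unsigned) encodings used inside the macro: v' = v + 2^7.
xp = [v + 128 for v in x]    # x'[i] in [0, 256)
wp = [v + 128 for v in w]    # w'[i] in [0, 256)

exact = sum(a * b for a, b in zip(x, w))

main   = sum(a * b for a, b in zip(xp, wp))  # computed stochastically in hardware
corr_x = 128 * sum(xp)                       # activation-dependent runtime correction
corr_w = 128 * sum(wp)                       # weight-dependent, precomputable
const  = 2**14 * N                           # compile-time constant

assert main - corr_x - corr_w + const == exact
```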

2. Stochastic Data Representation and Bit-Stream Computation

The high-accuracy DS-CIM1 variant adopts a stochastic computing (SC) paradigm in which each 8-bit activation and weight is represented as an $L = 256$ bit-stream (for high accuracy). The unsigned input activation $A'$ is mapped into the probability domain by comparing a global PRNG output $\mathrm{PRNG_A}[t]$ against $A'$,

$$A_\mathrm{SC}[t] = \begin{cases} 1, & \mathrm{PRNG_A}[t] < A' \\ 0, & \text{otherwise} \end{cases}$$
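A minimal behavioral model of the comparator-based SNG. It assumes an idealized full-period PRNG, modeled here as a random permutation of all 256 codes (a real macro might instead use an LFSR); with such a generator the ones-count of the stream recovers the input exactly:

```python
import random

L = 256
A_prime = 173                      # example shifted 8-bit activation, A' in [0, 256)

# Idealized full-period PRNG: every 8-bit code appears exactly once per period.
random.seed(1)
prng = random.sample(range(256), 256)

# Comparator-based SNG: emit 1 whenever PRNG_A[t] < A'.
stream = [1 if prng[t] < A_prime else 0 for t in range(L)]

# Exactly A' of the 256 codes are smaller than A', so the count is exact.
assert sum(stream) == A_prime
```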

with an analogous mapping producing weight bit-streams $W_\mathrm{SC}[t]$ from the shifted weights $w'[i]$ via a second generator $\mathrm{PRNG_W}[t]$.

The per-cycle bit-product for row $i$ is

$$p_i[t] = A_{\mathrm{SC},i}[t] \wedge W_{\mathrm{SC},i}[t],$$

and the OR-accumulation over 16 rows produces a cycle-level output $y[t] = \bigvee_{i=1}^{16} p_i[t]$. Over all $L$ cycles, the accumulated count $\sum_{t=1}^{L} y[t]$ approximates the scaled dot-product $\frac{L}{2^{16}} \sum_i x'[i]\,w'[i]$. After accumulation, the digital output is mapped back to the original signed domain by offset correction. This stochastic approach enables hardware simplicity and high parallelism.
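A toy Monte-Carlo model of a naive 16-row OR-MAC with independent bit-streams illustrates why plain OR accumulation undercounts as soon as several rows are active, which is the saturation problem addressed in the next section. The per-row probabilities and seed are illustrative:

```python
import random

random.seed(2)
L = 256
rows = 16
p = [0.25] * rows    # example per-row product probabilities x'[i]w'[i] / 2^16

or_count = 0         # what a plain OR16 accumulator reports
exact_count = 0      # what ideal (collision-free) accumulation would report
for _ in range(L):
    bits = [1 if random.random() < q else 0 for q in p]  # AND-gate bit-products
    or_count += 1 if any(bits) else 0    # OR saturates when >1 row fires
    exact_count += sum(bits)

# With sum(p) = 4, the exact count far exceeds the OR count, which is capped at L.
assert or_count <= L < exact_count
```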

3. Sample Region Remapping and Collision-Free OR Accumulation

Traditional OR-based macro architectures in stochastic computing suffer from 1s-saturation: if more than one row produces a 1, the OR gate's output remains 1, distorting the accumulated sum for non-sparse data. The DS-CIM1 variant eliminates this by introducing a 2D sample region remapping mechanism: the global PRNGs for activations and weights define a 2D unit square, which is partitioned into 16 mutually exclusive subregions, each assigned to a single row in the OR16 block. Each SNG's output comparator or logic is locally modified (via right shifts and bit-inversions) so that each row can fire only if the PRNG-generated random point falls in its uniquely assigned subregion.
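The paper's 2D shift-and-inversion remapping is not reproduced here, but the core idea can be illustrated with a 1D analogue: give each of the 16 rows a disjoint slice of a single shared random draw, so that at most one row can ever fire in a cycle and OR accumulation is collision-free by construction:

```python
import random

random.seed(3)
L, rows = 256, 16
p = [random.uniform(0.0, 1.0) for _ in range(rows)]  # per-row product probabilities

fires_per_cycle = []
for _ in range(L):
    r = random.random()                  # one shared PRNG draw per cycle
    i = int(r * rows)                    # which of the 16 disjoint slices r fell in
    local = r * rows - i                 # r rescaled to [0, 1) inside slice i
    bits = [1 if (j == i and local < p[j]) else 0 for j in range(rows)]
    fires_per_cycle.append(sum(bits))

# Disjoint slices guarantee at most one firing row per cycle: no OR collisions.
assert max(fires_per_cycle) <= 1
# Row j fires with probability p[j] / 16, so the OR count estimates
# sum(p) / 16 without any saturation error.
```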

Formally, for independent rows the probability of OR-collision (two or more rows firing in the same cycle) is

$$P_\mathrm{coll} = 1 - \prod_{i=1}^{16}(1 - p_i) - \sum_{i=1}^{16} p_i \prod_{j \neq i}(1 - p_j),$$

with $p_i = \Pr\{p_i[t] = 1\}$. In DS-CIM1, this collision probability is exactly zero, as the remapping guarantees that at most one row can be activated per cycle. Thus, the accumulated count reflects the sum of row probabilities without error due to OR collisions. This exactness is preserved even for extremely sparse or dense activation patterns.
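For independent rows with equal firing probability $p$, the two-or-more-rows collision probability has a simple closed form, which shows how quickly saturation becomes likely without remapping. A sketch; uniform $p$ across the 16 rows is an illustrative simplification:

```python
def collision_prob(p, rows=16):
    """P(two or more of `rows` independent rows fire), each with probability p."""
    none = (1 - p) ** rows                       # no row fires
    one  = rows * p * (1 - p) ** (rows - 1)      # exactly one row fires
    return 1 - none - one

# Even modest per-row activity makes collisions (hence OR saturation) dominant.
assert collision_prob(0.0) == 0.0
assert collision_prob(0.25) > 0.9
# Under DS-CIM1's sample region remapping the corresponding probability is zero.
```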

4. Quantitative Evaluation and Error Analysis

DS-CIM1 achieves state-of-the-art neural network inference accuracy in low-precision INT8 regimes. On ResNet-18/CIFAR-10, the design yields 94.45% top-1 accuracy (baseline: 94.54%) with $L = 256$ bit-streams (Shao et al., 10 Jan 2026). The RMSE is measured at 0.74%, a roughly 5-fold reduction over previous approximate DCIM (RMSE ≈ 4.03%). Reliability is uniform across all sparsities; the error is purely stochastic and decreases with increasing bit-stream length $L$. Reducing $L$ to 64 keeps accuracy at 94.00% while increasing throughput 4-fold, though RMSE rises to 3.57%. Competitive area and energy metrics are demonstrated: 669.7 TOPS/W and 117.1 TOPS/mm² (DS-CIM1 area ≈ 0.78 mm²).
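The $L$-versus-accuracy trade-off follows from standard stochastic-computing variance scaling (roughly $1/\sqrt{L}$). A Monte-Carlo sketch with an illustrative scalar probability, not the paper's benchmark:

```python
import random, math

random.seed(4)

def rmse_of_estimate(p=0.3, L=256, trials=2000):
    """RMSE of estimating probability p from an L-bit Bernoulli bit-stream."""
    errs = []
    for _ in range(trials):
        ones = sum(1 for _ in range(L) if random.random() < p)
        errs.append(ones / L - p)
    return math.sqrt(sum(e * e for e in errs) / trials)

r256 = rmse_of_estimate(L=256)
r64  = rmse_of_estimate(L=64)
# Shorter streams trade accuracy for throughput: RMSE grows ~2x from L=256 to L=64.
assert r64 > r256
```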

Comparison with pure-digital and pure-analog CIMs shows that while pure-digital approaches may offer higher theoretical TOPS/W, the area cost for sign-extended MVMs is substantially higher and density is lower (e.g., 0.512 Mb/mm² for 22 nm all-digital full-precision CIMs). Pure-analog arrays, by comparison, suffer worse linearity (INL > 1 LSB) and higher worst-case errors, on the order of 0.8% RMS (Konno et al., 25 Aug 2025).

5. Hardware Efficiency and Scalability

DS-CIM1 achieves high throughput via architectural replication: each column carries eight OR16 blocks, each handling 16 rows. The use of a shared 2D-partitioned PRNG eliminates the need for per-row PRNGs and enables deterministic, mutually exclusive control, yielding both area and energy efficiency. End-to-end latency is dominated by the bit-stream length $L$; area overhead is limited to the additional comparators and accumulators required per OR16 macro. Power consumption is dominated by the SNGs and accumulators.

This architecture is readily scalable to larger models and higher dimensions. The original work demonstrates successful deployment on INT8 ResNet-50 on ImageNet and FP8 LLaMA-7B models using the same macro design (Shao et al., 10 Jan 2026). Accuracy remains robust as the primary error channel is the controllable stochastic sampling variance.

6. Design Trade-offs and Application Context

Key trade-offs in DS-CIM1 relate to the selection of bit-stream length $L$ (accuracy versus throughput), OR-block granularity, and the proportional area allocated to stochastic versus deterministic logic. DS-CIM1’s error is strictly uniform and data-independent due to the collision-free property conferred by sample region partitioning; this guarantees resilience against wide swings in input sparsity, an issue that hinders traditional OR-MAC architectures. The design is optimized for edge AI accelerators and compact neural coprocessors where area, power, and accuracy must be co-optimized.

A plausible implication is that this architecture, by breaking the classical accuracy-throughput bottleneck of stochastic CIM, enables deployment of high-precision neural models to highly resource-constrained environments without requiring complex analog calibration or large area digital accumulators.


| Metric | DS-CIM1 High-Accuracy | Previous Approximate DCIM | Pure-Analog CIM (typ.) |
|---|---|---|---|
| RMSE | 0.74% | ≈4.03% | ≈0.8% |
| Density (Mb/mm²) | 1.80 | 0.512 | Lower |
| TOPS/W (max) | 35.0–669.7 | Higher (theor.) | Lower |
| Area / block (mm²) | 0.0365–0.78 | Larger | Higher |
