3D Stack In-Sensor-Computing (3DS-ISC)
- 3D Stack In-Sensor-Computing (3DS-ISC) is a vertically integrated system that tightly stacks sensor, memory, and compute layers to enable efficient edge AI processing.
- It minimizes off-pixel data movement using advanced interconnects like Cu–Cu hybrid bonding and TSVs, which significantly cuts power consumption and latency.
- Empirical evaluations demonstrate that 3DS-ISC supports event vision, AR/VR, and DNN acceleration with major gains in energy efficiency, silicon area, and bandwidth.
3D Stack In-Sensor-Computing (3DS-ISC) architecture refers to the vertical integration of sensing, memory, and computational logic in a multi-die, wafer-stacked system, where visual signal acquisition and significant portions of signal processing or inference are performed within or immediately below the sensor plane. This paradigm minimizes off-pixel data movement, dramatically cuts power and latency, and enables real-time edge AI workloads. The architecture and its derivatives are crucial for applications such as edge AI vision, neuromorphic event processing, and AR/VR, providing orders-of-magnitude improvements in power, area, and bandwidth efficiency versus conventional 2D and off-sensor compute approaches (Tain et al., 18 Jun 2025, Shang et al., 23 Dec 2025, Kaiser et al., 2023, Gomez et al., 2022).
1. Physical Structure and Layer Functions
3DS-ISC architectures are characterized by tightly stacked vertical integration, typically comprising at least three functionally distinct tiers, interconnected via direct hybrid bonding and dense Through-Silicon Vias (TSVs) or micro-bumps.
- Top (Sensor) Layer: Implements a photodiode array (CMOS CIS, DVS, or DPS), realizing the physical interface to the scene (e.g., 4096×3072 RGB pixels at 40 nm CMOS in J3DAI (Tain et al., 18 Jun 2025), standard 65 nm DVS in event cameras (Shang et al., 23 Dec 2025), or BI-CIS with analog front-end (Kaiser et al., 2023)).
- Middle Layer: Integrates analog front-end (correlated double sampling, column ADCs), local memory (e.g., SRAM, eDRAM for timestamping or in-memory computing), and in some systems, analog MACs (with RRAM, SRAM, or eDRAM). Logic subsystems including host CPUs (e.g., RISC-V), image signal processors, and interface controllers (HSI, DMA) reside here.
- Bottom Layer: Hosts high-density compute engines (custom DNN accelerators, binary engines, or programmable SIMD), SRAM/L2 caches, and weight or feature map memory. In some event-driven or AR/VR variants, this tier includes STT-MRAM banks for low-leakage weight storage (Gomez et al., 2022), or peripheral I/O and logic circuits.
Inter-die data movement is provided via ultra-fine pitch Cu–Cu hybrid bonding (1–10 µm pitch, <0.5 fJ/byte), HD-TSVs (1–2 µm pitch, >1 TB/s/mm² aggregate bandwidth (Tain et al., 18 Jun 2025)), or other wafer-level integration methods. Table 1 summarizes major stack components across leading designs.
| Stack Function | J3DAI (Tain et al., 18 Jun 2025) | 3DS-ISC DVS (Shang et al., 23 Dec 2025) | P²M (Kaiser et al., 2023) |
|---|---|---|---|
| Sensor Layer | CMOS CIS photodiode, 12 MP | 65 nm DVS pixel, 320×240 | BI-CIS PD, CDS front-end |
| Memory Layer | ISP, L2 SRAM (2 MB), ADC | 6T-1C eDRAM w/ MOMCAP, LL switch | RRAM per weight, in-pixel |
| Compute Layer | DNN acc. (153.6 GMAC/s peak) | Peripheral logic (decode) | ADCs, up/down MACs, digital |
| TSV/Bonding | 200+ Gb/s/mm², 2 µm pitch | Cu–Cu micro-bump, TSV (pwr) | Cu–Cu bonding, TSV |
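A back-of-envelope calculation makes these interconnect figures concrete. The sketch below (the 3 bytes/pixel RGB framing is our assumption, not a figure from the cited designs) estimates the energy and time to move one full 12 MP frame vertically through the stack at the quoted hybrid-bond energy and TSV bandwidth:

```python
# Rough sanity check on the vertical-interconnect figures above
# (3 bytes/pixel RGB is an assumption, not from the cited papers).

frame_bytes = 4096 * 3072 * 3                    # one 12 MP RGB frame
bond_energy_nj = frame_bytes * 0.5e-15 * 1e9     # <0.5 fJ/byte Cu-Cu hybrid bonding
stream_time_us = frame_bytes / 1e12 * 1e6        # through 1 mm^2 at ~1 TB/s/mm^2

print(round(bond_energy_nj, 1))   # -> 18.9 nJ to move the frame across the stack
print(round(stream_time_us, 1))   # -> 37.7 us to stream it through 1 mm^2 of TSVs
```

At tens of nanojoules and tens of microseconds per frame, vertical transfer is negligible next to off-stack I/O, which is the architectural point of the bonding technologies in the table.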
2. In-Sensor and Near-Sensor Computing Elements
3DS-ISC leverages computation physically and/or electrically adjacent to the sensor array, partitioning workloads to optimize for energy, latency, and bandwidth:
- Analog In-Pixel Processing: In P²M, RRAM cells directly under each photodiode implement per-pixel analog multiply-accumulate; convolution kernels are mapped as RRAM weights, and MAC accumulation occurs in the analog domain prior to ADC conversion (Kaiser et al., 2023). In 3DS-ISC event sensors, eDRAM capacitors implement in-pixel exponential decay, serving as analog time-surface normalizers for event timestamps (Shang et al., 23 Dec 2025).
- Digital and Mixed-Signal DNN Acceleration: Multi-cluster SIMD DNN engines in logic layers support high-parallelism INT8 (or binary op) inference (e.g., 768 MAC/cycle @200 MHz, 153.6 GMAC/s in J3DAI (Tain et al., 18 Jun 2025)), orchestrated by DMA and RISC-V hosts.
- In-Memory Computing: SRAM, eDRAM, and STT-MRAM are arranged to allow direct or near-memory compute (e.g., feature map tiling, reduced data shuffle), lowering both off-stack communication and static power.
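The in-pixel exponential decay used for time surfaces can be modeled behaviorally. The sketch below is a digital emulation of the analog eDRAM decay described for 3DS-ISC event sensors, not the circuit itself; the decay constant `tau` (the RC product of the eDRAM capacitor in hardware) and the array shape are illustrative assumptions:

```python
import numpy as np

def time_surface(events, t_now, shape=(240, 320), tau=50e-3):
    """Behavioral model of the in-pixel exponential decay in 3DS-ISC event
    sensors: each pixel holds exp(-(t_now - t_last)/tau), where t_last is the
    timestamp of its most recent event. `events` is a list of (x, y, t)
    tuples; tau is a hypothetical decay constant (the eDRAM RC product)."""
    t_last = np.full(shape, -np.inf)           # no event yet -> surface value 0
    for x, y, t in events:
        if t <= t_now:
            t_last[y, x] = max(t_last[y, x], t)
    return np.exp(-(t_now - t_last) / tau)     # analog decay, emulated digitally

# Toy usage: two events of different ages yield different surface values.
evts = [(10, 5, 0.00), (11, 5, 0.04)]
surf = time_surface(evts, t_now=0.05)
print(surf[5, 10], surf[5, 11])   # the older event has decayed further
```

In the 3D stack this normalization happens passively in the pixel's analog domain, so no digital timestamp arithmetic or bus traffic is needed per event.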
3. Interconnects, Dataflow, and Memory Hierarchy
3DS-ISC systems are defined by high-bandwidth, low-latency vertical interconnects:
- Cu–Cu Hybrid Bonds and Micro-TSVs: Enable direct pixel–memory or sensor–compute connections, achieving >1 TB/s/mm² at 200 MHz (J3DAI), supporting full-frame or sub-sampled pixel transfer at up to multi-Tb/s (e.g., 256×192×8b pixels @200 MHz ≈7.86 Tb/s (Tain et al., 18 Jun 2025)).
- On-chip Data Movement: DMA and DMPA units provide 1024 bits/clock transfer between SRAM banks and neural clusters; automatic index units and multicast register architectures reduce instruction and weight-load latency (Tain et al., 18 Jun 2025). In event cameras, direct event pulse routing via individual EV lines to the co-located eDRAM avoids energy and congestion associated with conventional shared digital buses (Shang et al., 23 Dec 2025).
- Memory Hierarchies: L2 SRAM (2–3 MB/stack), L1 SRAM per neural cluster (e.g., 2×512 KB), in-pixel memory (RRAM/eDRAM), and hybrid STT-MRAM enable extremely localized storage of weights and activations, with memory optimization via post-training quantization (Aidge framework: FP32→INT8 yields 4× storage savings; active power lowered by 20–30% (Tain et al., 18 Jun 2025); hybrid SRAM/STT-MRAM reduces on-sensor power by 39% (Gomez et al., 2022)).
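The 4× storage saving from FP32→INT8 quantization follows directly from the byte widths. The sketch below is a minimal symmetric post-training quantizer for illustration only (it is not the Aidge implementation; the per-tensor scale scheme is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(w):
    """Minimal symmetric post-training quantization sketch (illustrative,
    not the Aidge implementation): map FP32 weights to INT8 with a single
    per-tensor scale. Storage drops 4x (4 bytes -> 1 byte per weight)."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
print(q.nbytes / w.nbytes)                       # -> 0.25 (4x storage saving)
print(np.abs(w - dequantize(q, s)).max() <= s)   # error bounded by one quantization step
```

This is what makes "weights fit entirely in on-stack SRAM" feasible: the same model that needed external DRAM at FP32 can sit in a 2–3 MB L2 at INT8.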
4. Performance Metrics and Efficiency Gains
3DS-ISC architectures deliver substantial improvements in energy, area, bandwidth, and latency.
- Energy and Power: For event cameras, 3DS-ISC achieves 0.04 pJ/event (vs 2.8 pJ for 2D-ISC and 60 pJ for 16-bit SRAM), with total static leakage of 0.5 µW for a 320×240 array (vs 35 mW conventional) (Shang et al., 23 Dec 2025). For DNN acceleration, J3DAI runs MobileNetV2 (INT8) at 4.04 ms/inference (≈248 fps) at 186.7 mW, corresponding to ≈0.75 mJ per inference (Tain et al., 18 Jun 2025).
- Area: Per-pixel compute elements shrink to 20 µm² in event cameras, a 1.9–3× reduction over 2D/16-bit digital; DNN accelerator + L2 SRAM occupies 16 mm² of the bottom die in J3DAI (die-limited by pixel array area) (Shang et al., 23 Dec 2025, Tain et al., 18 Jun 2025).
- Bandwidth: Vertical interconnects (TSVs, micro-bumps) provide up to order-Tb/s aggregate bandwidth, enabling local movement of full sensor frames while off-stack I/O (e.g., MIPI in AR/VR) is minimized (>10× reduction for region-of-interest workload extraction (Gomez et al., 2022)).
- Latency: Write latency per event in 3DS-ISC is 5 ns (vs 11–15 ns for alternatives); in P²M, analog in-pixel processing completes before the ADC ramp; local DNN inference latency is in the single-millisecond range (Shang et al., 23 Dec 2025, Tain et al., 18 Jun 2025).
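The headline efficiency figures are mutually consistent, as a quick cross-check shows:

```python
# Cross-checking the efficiency figures quoted in this section.

power_w = 186.7e-3           # J3DAI active power during MobileNetV2 inference
latency_s = 4.04e-3          # reported per-inference latency

print(round(power_w * latency_s * 1e3, 3))  # -> 0.754 mJ, matching ~0.75 mJ/inference
print(round(1.0 / latency_s))               # -> 248 frames/s
print(round(2.8 / 0.04))                    # -> 70x: 2D-ISC vs 3DS-ISC energy/event
```

The per-inference energy is simply power × latency, and the ~70× per-event gap versus 2D-ISC is where most of the event-camera efficiency claim comes from.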
5. Algorithmic and Co-Design Considerations
Optimal use of 3DS-ISC requires strong co-design across hardware, circuits, and algorithms:
- Quantization and Hardware-Aware Training: Post-training quantization, as in the Aidge framework, compresses models to fit in-sensor SRAM-only operation (Tain et al., 18 Jun 2025). Hardware-in-the-loop training with analog non-idealities (e.g., RRAM variability, ADC noise, f_nonlin fitting functions) supports algorithmic resilience (Kaiser et al., 2023).
- Network Partitioning and Dataflow: For distributed AR/VR, task partitioning assigns shallow DetNet inference to the sensor stack, passing only compressed ROI activation maps (as little as 10% of the raw image data) off-chip. Toolchains such as DORY automate tiling and L1/L2 sizing (Gomez et al., 2022).
- Device and Process Scaling: FDSOI (for low leakage/thermal), eDRAM with ultra-low-leakage LL switches, and RRAM advances support both scalability and reliability in aggressive nodes (down to 16 nm for logic and memory layers (Gomez et al., 2022); potential for eDRAM scaling to sub-10 nm (Shang et al., 23 Dec 2025)).
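Hardware-aware training of the kind cited above can be sketched as noise injection into the forward pass. The code below is a hypothetical illustration (not the P²M training code): multiplicative Gaussian noise on each weight stands in for RRAM conductance variability, so the network is evaluated under the analog non-ideality it must tolerate:

```python
import numpy as np

rng = np.random.default_rng(0)

def mac_with_rram_variability(x, w, sigma=0.02):
    """Hypothetical hardware-aware forward pass (illustrative, not P2M's):
    multiplicative Gaussian noise on each weight emulates RRAM conductance
    variability in the analog MAC, so training sees the non-ideality."""
    w_dev = w * (1.0 + sigma * rng.standard_normal(w.shape))  # per-device spread
    return x @ w_dev                                          # analog-domain MAC

x = rng.standard_normal((4, 16))
w = rng.standard_normal((16, 8))
ideal = x @ w
noisy = mac_with_rram_variability(x, w)
rel_err = np.abs(noisy - ideal).max() / np.abs(ideal).max()
print(rel_err < 0.25)   # modest perturbation at 2% device variability
```

Training with such perturbations in the loop is what lets the quantized, analog-evaluated network hold accuracy close to its ideal digital counterpart.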
6. Application Demonstrations and Empirical Results
3DS-ISC architectures have been evaluated across classification, segmentation, denoising, and reconstruction tasks:
- Event Camera Vision: 3DS-ISC construction of analog time-surfaces enables state-of-the-art results for event-driven tasks: 99% on N-MNIST, 85% on N-Caltech101, 78% on CIFAR10-DVS, 97% on DVS128 Gesture, matching or exceeding digital and state-of-the-art approaches (Shang et al., 23 Dec 2025). Spatio-temporal denoising achieves ROC-AUC 0.96 (hotel-bar), 0.86 (driving), essentially lossless compared to digital (Shang et al., 23 Dec 2025).
- Conventional Frame-Based CV: P²M achieves 4×–30× bandwidth reduction, 0.2–0.7× energy/frame, and <2% accuracy degradation on benchmarks (Visual Wake Words, BDD100K) versus standard CIS + SoC pipelines (Kaiser et al., 2023).
- Edge AI in Imaging: J3DAI demonstrates competitive throughput (135–248 fps) on MobileNet and segmentation tasks with sub-0.75 mJ/inference, enabled by full INT8 model SRAM fitting and zero need for external DRAM (Tain et al., 18 Jun 2025).
- AR/VR System Integration: On-sensor distributed DetNet in AR/VR hand tracking reduces system power by 24% (compared to aggregator-only) and attains sub-10 ms end-to-end latency, with privacy preserved by keeping raw imagery on-stack (Gomez et al., 2022).
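The ROI-based bandwidth reduction in the AR/VR systems can be sketched in a few lines. The function below is a schematic of the dataflow only; the frame size and the `box` produced by the on-stack detector are hypothetical:

```python
import numpy as np

def roi_payload(frame, box):
    """Sketch of the ROI dataflow described for the AR/VR systems: an
    on-stack detector localizes a region and only that crop leaves the
    sensor, instead of the full frame. `box` = (x, y, w, h) stands in
    for the hypothetical detector output."""
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]
    return crop, crop.size / frame.size   # fraction of raw data sent off-chip

frame = np.zeros((480, 640), dtype=np.uint8)
crop, frac = roi_payload(frame, (100, 80, 160, 120))
print(frac)   # 160*120 / (640*480) = 0.0625, i.e. a 16x payload reduction
```

Keeping the full frame on-stack and exporting only the crop is also the mechanism behind the privacy claim: raw imagery never crosses the MIPI link.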
7. Limitations, Trade-Offs, and Future Directions
3DS-ISC faces several scaling, reliability, and circuit-level limitations:
- Thermal and Bonding Constraints: TSV density is limited by thermal load and area overhead (<5% area penalty at 2 µm pitch is a typical compromise (Tain et al., 18 Jun 2025)), while the bond pitch vs. pixel pitch trade-off can impact pixel density unless monolithic 3D or advanced Cu–Cu is used (Kaiser et al., 2023).
- Analog Variability: RRAM and eDRAM cell variability (<2% V(t) variation per cell in 3DS-ISC (Shang et al., 23 Dec 2025)), amplifier mismatch, and digital quantization limit precision for the most sensitive applications.
- Memory–Compute Partitioning: The pixel die area remains a hard upper bound, forcing compute and memory resource budgets to track sensor format (“top-die-limited” system design (Tain et al., 18 Jun 2025)).
- Scaling and Extensions: Potential extensions include per-channel (ON/OFF) in-pixel computation (doubling area, boosting event classification accuracy by ∼6% on CIFAR10-DVS (Shang et al., 23 Dec 2025)), stackable mixed-signal CNNs, and eDRAM/FinFET process scaling to extend analog retention (Shang et al., 23 Dec 2025).
A plausible implication is that systematic hardware/software co-design, with automated HW-aware training and energy/power accounting, is necessary for continued architectural optimization (Kaiser et al., 2023, Gomez et al., 2022). Scaling 3DS-ISC for broader applications will require further integration of advanced device structures and adaptive circuit calibration schemes.