3D Stack In-Sensor-Computing (3DS-ISC)
- 3D Stack In-Sensor-Computing (3DS-ISC) is a vertically integrated system that tightly stacks sensor, memory, and compute layers to enable efficient edge AI processing.
- It minimizes off-pixel data movement using advanced interconnects like Cu–Cu hybrid bonding and TSVs, which significantly cuts power consumption and latency.
- Empirical evaluations demonstrate that 3DS-ISC supports event vision, AR/VR, and DNN acceleration with major gains in energy efficiency, silicon area, and bandwidth.
3D Stack In-Sensor-Computing (3DS-ISC) architecture refers to the vertical integration of sensing, memory, and computational logic in a multi-die, wafer-stacked system, where visual signal acquisition and significant portions of signal processing or inference are performed within or immediately below the sensor plane. This paradigm minimizes off-pixel data movement, dramatically cuts power and latency, and enables real-time edge AI workloads. The architecture and its derivatives are crucial for applications such as edge AI vision, neuromorphic event processing, and AR/VR, providing orders-of-magnitude improvements in power, area, and bandwidth efficiency versus conventional 2D and off-sensor compute approaches (Tain et al., 18 Jun 2025, Shang et al., 23 Dec 2025, Kaiser et al., 2023, Gomez et al., 2022).
1. Physical Structure and Layer Functions
3DS-ISC architectures are characterized by tightly stacked vertical integration, typically comprising at least three functionally distinct tiers, interconnected via direct hybrid bonding and dense Through-Silicon Vias (TSVs) or micro-bumps.
- Top (Sensor) Layer: Implements a photodiode array (CMOS CIS, DVS, or DPS), realizing the physical interface to the scene (e.g., 4096×3072 RGB pixels at 40 nm CMOS in J3DAI (Tain et al., 18 Jun 2025), standard 65 nm DVS in event cameras (Shang et al., 23 Dec 2025), or BI-CIS with analog front-end (Kaiser et al., 2023)).
- Middle Layer: Integrates analog front-end (correlated double sampling, column ADCs), local memory (e.g., SRAM, eDRAM for timestamping or in-memory computing), and in some systems, analog MACs (with RRAM, SRAM, or eDRAM). Logic subsystems including host CPUs (e.g., RISC-V), image signal processors, and interface controllers (HSI, DMA) reside here.
- Bottom Layer: Hosts high-density compute engines (custom DNN accelerators, binary engines, or programmable SIMD), SRAM/L2 caches, and weight or feature map memory. In some event-driven or AR/VR variants, this tier includes STT-MRAM banks for low-leakage weight storage (Gomez et al., 2022), or peripheral I/O and logic circuits.
Inter-die data movement is provided via ultra-fine pitch Cu–Cu hybrid bonding (1–10 µm pitch, <0.5 fJ/byte), HD-TSVs (1–2 µm pitch, >1 TB/s/mm² aggregate bandwidth (Tain et al., 18 Jun 2025)), or other wafer-level integration methods. Table 1 summarizes major stack components across leading designs.
| Stack Function | J3DAI (Tain et al., 18 Jun 2025) | 3DS-ISC DVS (Shang et al., 23 Dec 2025) | P²M (Kaiser et al., 2023) |
|---|---|---|---|
| Sensor Layer | CMOS CIS photodiode, 12 MP | 65 nm DVS pixel, 320×240 | BI-CIS PD, CDS front-end |
| Memory Layer | ISP, L2 SRAM (2 MB), ADC | 6T-1C eDRAM w/ MOMCAP, LL switch | RRAM per weight, in-pixel |
| Compute Layer | DNN acc. (153.6 GMAC/s peak) | Peripheral logic (decode) | ADCs, up/down MACs, digital |
| TSV/Bonding | 200+ Gb/s/mm², 2 µm pitch | Cu–Cu micro-bump, TSV (pwr) | Cu–Cu bonding, TSV |
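A back-of-envelope calculation makes these interconnect figures concrete. The sketch below (the 3 bytes/pixel RGB framing is our assumption, not a figure from the cited designs) estimates the energy and time to move one full 12 MP frame vertically through the stack at the quoted hybrid-bond energy and TSV bandwidth:

```python
# Rough sanity check on the vertical-interconnect figures above
# (3 bytes/pixel RGB is an assumption, not from the cited papers).

frame_bytes = 4096 * 3072 * 3                    # one 12 MP RGB frame
bond_energy_nj = frame_bytes * 0.5e-15 * 1e9     # <0.5 fJ/byte Cu-Cu hybrid bonding
stream_time_us = frame_bytes / 1e12 * 1e6        # through 1 mm^2 at ~1 TB/s/mm^2

print(round(bond_energy_nj, 1))   # -> 18.9 nJ to move the frame across the stack
print(round(stream_time_us, 1))   # -> 37.7 us to stream it through 1 mm^2 of TSVs
```

At tens of nanojoules and tens of microseconds per frame, vertical transfer is negligible next to off-stack I/O, which is the architectural point of the bonding technologies in the table.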
2. In-Sensor and Near-Sensor Computing Elements
3DS-ISC leverages computation physically and/or electrically adjacent to the sensor array, partitioning workloads to optimize for energy, latency, and bandwidth:
- Analog In-Pixel Processing: In P²M, RRAM cells directly under each photodiode implement per-pixel analog multiply-accumulate; convolution kernels are mapped as RRAM weights, and MAC accumulation occurs in the analog domain prior to ADC conversion (Kaiser et al., 2023). In 3DS-ISC event sensors, eDRAM capacitors implement in-pixel exponential decay, serving as analog time-surface normalizers for event timestamps (Shang et al., 23 Dec 2025).
- Digital and Mixed-Signal DNN Acceleration: Multi-cluster SIMD DNN engines in logic layers support high-parallelism INT8 (or binary op) inference (e.g., 768 MAC/cycle @200 MHz, 153.6 GMAC/s in J3DAI (Tain et al., 18 Jun 2025)), orchestrated by DMA and RISC-V hosts.
- In-Memory Computing: SRAM, eDRAM, and STT-MRAM are arranged to allow direct or near-memory compute (e.g., feature map tiling, reduced data shuffle), lowering both off-stack communication and static power.
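The in-pixel exponential decay used for time surfaces can be modeled behaviorally. The sketch below is a digital emulation of the analog eDRAM decay described for 3DS-ISC event sensors, not the circuit itself; the decay constant `tau` (the RC product of the eDRAM capacitor in hardware) and the array shape are illustrative assumptions:

```python
import numpy as np

def time_surface(events, t_now, shape=(240, 320), tau=50e-3):
    """Behavioral model of the in-pixel exponential decay in 3DS-ISC event
    sensors: each pixel holds exp(-(t_now - t_last)/tau), where t_last is the
    timestamp of its most recent event. `events` is a list of (x, y, t)
    tuples; tau is a hypothetical decay constant (the eDRAM RC product)."""
    t_last = np.full(shape, -np.inf)           # no event yet -> surface value 0
    for x, y, t in events:
        if t <= t_now:
            t_last[y, x] = max(t_last[y, x], t)
    return np.exp(-(t_now - t_last) / tau)     # analog decay, emulated digitally

# Toy usage: two events of different ages yield different surface values.
evts = [(10, 5, 0.00), (11, 5, 0.04)]
surf = time_surface(evts, t_now=0.05)
print(surf[5, 10], surf[5, 11])   # the older event has decayed further
```

In the 3D stack this normalization happens passively in the pixel's analog domain, so no digital timestamp arithmetic or bus traffic is needed per event.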
3. Interconnects, Dataflow, and Memory Hierarchy
3DS-ISC systems are defined by high-bandwidth, low-latency vertical interconnects:
- Cu–Cu Hybrid Bonds and Micro-TSVs: Enable direct pixel–memory or sensor–compute connections, achieving >1 TB/s/mm² at 200 MHz (J3DAI), supporting full-frame or sub-sampled pixel transfer at up to multi-Tb/s (e.g., 256×192×8b pixels @200 MHz ≈7.86 Tb/s (Tain et al., 18 Jun 2025)).
- On-chip Data Movement: DMA and DMPA units provide 1024 bits/clock transfer between SRAM banks and neural clusters; automatic index units and multicast register architectures reduce instruction and weight-load latency (Tain et al., 18 Jun 2025). In event cameras, direct event pulse routing via individual EV lines to the co-located eDRAM avoids energy and congestion associated with conventional shared digital buses (Shang et al., 23 Dec 2025).
- Memory Hierarchies: L2 SRAM (2–3 MB/stack), L1 SRAM per neural cluster (e.g., 2×512 KB), in-pixel memory (RRAM/eDRAM), and hybrid STT-MRAM enable extremely localized storage of weights and activations, with memory optimization via post-training quantization (Aidge framework: FP32→INT8 yields 4× storage savings; active power lowered by 20–30% (Tain et al., 18 Jun 2025); hybrid SRAM/STT-MRAM reduces on-sensor power by 39% (Gomez et al., 2022)).
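The 4× storage saving from FP32→INT8 quantization follows directly from the byte widths. The sketch below is a minimal symmetric post-training quantizer for illustration only (it is not the Aidge implementation; the per-tensor scale scheme is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(w):
    """Minimal symmetric post-training quantization sketch (illustrative,
    not the Aidge implementation): map FP32 weights to INT8 with a single
    per-tensor scale. Storage drops 4x (4 bytes -> 1 byte per weight)."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
print(q.nbytes / w.nbytes)                       # -> 0.25 (4x storage saving)
print(np.abs(w - dequantize(q, s)).max() <= s)   # error bounded by one quantization step
```

This is what makes "weights fit entirely in on-stack SRAM" feasible: the same model that needed external DRAM at FP32 can sit in a 2–3 MB L2 at INT8.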
4. Performance Metrics and Efficiency Gains
3DS-ISC architectures deliver substantial improvements in energy, area, bandwidth, and latency.
- Energy and Power: For event cameras, 3DS-ISC achieves 0.04 pJ/event (vs 2.8 pJ for 2D-ISC and 60 pJ for 16-bit SRAM), with total static leakage of 0.5 µW for a 320×240 array (vs 35 mW conventional) (Shang et al., 23 Dec 2025). For DNN acceleration, J3DAI runs MobileNetV2 (INT8) at 4.04 ms/inference (≈248 fps) at 186.7 mW, corresponding to ≈0.75 mJ per inference (Tain et al., 18 Jun 2025).
- Area: Per-pixel compute elements shrink to 20 µm² in event cameras, a 1.9–3× reduction over 2D/16-bit digital; DNN accelerator + L2 SRAM occupies 16 mm² of the bottom die in J3DAI (die-limited by pixel array area) (Shang et al., 23 Dec 2025, Tain et al., 18 Jun 2025).
- Bandwidth: Vertical interconnects (TSVs, micro-bumps) provide up to order-Tb/s aggregate bandwidth, enabling local movement of full sensor frames while off-stack I/O (e.g., MIPI in AR/VR) is minimized (>10× reduction for region-of-interest workload extraction (Gomez et al., 2022)).
- Latency: Write latency per event in 3DS-ISC is 5 ns (vs 11–15 ns for alternatives); in P²M, analog in-pixel processing completes before the ADC ramp; local DNN inference latency is in the single-millisecond range (Shang et al., 23 Dec 2025, Tain et al., 18 Jun 2025).
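The headline efficiency figures are mutually consistent, as a quick cross-check shows:

```python
# Cross-checking the efficiency figures quoted in this section.

power_w = 186.7e-3           # J3DAI active power during MobileNetV2 inference
latency_s = 4.04e-3          # reported per-inference latency

print(round(power_w * latency_s * 1e3, 3))  # -> 0.754 mJ, matching ~0.75 mJ/inference
print(round(1.0 / latency_s))               # -> 248 frames/s
print(round(2.8 / 0.04))                    # -> 70x: 2D-ISC vs 3DS-ISC energy/event
```

The per-inference energy is simply power × latency, and the ~70× per-event gap versus 2D-ISC is where most of the event-camera efficiency claim comes from.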
5. Algorithmic and Co-Design Considerations
Optimal use of 3DS-ISC requires strong co-design across hardware, circuits, and algorithms:
- Quantization and Hardware-Aware Training: Post-training quantization, as in the Aidge framework, compresses models to fit in-sensor SRAM-only operation (Tain et al., 18 Jun 2025). Hardware-in-the-loop training with analog non-idealities (e.g., RRAM variability, ADC noise, f_nonlin fitting functions) supports algorithmic resilience (Kaiser et al., 2023).
- Network Partitioning and Dataflow: For distributed AR/VR, task partitioning assigns shallow DetNet inference to the sensor stack, passing only compressed ROI activation maps (as little as 10% of the raw image data) off-chip. Toolchains such as DORY automate tiling and L1/L2 sizing (Gomez et al., 2022).
- Device and Process Scaling: FDSOI (for low leakage/thermal), eDRAM with ultra-low-leakage LL switches, and RRAM advances support both scalability and reliability in aggressive nodes (down to 16 nm for logic and memory layers (Gomez et al., 2022); potential for eDRAM scaling to sub-10 nm (Shang et al., 23 Dec 2025)).
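Hardware-aware training of the kind cited above can be sketched as noise injection into the forward pass. The code below is a hypothetical illustration (not the P²M training code): multiplicative Gaussian noise on each weight stands in for RRAM conductance variability, so the network is evaluated under the analog non-ideality it must tolerate:

```python
import numpy as np

rng = np.random.default_rng(0)

def mac_with_rram_variability(x, w, sigma=0.02):
    """Hypothetical hardware-aware forward pass (illustrative, not P2M's):
    multiplicative Gaussian noise on each weight emulates RRAM conductance
    variability in the analog MAC, so training sees the non-ideality."""
    w_dev = w * (1.0 + sigma * rng.standard_normal(w.shape))  # per-device spread
    return x @ w_dev                                          # analog-domain MAC

x = rng.standard_normal((4, 16))
w = rng.standard_normal((16, 8))
ideal = x @ w
noisy = mac_with_rram_variability(x, w)
rel_err = np.abs(noisy - ideal).max() / np.abs(ideal).max()
print(rel_err < 0.25)   # modest perturbation at 2% device variability
```

Training with such perturbations in the loop is what lets the quantized, analog-evaluated network hold accuracy close to its ideal digital counterpart.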
6. Application Demonstrations and Empirical Results
3DS-ISC architectures have been evaluated across classification, segmentation, denoising, and reconstruction tasks:
- Event Camera Vision: 3DS-ISC construction of analog time-surfaces enables state-of-the-art results for event-driven tasks: 99% on N-MNIST, 85% on N-Caltech101, 78% on CIFAR10-DVS, 97% on DVS128 Gesture, matching or exceeding digital and state-of-the-art approaches (Shang et al., 23 Dec 2025). Spatio-temporal denoising achieves ROC-AUC 0.96 (hotel-bar), 0.86 (driving), essentially lossless compared to digital (Shang et al., 23 Dec 2025).
- Conventional Frame-Based CV: P²M achieves 4×–30× bandwidth reduction, 0.2–0.7× energy/frame, and <2% accuracy degradation on benchmarks (Visual Wake Words, BDD100K) versus standard CIS + SoC pipelines (Kaiser et al., 2023).
- Edge AI in Imaging: J3DAI demonstrates competitive throughput (135–248 fps) on MobileNet and segmentation tasks with sub-0.75 mJ/inference, enabled by full INT8 model SRAM fitting and zero need for external DRAM (Tain et al., 18 Jun 2025).
- AR/VR System Integration: On-sensor distributed DetNet in AR/VR hand tracking reduces system power by 24% (compared to aggregator-only) and attains sub-10 ms end-to-end latency, with privacy preserved by keeping raw imagery on-stack (Gomez et al., 2022).
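The ROI-based bandwidth reduction in the AR/VR systems can be sketched in a few lines. The function below is a schematic of the dataflow only; the frame size and the `box` produced by the on-stack detector are hypothetical:

```python
import numpy as np

def roi_payload(frame, box):
    """Sketch of the ROI dataflow described for the AR/VR systems: an
    on-stack detector localizes a region and only that crop leaves the
    sensor, instead of the full frame. `box` = (x, y, w, h) stands in
    for the hypothetical detector output."""
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]
    return crop, crop.size / frame.size   # fraction of raw data sent off-chip

frame = np.zeros((480, 640), dtype=np.uint8)
crop, frac = roi_payload(frame, (100, 80, 160, 120))
print(frac)   # 160*120 / (640*480) = 0.0625, i.e. a 16x payload reduction
```

Keeping the full frame on-stack and exporting only the crop is also the mechanism behind the privacy claim: raw imagery never crosses the MIPI link.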
7. Limitations, Trade-Offs, and Future Directions
3DS-ISC faces several scaling, reliability, and circuit-level limitations:
- Thermal and Bonding Constraints: TSV density is limited by thermal load and area overhead (<5% area penalty at 2 µm pitch is a typical compromise (Tain et al., 18 Jun 2025)), while the bond pitch vs. pixel pitch trade-off can impact pixel density unless monolithic 3D or advanced Cu–Cu is used (Kaiser et al., 2023).
- Analog Variability: RRAM and eDRAM cell variability (<2% V(t) variation per cell in 3DS-ISC (Shang et al., 23 Dec 2025)), amplifier mismatch, and digital quantization limit precision for the most sensitive applications.
- Memory–Compute Partitioning: The pixel die area remains a hard upper bound, forcing compute and memory resource budgets to track sensor format (“top-die-limited” system design (Tain et al., 18 Jun 2025)).
- Scaling and Extensions: Potential extensions include per-channel (ON/OFF) in-pixel computation (doubling area, boosting event classification accuracy by ∼6% on CIFAR10-DVS (Shang et al., 23 Dec 2025)), stackable mixed-signal CNNs, and eDRAM/FinFET process scaling to extend analog retention (Shang et al., 23 Dec 2025).
A plausible implication is that systematic hardware/software co-design, with automated HW-aware training and energy/power accounting, is necessary for continued architectural optimization (Kaiser et al., 2023, Gomez et al., 2022). Scaling 3DS-ISC for broader applications will require further integration of advanced device structures and adaptive circuit calibration schemes.