LIMO Macro: In-Memory Annealing & VMM
- LIMO Macro is a compute-in-memory architecture that integrates annealing-based combinatorial optimization with energy-efficient vector-matrix multiplication for NP-hard problems.
- It leverages STT-MTJ based stochasticity and true random number generation with SWAI techniques to accelerate complex problems like the Traveling Salesman Problem.
- Its modular design enables concurrent optimization and low-latency AI inference on edge devices, achieving significant energy savings and performance gains.
LIMO Macro refers to a mixed-signal computational primitive that integrates annealing-based combinatorial optimization and energy-efficient vector-matrix multiplication in a single compute-in-memory (CiM) macro architecture. Designed for applications at the edge, the LIMO macro leverages hardware stochasticity via spin-transfer-torque magnetic tunnel junctions (STT-MTJs) and tightly coupled SRAM arrays to accelerate NP-hard optimization problems—most notably the Traveling Salesman Problem (TSP)—and to enable low-latency neural network inference within the same physical substrate (Holla et al., 29 Dec 2025).
1. Physical Architecture and Circuit Components
The LIMO macro centers on an 80×80 bit-cell crossbar based on standard 8T-SRAM fabricated in 65 nm CMOS. This crossbar is divided into five 16×80 sub-arrays; within each, a 16×64 section stores 4-bit coupling weights while a 16×16 region is dedicated to spin state storage. Each bit-cell provides an extra read port for in-memory logic and supports both analog and digital readout.
For analog VMM (vector-matrix multiplication), adjacent columns implement ternary weights {–1,0,+1} using a push–pull mechanism: even columns use PMOS pull-ups and odd columns use NMOS pull-downs. Peripheral circuits include:
- True random number generators (TRNGs) realized with 16 parallel differential sense amplifiers and STT-MTJ stacks, providing stochasticity for annealing.
- Stochastic bit comparators for Bernoulli sampling, both global (16 bit, for acceptance probability) and local (4 bit, for data-dependent spin updates).
- Specialized write drivers supporting probabilistic bi-directional MTJ switching.
- SRAM scratchpads for solution logging.
- Per-column sense amplifiers for VMM quantization.
A finite-state machine (FSM) sequences all annealing and VMM operations, enabling flexible context switching between optimization and inference tasks.
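The push–pull ternary-weight column scheme can be illustrated with a small behavioral model (a software sketch of the column arithmetic only, not the analog circuit): a +1 weight contributes through the PMOS pull-up column, a −1 weight through the NMOS pull-down column, and a 0 weight through neither, so the sense amplifier effectively sees their difference.

```python
import numpy as np

def ternary_column_mac(activations, weights):
    """Behavioral model of one ternary-weight column pair: the even (PMOS
    pull-up) column accumulates +1 contributions, the odd (NMOS pull-down)
    column accumulates -1 contributions; the readout is their difference."""
    pull_up = np.sum(activations[weights == +1])    # PMOS pull-up column
    pull_down = np.sum(activations[weights == -1])  # NMOS pull-down column
    return pull_up - pull_down

a = np.array([1, 0, 1, 1])          # binary activation slice on the wordlines
w = np.array([+1, -1, 0, -1])       # ternary weights across a column pair
print(ternary_column_mac(a, w))     # 1 - 1 = 0
```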
2. In-Memory Annealing and Optimization Algorithm
The annealing engine encodes TSP or generic quadratic unconstrained binary optimization (QUBO) problems as Ising Hamiltonians of the form

$$H(\mathbf{s}) = -\sum_{i<j} J_{ij}\, s_i s_j - \sum_i h_i s_i,$$

where the couplings $J_{ij}$ and biases $h_i$ encode the problem to be solved (e.g., TSP) and are mapped to the SRAM array. Each candidate tour corresponds to a configuration of spins with binary or bipolar values, with one-hot constraints enforcing city/position assignment.
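The standard one-hot TSP-to-QUBO energy can be sketched as follows (the penalty weight `A` and the small distance matrix are illustrative assumptions, not values from the paper):

```python
import numpy as np

def tsp_energy(x, D, A=10.0):
    """QUBO/Ising-style energy for a one-hot TSP encoding.
    x[i, p] = 1 if city i occupies tour position p.
    A penalizes violated one-hot constraints; the last term is tour length."""
    n = D.shape[0]
    city_pen = np.sum((x.sum(axis=1) - 1) ** 2)  # each city used exactly once
    pos_pen = np.sum((x.sum(axis=0) - 1) ** 2)   # each position filled once
    length = sum(D[i, j] * x[i, p] * x[j, (p + 1) % n]
                 for i in range(n) for j in range(n) for p in range(n))
    return A * (city_pen + pos_pen) + length

D = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
x = np.eye(3)            # valid tour 0 -> 1 -> 2 -> 0
print(tsp_energy(x, D))  # penalties 0, length 1 + 1 + 2 = 4.0
```

Any configuration violating the one-hot constraints pays the penalty term, so low-energy spin states correspond to valid, short tours.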
Optimization proceeds via a hardware-embedded variant of simulated annealing:
- At every step, a spin (or city insertion) is potentially updated, with state acceptance controlled by a hardware-annealed Metropolis rule. The acceptance probability for an energy-increasing change is $\min\!\left(1,\, e^{-\Delta E/T}\right)$, where the temperature $T$ decreases over time.
- Rather than conventional city-swap moves, the LIMO macro uses a "Significance-Weighted Annealed Insertion" (SWAI) approach: the tour is constructed incrementally, with insertion candidates selected stochastically or greedily under a decaying probability schedule. The probability of selecting city $i$ for a position is weighted by $d_i$, its distance from the previous city, normalized against $d_{\max}$, the largest such distance in the candidate set, so that nearer cities are favored increasingly strongly as the schedule decays.
This mapping accelerates local move evaluation and randomization using parallel analog/digital operations and TRNGs, allowing each macro to solve up to five problems concurrently.
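The annealed insertion loop can be sketched in software as follows. This is a simplified stand-in for SWAI: the exponential significance weighting over $(d_{\max} - d_i)$ and the geometric temperature decay are illustrative assumptions, not the paper's exact schedule.

```python
import math
import random

def annealed_insertion_tour(D, T0=1.0, alpha=0.95, seed=0):
    """Build a tour incrementally: candidates near the previous city get
    larger selection weights, and the choice sharpens toward greedy
    nearest-neighbor insertion as the temperature T decays."""
    rng = random.Random(seed)
    n = len(D)
    tour, remaining, T = [0], set(range(1, n)), T0
    while remaining:
        prev = tour[-1]
        cands = list(remaining)
        d_max = max(D[prev][c] for c in cands)
        # significance weights: shorter hops -> larger exponent
        exps = [(d_max - D[prev][c]) / max(T, 1e-9) for c in cands]
        m = max(exps)                                  # softmax stabilization
        weights = [math.exp(e - m) for e in exps]
        city = rng.choices(cands, weights=weights)[0]
        tour.append(city)
        remaining.discard(city)
        T *= alpha                                     # decaying schedule
    return tour

D = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
print(annealed_insertion_tour(D, T0=1e-6))  # near-zero T ~ greedy: [0, 1, 2]
```

At high $T$ the selection is nearly uniform (exploration); at low $T$ it collapses onto the nearest candidate (exploitation), mirroring the stochastic-to-greedy transition described above.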
3. STT-MTJ-Based Hardware Stochasticity
Stochastic switching in perpendicular STT-MTJs (tunnel magnetoresistance ≈163%; parallel and antiparallel resistances in the kΩ range) is exploited for annealing randomness. The macro includes a compact SPICE-derived 2D Fokker–Planck model to set pulse duration and current for approximately 50% switching at nominal conditions. By integrating bidirectional write drivers and XORing the outputs of identically structured TRNG units, the macro generates unbiased Bernoulli(½) streams at high speed. These support not only global update acceptance but also per-spin and data-dependent stochastic gate operations for SWAI and VMM.
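The XOR-debiasing step has a simple numerical illustration: XORing two independent Bernoulli($p$) bits yields a one-probability of $2p(1-p)$, which is strictly closer to ½ for any $p \neq 0, 1$, so residual device bias shrinks with each XOR stage.

```python
import random

def xor_debiased_stream(p, n, seed=0):
    """XOR two independent biased Bernoulli(p) streams.
    The result is Bernoulli(2p(1-p)), closer to 1/2 than p itself."""
    rng = random.Random(seed)
    return [(rng.random() < p) ^ (rng.random() < p) for _ in range(n)]

p = 0.55                        # a TRNG whose nominal 50% point has drifted
bits = xor_debiased_stream(p, 100_000)
print(sum(bits) / len(bits))    # empirical mean ~ 2*0.55*0.45 = 0.495
```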
4. Divide-and-Conquer Hierarchical Refinement
To address TSP instances of up to 85,900 cities, LIMO incorporates a scalable hierarchical solve-refine framework:
- Cities are recursively bisected via PCA along the first principal component until clusters fall below a fixed size threshold.
- Each macro instance solves its corresponding sub-TSP via SWAI, then optional local segment passes are performed for refinement.
- Clusters are stitched through border-city identification (via FixLinks), and the resulting open TSPs for clusters are further optimized using SWAI, segment refinement, and an efficient $k$-nearest-neighbor TwoOpt local search whose per-cluster complexity is bounded by the neighbor count $k$.
- This process admits near-ideal parallelization: each subproblem maps to a separate macro, allowing full hardware utilization and linear scaling.
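The recursive PCA bisection step can be sketched as follows (splitting at the median projection and the cluster-size threshold value are assumptions consistent with, but not stated in, the description above):

```python
import numpy as np

def pca_bisect(points, idx=None, max_size=4):
    """Recursively split city coordinates along the first principal
    component until every cluster holds at most max_size cities."""
    if idx is None:
        idx = np.arange(len(points))
    if len(idx) <= max_size:
        return [idx]
    sub = points[idx] - points[idx].mean(axis=0)
    # first principal component = top right-singular vector of centered data
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    proj = sub @ vt[0]
    order = np.argsort(proj)           # sort along the principal axis
    half = len(idx) // 2
    left, right = idx[order[:half]], idx[order[half:]]
    return pca_bisect(points, left, max_size) + pca_bisect(points, right, max_size)

pts = np.random.default_rng(0).random((16, 2))
clusters = pca_bisect(pts, max_size=4)
print([len(c) for c in clusters])      # [4, 4, 4, 4]
```

Each leaf cluster then maps onto one macro instance, which is what makes the solve-refine stage embarrassingly parallel.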
5. Vector-Matrix Multiply (VMM) Mode
VMM is supported through learned step size quantization of both weights and activations:
- Activations: $\bar{a} = \big\lfloor \mathrm{clip}(a/s_a,\ 0,\ 2^{b}-1) \big\rceil \cdot s_a$.
- Weights: $\bar{w} = \big\lfloor \mathrm{clip}(w/s_w,\ -2^{b-1},\ 2^{b-1}-1) \big\rceil \cdot s_w$, for $b$-bit quantization with learned step sizes $s_a$, $s_w$.
Ternary weights are realized in the crossbar, supporting bit-serial streaming of activations, analog accumulation on bitlines, and quantization of the sign via sense amplifiers (obviating SAR-ADCs). Output scaling and gradient propagation are managed using the clipped partial sums and bit-slice logic. This mode preserves the standard SRAM VMM path, as the annealing-related peripherals are designed as modular add-ons.
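The learned-step-size quantizer above reduces to a scale–clip–round–rescale pipeline; a forward-pass-only sketch (the straight-through gradient handling that makes the step size learnable is omitted, and the step sizes here are arbitrary examples):

```python
import numpy as np

def lsq_quantize(v, s, qmin, qmax):
    """Learned-step-size quantization forward pass:
    scale by 1/s, clip to the integer grid, round, rescale by s."""
    return np.round(np.clip(v / s, qmin, qmax)) * s

b = 4
# unsigned activation grid [0, 2^b - 1]; signed weight grid [-2^(b-1), 2^(b-1)-1]
a_q = lsq_quantize(np.array([0.07, 0.9, 3.2]), s=0.1, qmin=0, qmax=2**b - 1)
w_q = lsq_quantize(np.array([-0.9, 0.04, 0.6]), s=0.05,
                   qmin=-2**(b - 1), qmax=2**(b - 1) - 1)
print(a_q)  # ~ [0.1, 0.9, 1.5]  (3.2/0.1 = 32 clips to 15 -> 1.5)
print(w_q)  # ~ [-0.4, 0.05, 0.35]
```

Out-of-range values saturate at the grid edges, which is what bounds the partial sums the bitline accumulation and sense amplifiers must resolve.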
6. Performance Metrics, Efficiency, and Comparative Results
The LIMO macro achieves the following:
- Annealing mode (100 MHz, 65 nm): approximately 25.4 cycles per insertion, handling five parallel optimization problems; total power ≈0.37 mW, with the TRNG occupying ≤8.5% of macro area.
- VMM mode: Single 1-bit partial sum per column per cycle; energy approximately 2.2 fJ per column.
- Large-scale TSP: 0.00135 mW/spin (15× lower than TAXI [baseline hardware annealer]); time-to-solution up to 5× faster than TAXI on 85,900 city instances; solution deviation ratio improved by ~37.5%.
- Edge AI inference: For ResNet-20 on CIFAR-10, 89.3% accuracy (vs. 89.5% for software baseline), 1.3–2.1× less energy, and 1.2–1.3× lower latency. For ResNet-SSD face detection, 95.7% AP (vs. 97.7%).
The table below summarizes the main efficiency claims (from (Holla et al., 29 Dec 2025)):
| Mode | Power/spin | Latency gain (vs. TAXI) | Accuracy (ResNet-20) |
|---|---|---|---|
| Annealing | 0.00135 mW | ≈5× | — |
| VMM (AI inf.) | — | 1.2–1.3× | 89.3% |
7. Scalability, Application Domains, and Prospective Enhancements
The macro’s modularity—integrating annealing peripherals as overlays on standard 8T-SRAM cores—enables both stackable sub-array scaling (five sub-arrays per macro, multiple macros per core) and deployment in spatial architectures for massive parallelism. This supports TSPs with up to ~86,000 cities and comparable-sized Max-Cut, SAT, or general QUBO problems.
Target domains include:
- Combinatorial optimization for logistics, scheduling, EDA, and chip placement
- Probabilistic computing and hardware-embedded sampling
- Analog-accelerated AI inference on edge and IoT devices
Proposed directions include:
- Deeper hardware-software co-design of combinatorial and neural algorithms
- Full-stack elimination of host CPU dependencies
- Generalization to other NP-hard problem classes via further Ising model embeddings
A plausible implication is that the combination of hardware stochasticity, in-memory computation, and hierarchical parallelism in LIMO offers an architectural template for energy-efficient, scalable edge optimization and inference—a direction of increasing relevance in decentralized and real-time AI among resource-constrained devices (Holla et al., 29 Dec 2025).