LIMO Macro: In-Memory Annealing & VMM

Updated 5 January 2026
  • LIMO Macro is a compute-in-memory architecture that integrates annealing-based combinatorial optimization with energy-efficient vector-matrix multiplication for NP-hard problems.
  • It leverages STT-MTJ based stochasticity and true random number generation with SWAI techniques to accelerate complex problems like the Traveling Salesman Problem.
  • Its modular design enables concurrent optimization and low-latency AI inference on edge devices, achieving significant energy savings and performance gains.

LIMO Macro refers to a mixed-signal computational primitive that integrates annealing-based combinatorial optimization and energy-efficient vector-matrix multiplication in a single compute-in-memory (CiM) macro architecture. Designed for applications at the edge, the LIMO macro leverages hardware stochasticity via spin-transfer-torque magnetic tunnel junctions (STT-MTJs) and tightly coupled SRAM arrays to accelerate NP-hard optimization problems—most notably the Traveling Salesman Problem (TSP)—and to enable low-latency neural network inference within the same physical substrate (Holla et al., 29 Dec 2025).

1. Physical Architecture and Circuit Components

The LIMO macro centers on an 80×80 bit-cell crossbar based on standard 8T-SRAM fabricated in 65 nm CMOS. This crossbar is divided into five 16×80 sub-arrays; within each, a 16×64 section stores 4-bit coupling weights while a 16×16 region is dedicated to spin state storage. Each bit-cell provides an extra read port for in-memory logic and supports both analog and digital readout.

For analog VMM (vector-matrix multiplication), adjacent columns implement ternary weights {–1,0,+1} using a push–pull mechanism: even columns use PMOS pull-ups and odd columns use NMOS pull-downs. Peripheral circuits include:

  • True random number generators (TRNGs) realized with 16 parallel differential sense amplifiers and STT-MTJ stacks, providing stochasticity for annealing.
  • Stochastic bit comparators for Bernoulli sampling, both global (16 bit, for acceptance probability) and local (4 bit, for data-dependent spin updates).
  • Specialized write drivers supporting probabilistic bi-directional MTJ switching.
  • SRAM scratchpads for solution logging.
  • Per-column sense amplifiers for VMM quantization.

A finite-state machine (FSM) sequences all annealing and VMM operations, enabling flexible context switching between optimization and inference tasks.

2. In-Memory Annealing and Optimization Algorithm

The annealing engine encodes TSP or generic quadratic unconstrained binary optimization (QUBO) problems as Ising Hamiltonians:

H(s) = -\sum_{\langle i,j\rangle} J_{ij} s_i s_j - \sum_i h_i s_i

where the problem to be solved (e.g., TSP) is converted into appropriate couplings J_{ij} and biases h_i mapped to the SRAM array. Each candidate tour corresponds to a configuration of N^2 spins s_{v,p} with binary or bipolar values, with one-hot constraints enforcing the city/position assignment.
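As a minimal host-side sketch, the Ising energy above can be evaluated directly from the couplings and biases; the 3-spin instance below is illustrative only and is not taken from the paper:

```python
import numpy as np

def ising_energy(J, h, s):
    """H(s) = -sum_{<i,j>} J_ij s_i s_j - sum_i h_i s_i for spins s in {-1,+1}.

    J is symmetric with zero diagonal; the 0.5 factor avoids double-counting
    each pair when summing over the full matrix.
    """
    return -0.5 * s @ J @ s - h @ s

# Toy 3-spin instance (couplings/biases are illustrative, not from the paper)
J = np.array([[0.0,  1.0, 0.0],
              [1.0,  0.0, -1.0],
              [0.0, -1.0, 0.0]])
h = np.array([0.5, 0.0, -0.5])
s = np.array([1, -1, 1])
E = ising_energy(J, h, s)
```

Flipping a single spin and re-evaluating gives the ΔH that the Metropolis rule in Section 2 accepts or rejects.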

Optimization proceeds via a hardware-embedded variant of simulated annealing:

  • At every step, a spin (or city insertion) is potentially updated, with state acceptance controlled by a hardware-annealed Metropolis rule. The acceptance probability for an energy-increasing change is \exp(-\Delta H/T), where the temperature T decreases over time.
  • Rather than conventional city-swap moves, the LIMO macro uses a "Significance-Weighted Annealed Insertion" (SWAI) approach: the tour is constructed incrementally, with insertion candidates selected stochastically or greedily based on a decaying probability schedule. The probability of selecting city j for position k is:

P_j = 1 - \frac{d_j}{d_{\max}}

where d_j is the distance from the previous city, and d_{\max} is the largest such distance in the candidate set.

This mapping accelerates local move evaluation and randomization using parallel analog/digital operations and TRNGs, allowing each macro to solve up to five problems concurrently.
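The insertion rule above can be sketched in software as follows; `swai_tour`, the exploration parameter, and its decay schedule are illustrative stand-ins for the hardware FSM and TRNG sampling, not the paper's implementation:

```python
import math
import random

def swai_tour(coords, seed=0, greedy_decay=0.9):
    """Illustrative SWAI-style incremental tour construction.

    At each step, candidate cities are weighted by P_j = 1 - d_j / d_max
    (distance from the current end city). The next city is drawn
    stochastically early on and increasingly greedily as the schedule decays.
    """
    rng = random.Random(seed)
    dist = lambda a, b: math.dist(coords[a], coords[b])
    tour = [0]
    remaining = set(range(1, len(coords)))
    explore = 1.0  # probability of a stochastic (vs. greedy) pick
    while remaining:
        cur = tour[-1]
        cands = sorted(remaining, key=lambda j: dist(cur, j))
        d_max = dist(cur, cands[-1]) or 1.0  # guard against all-zero distances
        if rng.random() < explore:
            # Significance weights: nearer candidates are more likely
            weights = [1.0 - dist(cur, j) / d_max + 1e-9 for j in cands]
            nxt = rng.choices(cands, weights=weights)[0]
        else:
            nxt = cands[0]  # greedy: nearest remaining city
        tour.append(nxt)
        remaining.remove(nxt)
        explore *= greedy_decay  # annealed schedule: stochastic -> greedy
    return tour
```

In hardware, the stochastic draw is served by the TRNG peripherals rather than a pseudorandom generator, and five such constructions can run concurrently across the sub-arrays.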

3. STT-MTJ-Based Hardware Stochasticity

Stochastic switching in perpendicular STT-MTJs (tunnel magnetoresistance ≈163%, R_P ≈ 2.44 kΩ, R_{AP} ≈ 6.41 kΩ) is exploited for annealing randomness. The macro includes a compact SPICE-derived 2D Fokker–Planck model to set pulse duration and current for approximately 50% switching at nominal conditions. By integrating bidirectional write drivers and XORing the outputs of identically structured TRNG units, the macro generates unbiased Bernoulli(½) streams at high speed. These support not only global update acceptance but also per-spin and data-dependent stochastic gate operations for SWAI and VMM.
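The XOR debiasing step can be illustrated with a simple software model of biased MTJ bits; the bias value and sample count below are arbitrary choices for the demonstration:

```python
import random

def mtj_bit(p, rng):
    """Model one STT-MTJ TRNG cell: outputs 1 with switching probability p."""
    return 1 if rng.random() < p else 0

def xor_debiased_bit(p, rng):
    """XOR two independent, identically biased MTJ bits.

    If each bit is 1 with probability p = 0.5 + e, the XOR is 1 with
    probability 2p(1-p) = 0.5 - 2e^2, so the bias shrinks quadratically
    (the piling-up lemma).
    """
    return mtj_bit(p, rng) ^ mtj_bit(p, rng)

rng = random.Random(42)
p = 0.55           # a slightly miscalibrated switching probability
n = 200_000
raw = sum(mtj_bit(p, rng) for _ in range(n)) / n
xored = sum(xor_debiased_bit(p, rng) for _ in range(n)) / n
```

Here a 5% bias in the raw stream drops to roughly 0.5% after one XOR stage, which is why the hardware can tolerate MTJs calibrated only approximately to 50% switching.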

4. Divide-and-Conquer Hierarchical Refinement

To address TSP instances of up to 85,900 cities, LIMO incorporates a scalable hierarchical solve-refine framework:

  1. Cities are recursively bisected via PCA along the first principal component until clusters of size T (typically T = 16) are formed.
  2. Each macro instance solves its corresponding sub-TSP via SWAI, then optional local segment passes are performed for refinement.
  3. Clusters are stitched through border city identification (via FixLinks), and open TSPs for clusters are further optimized using SWAI, segment refinement, and efficient K-nearest TwoOpt local search (K ≤ 20, O(nK) complexity per cluster).
  4. This process admits near-ideal parallelization: each subproblem maps to a separate macro, allowing full hardware utilization and linear scaling.
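Step 1's recursive PCA bisection can be sketched as a plain NumPy routine; the degenerate-split guard below is an added assumption, not a detail from the paper:

```python
import numpy as np

def pca_bisect(coords, max_size=16):
    """Recursively split cities along the first principal component until
    every cluster holds at most max_size cities (T = 16 in the paper)."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) <= max_size:
        return [coords]
    centered = coords - coords.mean(axis=0)
    # First right singular vector = first principal component direction
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    median = np.median(proj)
    left, right = coords[proj <= median], coords[proj > median]
    if len(left) == 0 or len(right) == 0:  # guard: all projections tied
        half = len(coords) // 2
        left, right = coords[:half], coords[half:]
    return pca_bisect(left, max_size) + pca_bisect(right, max_size)
```

Because each resulting cluster fits one macro's spin capacity, the subproblems in steps 2–3 can be dispatched to separate macros in parallel.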

5. Vector-Matrix Multiply (VMM) Mode

VMM is supported through learned step size quantization of both weights and activations:

  • Activations: x_\text{int} = \left\lfloor \text{clip}(x/\alpha, Q_N, Q_P) \right\rfloor, \quad x_q = x_\text{int} \cdot \alpha.
  • Weights: Q_N = -2^{B-1}, Q_P = 2^{B-1} - 1 for B-bit quantization.

Ternary weights are realized in the crossbar, supporting bit-serial streaming of activations, analog accumulation on bitlines, and quantization of the sign via sense amplifiers (obviating SAR-ADCs). Output scaling and gradient propagation are managed using the clipped partial sums and bit-slice logic. This mode preserves the standard SRAM VMM path, as the annealing-related peripherals are designed as modular add-ons.
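A software model of the quantization and sign-only readout might look like the sketch below; `sign_vmm` abstracts the analog bitline accumulation and sense-amplifier readout as an integer dot product followed by a sign, which is a simplification of the mixed-signal path:

```python
import numpy as np

def lsq_quantize(x, alpha, bits):
    """Learned-step-size style quantization, following the formulas above:
    x_int = floor(clip(x / alpha, Q_N, Q_P)), x_q = x_int * alpha."""
    q_n, q_p = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    x_int = np.floor(np.clip(x / alpha, q_n, q_p))
    return x_int * alpha

def sign_vmm(w_ternary, x_bits):
    """Model one bit-serial VMM step: ternary weights {-1, 0, +1} accumulate a
    1-bit activation slice on the bitline; the per-column sense amplifier
    keeps only the sign (no SAR-ADC needed)."""
    partial = w_ternary @ x_bits            # analog bitline accumulation
    return np.sign(partial).astype(int)     # SA quantizes the sign
```

For example, `lsq_quantize` with α = 0.5 and B = 4 clips integer codes to [-8, 7] before rescaling, which mirrors how the crossbar's limited weight precision bounds the partial sums.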

6. Performance Metrics, Efficiency, and Comparative Results

The LIMO macro achieves the following:

  • Annealing mode (100 MHz, 65 nm): Approximately 25.4 cycles per insertion, handling five parallel optimization problems, total power ≈0.37 mW (TRNG ≤ 8.5% macro area).
  • VMM mode: Single 1-bit partial sum per column per cycle; energy approximately 2.2 fJ per column.
  • Large-scale TSP: 0.00135 mW/spin (15× lower than TAXI [baseline hardware annealer]); time-to-solution up to 5× faster than TAXI on 85,900 city instances; solution deviation ratio improved by ~37.5%.
  • Edge AI inference: For ResNet-20 on CIFAR-10, 89.3% accuracy (vs. 89.5% for software baseline), 1.3–2.1× less energy, and 1.2–1.3× lower latency. For ResNet-SSD face detection, 95.7% AP (vs. 97.7%).

The table below summarizes the main efficiency claims (from (Holla et al., 29 Dec 2025)):

Mode          | Power/spin  | Latency gain             | Accuracy (ResNet-20)
Annealing     | 0.00135 mW  | ≈5× (vs. TAXI)           | —
VMM (AI inf.) | —           | 1.2–1.3× (vs. software)  | 89.3%

7. Scalability, Application Domains, and Prospective Enhancements

The macro’s modularity—integrating annealing peripherals as overlays to standard 8T-SRAM cores—enables both stackable sub-array scaling (five per macro, multiple macros per core), and deployment in spatial architectures for massive parallelism. This supports O(10^5)-city TSPs and comparable-sized Max-Cut, SAT, or general QUBO problems.

Target domains include:

  • Combinatorial optimization for logistics, scheduling, EDA, and chip placement
  • Probabilistic computing and hardware-embedded sampling
  • Analog-accelerated AI inference on edge and IoT devices

Proposed directions include:

  • Deeper hardware-software co-design of combinatorial and neural algorithms
  • Full-stack elimination of host CPU dependencies
  • Generalization to other NP-hard problem classes via further Ising model embeddings

A plausible implication is that the combination of hardware stochasticity, in-memory computation, and hierarchical parallelism in LIMO offers an architectural template for energy-efficient, scalable edge optimization and inference—a direction of increasing relevance in decentralized and real-time AI among resource-constrained devices (Holla et al., 29 Dec 2025).
