Async Memory Unit (AMU) Overview

Updated 20 January 2026
  • Async Memory Unit (AMU) is a hardware accelerator that issues and tracks hundreds of asynchronous memory operations, decoupling request issuance from response handling.
  • It integrates with out-of-order processors using dedicated scratchpad memory, coroutine-based frameworks, and new non-blocking instructions to optimize memory-level parallelism.
  • Empirical evaluations show that AMUs yield multi-fold speedups and sustain high throughput under high-latency conditions, effectively mitigating pipeline stalls.

An Async Memory Unit (AMU) is a hardware accelerator with supporting ISA extensions for general-purpose out-of-order (OoO) processors, targeting the latency, bandwidth, and parallelism challenges of far-memory technologies such as disaggregated DRAM and non-volatile main memory. Unlike traditional blocking load/store pipelines, the AMU lets the processor issue, track, and retire hundreds of asynchronous memory operations in parallel, decoupling request issuance from response handling. The architecture is usually complemented by dedicated scratchpad memory (SPM) and orchestrated by coroutine-based programming frameworks, which hide latency and scale memory-level parallelism (MLP) for memory-bound workloads (Jiang et al., 19 Nov 2025, Wang et al., 2024, Wang et al., 2021).

1. Motivation and System-Level Rationale

Far-memory systems, in which DRAM/NVM is located remotely or accessed through high-latency interconnects (e.g., CXL, Gen-Z), present latencies from hundreds of nanoseconds to several microseconds, with substantial variability ($\sigma$ on the order of microseconds). In conventional OoO cores, blocking load/store instructions tie up critical resources such as ROB entries and MSHRs. As far-memory latencies increase, effective MLP saturates quickly (typically plateauing at 20–64, the number of MSHRs), leading to pipeline stalls and IPC collapse. AMU architectures are designed to overcome these bottlenecks: asynchronous requests are issued and tracked independently, so the issuing instructions retire immediately and free core resources for continued execution. With this approach, AMUs have demonstrated sustained MLP far exceeding legacy limits (e.g., ~130 outstanding requests at 5 μs latency) and have achieved multi-fold speedups on representative workloads (Wang et al., 2024, Wang et al., 2021).
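The MLP ceiling described above can be made concrete with Little's law, which bounds sustained bandwidth by the number of bytes in flight divided by latency. The sketch below is illustrative only: the 64-byte request size and the function names are assumptions, not values taken from the cited papers.

```python
import math

def max_bandwidth_gb_per_s(outstanding, latency_ns, line_bytes=64):
    """Little's law bound: bytes in flight / latency (bytes per ns equals GB/s).

    line_bytes=64 is an assumed request size, not a figure from the source.
    """
    return outstanding * line_bytes / latency_ns

def required_outstanding(target_gb_per_s, latency_ns, line_bytes=64):
    """Requests that must be in flight to sustain a target bandwidth."""
    return math.ceil(target_gb_per_s * latency_ns / line_bytes)

# Compare an MSHR-limited core (20 or 64 outstanding) with an AMU-style ~130.
for outstanding in (20, 64, 130):
    for latency_ns in (1000, 5000):
        bw = max_bandwidth_gb_per_s(outstanding, latency_ns)
        print(f"{outstanding:4d} outstanding @ {latency_ns} ns -> {bw:.3f} GB/s")
```

Under these assumptions the bound scales linearly with the number of outstanding requests, which is why raising MLP beyond the MSHR count matters more as latency grows.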

2. High-Level AMU Architecture and Microarchitecture

The AMU proposals surveyed here share these architectural principles:

  • Positioning: AMU is integrated alongside L1/L2 cache hierarchies, typically repurposing a segment of L2 cache as SPM for buffering data/metadata (Jiang et al., 19 Nov 2025, Wang et al., 2024).
  • Hardware Structures:
    • Request Table (RT)/ARQ/AMART: Hardware-maintained tables resembling MSHRs, where each entry tracks request ID, addresses, state, and associated coroutine metadata.
    • Finished Queue (FQ): FIFO structure for completed request IDs.
    • Metadata/ID Lists: Free-ID and finished-ID structures for flow control.
    • Bafin Prediction Table (BPT)/BTQ: In coroutine-centric AMUs, these store resume targets for memory-guided branch prediction.
  • Pipeline Integration: AMU instructions are decoded by a dedicated asynchronous load/store unit (ALSU) and interact with an ASMC (controller) handling SPM, request splitting, and response management (Wang et al., 2024).
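As a rough software model of the structures listed above, the following sketch tracks a free-ID list, a request table whose entries count pending sub-requests, and a FIFO finished queue. All class, method, and field names are hypothetical, and the 64-byte split granularity is an assumption; real AMU hardware implements these structures in logic, not in software.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class RequestEntry:
    """One request-table entry (field names are illustrative)."""
    addr: int         # far-memory address
    spm_offset: int   # destination offset in the scratchpad
    pending: int      # sub-requests still outstanding
    resume_pc: int    # coroutine metadata, e.g. a resume target

class AmuModel:
    def __init__(self, num_entries=32):
        self.free_ids = deque(range(num_entries))  # free-ID list
        self.table = {}                            # request table: ID -> entry
        self.finished = deque()                    # finished queue (FIFO)

    def aload(self, addr, spm_offset, size, granule=64, resume_pc=0):
        """Allocate an entry and return its ID, or None if the table is full."""
        if not self.free_ids:
            return None
        req_id = self.free_ids.popleft()
        parts = -(-size // granule)  # ceiling division: number of sub-requests
        self.table[req_id] = RequestEntry(addr, spm_offset, parts, resume_pc)
        return req_id

    def on_response(self, req_id):
        """One sub-request returned; enqueue the ID once all parts are back."""
        entry = self.table[req_id]
        entry.pending -= 1
        if entry.pending == 0:
            self.finished.append(req_id)

    def getfin(self):
        """Pop one finished ID and recycle it; None if nothing has finished."""
        if not self.finished:
            return None
        req_id = self.finished.popleft()
        del self.table[req_id]
        self.free_ids.append(req_id)
        return req_id

amu = AmuModel(num_entries=2)
first = amu.aload(addr=0x1000, spm_offset=0, size=128)    # two sub-requests
second = amu.aload(addr=0x2000, spm_offset=128, size=64)  # one sub-request
amu.on_response(second)
print(amu.getfin() == second)  # the single-part request completes first
```

Returning `None` when no free ID is available stands in for hardware back-pressure; how a real design stalls or retries in that case is not specified here.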

The following table summarizes key architectural elements across major proposals:

| Paper | SPM Location | Outstanding Requests | Special Features |
|---|---|---|---|
| (Jiang et al., 19 Nov 2025) (CoroAMU) | Partial L2 cache | 16–32 RT entries | Coroutine-specific branch prediction, aggregated requests |
| (Wang et al., 2024) | Private L2 cache | ~130 | Vector-batched ID management, speculative execution |
| (Wang et al., 2021) | On-chip SPM | 64 ARQ entries | Minimal ISA, fence_async for ordering |

3. Instruction Set Extensions and Software Interfaces

AMUs introduce new non-blocking instructions in the ISA:

  • aload/async_load: Issues a read request from far memory into SPM. Returns an ID for completion tracking.
  • astore/async_store: Issues a write to far memory from SPM.
  • getfin: Polls for a finished request; delivers its ID.
  • aset, aconfig, await, asignal: Higher-level primitives in coroutine frameworks to aggregate requests, configure handler address space, and synchronize (Jiang et al., 19 Nov 2025).
  • cfgrw/cfgrr: Configures granularity, SPM queue base, queue length (Wang et al., 2024).

The software interface is further enhanced by compiler passes (LLVM AsyncMarkPass, AsyncSplitPass) and coroutine-based constructs (co_await aload(), co_await astore()), which annotate loops and asynchronous accesses, easing programmer burden and ensuring correctness in the presence of hundreds of concurrent requests (Jiang et al., 19 Nov 2025, Wang et al., 2024). Software-level memory disambiguation (hash-table tracking of in-flight accesses) is used to avoid SPM aliasing conflicts (Wang et al., 2024).
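The hash-table disambiguation mentioned above can be sketched as a map from aligned block addresses to in-flight request IDs. This is a minimal illustration under assumptions: the class name, the 64-byte tracking granularity, and the policy of returning the conflicting ID are all hypothetical and are not taken from the cited work.

```python
class InflightTracker:
    """Hash-table tracking of in-flight accesses at a fixed granularity."""

    def __init__(self, granule=64):
        self.granule = granule   # assumed tracking granularity in bytes
        self.inflight = {}       # aligned block index -> request ID

    def _blocks(self, addr, size):
        first = addr // self.granule
        last = (addr + size - 1) // self.granule
        return range(first, last + 1)

    def try_register(self, req_id, addr, size):
        """Record the access; return the conflicting request ID if any block overlaps."""
        blocks = list(self._blocks(addr, size))
        for block in blocks:
            if block in self.inflight:
                return self.inflight[block]
        for block in blocks:
            self.inflight[block] = req_id
        return None

    def release(self, req_id):
        """Drop all blocks owned by a completed request."""
        self.inflight = {b: r for b, r in self.inflight.items() if r != req_id}

tracker = InflightTracker()
tracker.try_register(req_id=1, addr=0x1000, size=64)
conflict = tracker.try_register(req_id=2, addr=0x1020, size=8)
print("conflicts with request", conflict)
```

A caller that sees a non-`None` result would wait for the named request to finish before issuing, which trades some parallelism for correctness when two accesses alias.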

4. Decoupled Operation Semantics and Coroutine Integration

AMUs enforce a strict decoupling of request issuance and response handling:

  • Issuance: RT/ARQ entries are allocated for requests, tagged with IDs, and dispatched immediately. The issuing instruction retires from the pipeline without waiting for data.
  • Completion: Responses are tracked by decrementing counters in the request entry. Once a complete group (ID) returns, it is enqueued in FQ/Finished-ID list.
  • Polling/Branching: Software or hardware polls for completed IDs (getfin) or jumps to coroutine resume points (bafin) using metadata encoded during issuance (Jiang et al., 19 Nov 2025).
  • Context Minimization and Aggregation: Compiler passes minimize coroutine context, aggregate loads/stores, and exploit locality within request subgroups.

For coroutines, AMU integration supports memory-guided branch prediction where each suspension point encodes the resume-PC in the request, and on completion, the AMU triggers a zero-bubble indirect jump to correctly resume the corresponding coroutine (Jiang et al., 19 Nov 2025). In high-level C++ frameworks, this is abstracted via event loops managing task scheduling and completion (Wang et al., 2024).
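The issue, complete, and resume cycle can be imitated with Python generators standing in for coroutines and a timestamp heap standing in for far memory. This is a toy simulation under assumed parameters (fixed latency, fixed per-issue cost, fake data values); `gups_task` and `run_event_loop` are hypothetical names, and the real frameworks use C++ coroutines with hardware-assisted resume.

```python
import heapq

def gups_task(task_id, addresses, log):
    """A coroutine: each `yield addr` models `co_await aload(addr)`."""
    for addr in addresses:
        data = yield addr            # suspend until the load completes
        log.append((task_id, addr, data))

def run_event_loop(tasks, latency_ns=1000, issue_cost_ns=10):
    """Drive all tasks to completion; return the simulated time in ns."""
    now = 0
    next_id = 0
    in_flight = []   # min-heap of (finish_time, req_id, task, addr)

    # Issue the first request of every task without waiting for responses.
    for task in tasks:
        try:
            addr = next(task)
        except StopIteration:
            continue
        now += issue_cost_ns
        heapq.heappush(in_flight, (now + latency_ns, next_id, task, addr))
        next_id += 1

    # Poll for the earliest completion (a getfin analogue) and resume its task.
    while in_flight:
        finish, _, task, addr = heapq.heappop(in_flight)
        now = max(now, finish)
        try:
            nxt = task.send(addr * 2)   # resume with placeholder data
        except StopIteration:
            continue
        now += issue_cost_ns
        heapq.heappush(in_flight, (now + latency_ns, next_id, task, nxt))
        next_id += 1
    return now

log = []
parallel = run_event_loop([gups_task(i, [i * 16, i * 16 + 8], log) for i in range(4)])
serial = run_event_loop([gups_task(0, list(range(8)), [])])
print(f"4 coroutines: {parallel} ns, 1 coroutine: {serial} ns")
```

Both runs perform eight loads; the four-coroutine run overlaps their latencies, while the single-coroutine run pays each latency in full.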

5. Performance Modeling and Mathematical Analysis

AMU performance is analyzed quantitatively with overlap-based and queueing-based formulations:

  • Latency-Hiding Model (Jiang et al., 19 Nov 2025):

    $$L_\mathrm{eff} = L \times (1 - \alpha) + C$$

    where $L$ is the average memory latency, $C$ is the context-switch cost, and $\alpha$ is the fraction of memory latency overlapped by parallel coroutines ($0 \leq \alpha \leq 1$).

  • MLP Requirement Model (Wang et al., 2024):

    $$M \geq \frac{L_f}{W \times (1 - f_m)}$$

    where $L_f$ is the far-memory latency, $W$ is the instruction window (ROB size), and $f_m$ is the fraction of memory operations.

  • Speedup Scaling (Wang et al., 2024): For $M$ in-flight requests and per-request overhead $t_o$,

    $$S(M, L_f) \approx \frac{M}{1 + M t_o / L_f}$$

  • Queueing Model for Buffer Sizing (Wang et al., 2021): The asynchronous request queue depth $K$ is chosen so that $P_{reject} < 10^{-6}$ for $K = 64$, with utilization $\rho = \lambda/\mu$.

The key principle is maximizing outstanding requests $M$ to approach linear speedup proportional to $M$ at high latencies, provided SPM and hardware resource sizes are provisioned appropriately.
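To see how the first three formulas behave, the sketch below evaluates them for example inputs. The numeric values are illustrative assumptions, not measurements from the cited papers, and treating $L_f$ in cycles for the MLP requirement is also an assumption because the units are not stated here.

```python
import math

def effective_latency(L, alpha, C):
    """L_eff = L * (1 - alpha) + C."""
    return L * (1 - alpha) + C

def required_mlp(L_f, W, f_m):
    """Smallest integer M with M >= L_f / (W * (1 - f_m))."""
    return math.ceil(L_f / (W * (1 - f_m)))

def speedup(M, L_f, t_o):
    """S(M, L_f) ~= M / (1 + M * t_o / L_f)."""
    return M / (1 + M * t_o / L_f)

# Example inputs (assumed): 800 ns latency, 90% overlap, 20 ns switch cost.
print(effective_latency(L=800, alpha=0.9, C=20))

# Example inputs (assumed): 3000-cycle latency, 256-entry ROB, 30% memory ops.
print(required_mlp(L_f=3000, W=256, f_m=0.3))

# Speedup grows with M but saturates at L_f / t_o as M becomes large.
for M in (16, 64, 130, 1024):
    print(M, round(speedup(M, L_f=5000, t_o=20), 2))
```

The speedup expression is always below $M$ and approaches $L_f / t_o$ as $M$ grows, so per-request overhead $t_o$ sets the ceiling once enough requests are in flight.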

6. Empirical Evaluation and Benchmark Results

Evaluation of AMUs spans synthetic and real-world memory-bound benchmarks (GUPS, BS, HJ, STREAM, BFS, mcf, lbm, IS, Redis/YCSB) under cycle-accurate simulation and FPGA prototyping:

  • CoroAMU (Jiang et al., 19 Nov 2025):
    • At 200 ns latency: 3.39× average speedup (up to 29.0× for GUPS)
    • At 800 ns latency: 4.87× average speedup (up to 59.8× for GUPS)
    • Overhead for scheduling and instruction management reduced to ~3.91×
    • Sustained MLP ≈ 64 in latency-bound apps
  • AMI/AMU (Wang et al., 2024):
    • At 1 μs: 2.42× geometric mean speedup; GUPS achieves 26.86× at 5 μs and ~130 outstanding requests
    • Baseline IPC collapses to 0.2; AMU IPC remains >1.5
    • SPM >32 KB yields diminishing returns; disabling speculative ID batching reduces speedup by 20 %
    • Dynamic power +10 %, but overall energy reduced by 10 % due to shortened execution
  • Classic AMU (Wang et al., 2021):
    • Streaming kernels up to 2.3× IPC improvement; memory-stall cycles reduced by 40–75 %
    • Area/power overhead modest: ~2 % area, ~3 % power increase

These results demonstrate that AMU architectures substantially raise MLP, sustain high throughput, and mitigate performance collapse in far-memory applications.

7. Trade-offs, Limitations, and Prospects

Key trade-offs include hardware overhead (additional logic for RT/ARQ/BPT/queue structures, modest SPM partitioning) and the complexity of compiler/runtime code generation. Some designs minimize context-switch cost via LLVM passes, while others aggregate requests to reduce scheduling frequency (Jiang et al., 19 Nov 2025). Software-layer solutions (disambiguation tables, event loops) are necessary due to the lack of CAM-based LSQ in SPM (Wang et al., 2024).

Limitations arise for pointer-heavy workloads with low spatial locality (limited aggregation potential, lower $M$), MC bandwidth bottlenecks in multi-core environments, and overheads in software for SPM allocation and conflict avoidance. Open directions include dynamic granularity tuning, richer message-based memory interfaces, and hierarchical AMUs for multi-level memory systems (Wang et al., 2021).

A plausible implication is that as memory systems become increasingly heterogeneous and remote, AMU-like designs will be essential for practical scaling of general-purpose processors in data-intensive contexts, especially where coroutine-based asynchronous execution is applicable.

