Async Memory Unit (AMU) Overview
- Async Memory Unit (AMU) is a hardware accelerator that issues and tracks hundreds of asynchronous memory operations, decoupling request issuance from response handling.
- It integrates with out-of-order processors using dedicated scratchpad memory, coroutine-based frameworks, and new non-blocking instructions to optimize memory-level parallelism.
- Empirical evaluations show that AMUs yield multi-fold speedups and sustain high throughput under high-latency conditions, effectively mitigating pipeline stalls.
An Async Memory Unit (AMU) is a hardware accelerator and supporting ISA/extensions for general-purpose out-of-order (OoO) processors, specifically targeting the latency, bandwidth, and parallelism challenges of far-memory technologies such as disaggregated DRAM and non-volatile main memory. Differing fundamentally from traditional blocking load/store pipelines, the AMU enables the processor to issue, track, and retire hundreds of asynchronous memory operations in parallel, decoupling request issuance from response handling. This architecture is usually complemented by dedicated scratchpad memory (SPM) and orchestrated by coroutine-based programming frameworks, thus enabling efficient latency hiding and scaling of memory-level parallelism (MLP) for memory-bound workloads (Jiang et al., 19 Nov 2025, Wang et al., 2024, Wang et al., 2021).
1. Motivation and System-Level Rationale
Far memory systems—DRAM/NVM located remotely or accessed through high-latency interconnects (e.g., CXL, Gen-Z)—present latencies in the hundreds of nanoseconds to several microseconds, with substantial variability (on the order of microseconds). In conventional OoO cores, blocking load/store instructions tie up critical resources such as ROB entries and MSHRs. As far-memory latencies increase, effective MLP saturates quickly (typically plateauing at 20–64, the number of MSHRs), leading to pipeline stalls and IPC collapse. AMU architectures are developed to overcome these bottlenecks: they allow asynchronous requests to be issued and tracked independently, retiring instructions immediately and freeing up core resources for continued execution. Using this approach, AMUs have demonstrated sustained MLP far exceeding legacy limits (e.g., 130 outstanding requests at 5 μs latency) and have achieved multi-fold speedups on representative workloads (Wang et al., 2024, Wang et al., 2021).
2. High-Level AMU Architecture and Microarchitecture
All modern AMU implementations share these architectural principles:
- Positioning: AMU is integrated alongside L1/L2 cache hierarchies, typically repurposing a segment of L2 cache as SPM for buffering data/metadata (Jiang et al., 19 Nov 2025, Wang et al., 2024).
- Hardware Structures:
- Request Table (RT)/ARQ/AMART: Hardware-maintained tables resembling MSHRs, where each entry tracks request ID, addresses, state, and associated coroutine metadata.
- Finished Queue (FQ): FIFO structure for completed request IDs.
- Metadata/ID Lists: Free-ID and finished-ID structures for flow control.
- Bafin Prediction Table (BPT)/BTQ: In coroutine-centric AMUs, these store resume targets for memory-guided branch prediction.
- Pipeline Integration: AMU instructions are decoded by a dedicated asynchronous load/store unit (ALSU) and interact with an ASMC (controller) handling SPM, request splitting, and response management (Wang et al., 2024).
The following table summarizes key architectural elements across major proposals:
| Paper | SPM Location | Outstanding Requests | Special Features |
|---|---|---|---|
| (Jiang et al., 19 Nov 2025) (CoroAMU) | Partial L2 cache | 16–32 RT entries | Coroutine-specific branch prediction, aggregated requests |
| (Wang et al., 2024) | Private L2 cache | 130 | Vector-batched ID management, speculative execution |
| (Wang et al., 2021) | On-chip SPM | 64 ARQ entries | Minimal ISA, fence_async for ordering |
3. Instruction Set Extensions and Software Interfaces
AMUs introduce new non-blocking instructions in the ISA:
- aload/async_load: Issues a read request from far memory into SPM. Returns an ID for completion tracking.
- astore/async_store: Issues a write to far memory from SPM.
- getfin: Polls for a finished request; delivers its ID.
- aset, aconfig, await, asignal: Higher-level primitives in coroutine frameworks to aggregate requests, configure handler address space, and synchronize (Jiang et al., 19 Nov 2025).
- cfgrw/cfgrr: Configures granularity, SPM queue base, queue length (Wang et al., 2024).
The software interface is further enhanced by compiler passes (LLVM AsyncMarkPass, AsyncSplitPass) and coroutine-based constructs (co_await aload(), co_await astore()), which annotate loops and asynchronous accesses, easing programmer burden and ensuring correctness in the presence of hundreds of concurrent requests (Jiang et al., 19 Nov 2025, Wang et al., 2024). Software-level memory disambiguation (hash-table tracking of in-flight accesses) is used to avoid SPM aliasing conflicts (Wang et al., 2024).
4. Decoupled Operation Semantics and Coroutine Integration
AMUs enforce a strict decoupling of request issuance and response handling:
- Issuance: RT/ARQ entries are allocated for requests, tagged with IDs, and dispatched immediately. The issuing instruction retires from the pipeline without waiting for data.
- Completion: Responses are tracked by decrementing counters in the request entry. Once a complete group (ID) returns, it is enqueued in FQ/Finished-ID list.
- Polling/Branching: Software or hardware polls for completed IDs (getfin) or jumps to coroutine resume points (bafin) using metadata encoded during issuance (Jiang et al., 19 Nov 2025).
- Context Minimization and Aggregation: Compiler passes minimize coroutine context, aggregate loads/stores, and exploit locality within request subgroups.
For coroutines, AMU integration supports memory-guided branch prediction where each suspension point encodes the resume-PC in the request, and on completion, the AMU triggers a zero-bubble indirect jump to correctly resume the corresponding coroutine (Jiang et al., 19 Nov 2025). In high-level C++ frameworks, this is abstracted via event loops managing task scheduling and completion (Wang et al., 2024).
5. Performance Modeling and Mathematical Analysis
AMU performance is quantitatively modeled by overlap- and queue-based models:
- Latency-Hiding Model (Jiang et al., 19 Nov 2025): the stall time exposed per access is $T_{\text{stall}} = (1-\beta)\,\bar{T} + C$, where $\bar{T}$ is average memory latency, $C$ is context switch cost, and $\beta$ is the fraction of memory latency overlapped by parallel coroutines ($0 \le \beta \le 1$).
- MLP Requirement Model (Wang et al., 2024): a conventional OoO core can expose at most $\text{MLP}_{\max} = f \cdot W$ outstanding accesses, where $L$ is far memory latency, $W$ is the instruction window (ROB size), and $f$ is the memory op fraction; since $W$ is fixed, achievable MLP falls far short of the $\text{MLP} \propto L$ required to hide growing far-memory latency.
- Speedup Scaling (Wang et al., 2024): for $n$ in-flight requests and per-request overhead $o$, $S(n) \approx \dfrac{L}{L/n + o}$, which approaches $n$ when $L \gg n\,o$.
- Queueing Model for Buffer Sizing (Wang et al., 2021): by Little's law, the asynchronous request queue depth is $Q = \lambda L$ for arrival rate $\lambda$ and latency $L$, stable for utilization $\rho = \lambda/\mu < 1$ (service rate $\mu$).
The key principle is maximizing outstanding requests to approach linear speedup proportional to the number of in-flight requests at high latencies, provided SPM and hardware resource sizes are provisioned appropriately.
6. Empirical Evaluation and Benchmark Results
Evaluation of AMUs spans synthetic and real-world memory-bound benchmarks (GUPS, BS, HJ, STREAM, BFS, mcf, lbm, IS, Redis/YCSB) under cycle-accurate simulation and FPGA prototyping:
- CoroAMU (Jiang et al., 19 Nov 2025):
- At 200 ns latency: 3.39× average speedup (up to 29.0× for GUPS)
- At 800 ns latency: 4.87× average speedup (up to 59.8× for GUPS)
- Overhead for scheduling and instruction management reduced to ~3.91×
- Sustained MLP ≈ 64 in latency-bound apps
- AMI/AMU (Wang et al., 2024):
- At 1 μs: 2.42× geometric mean speedup; GUPS achieves 26.86× at 5 μs and ~130 outstanding requests
- Baseline IPC collapses to 0.2; AMU IPC remains >1.5
- SPM >32 KB yields diminishing returns; disabling speculative ID batching reduces speedup by 20 %
- Dynamic power +10 %, but overall energy reduced by 10 % due to shortened execution
- Classic AMU (Wang et al., 2021):
- Streaming kernels up to 2.3× IPC improvement; memory-stall cycles reduced by 40–75 %
- Area/power overhead modest: ~2 % area, ~3 % power increase
These results demonstrate that AMU architectures substantially raise MLP, sustain high throughput, and mitigate performance collapse in far-memory applications.
7. Trade-offs, Limitations, and Prospects
Key trade-offs include hardware overhead (additional logic for RT/ARQ/BPT/queue structures, modest SPM partitioning) and the complexity of compiler/runtime code generation. Some designs minimize context-switch cost via LLVM passes, while others aggregate requests to reduce scheduling frequency (Jiang et al., 19 Nov 2025). Software-layer solutions (disambiguation tables, event loops) are necessary due to the lack of CAM-based LSQ in SPM (Wang et al., 2024).
Limitations arise for pointer-heavy workloads with low spatial locality (limited aggregation potential, lower achievable overlap fraction), memory-controller bandwidth bottlenecks in multi-core environments, and software overheads for SPM allocation and conflict avoidance. Open directions include dynamic granularity tuning, richer message-based memory interfaces, and hierarchical AMUs for multi-level memory systems (Wang et al., 2021).
A plausible implication is that as memory systems become increasingly heterogeneous and remote, AMU-like designs will be essential for practical scaling of general-purpose processors in data-intensive contexts, especially where coroutine-based asynchronous execution is applicable.
References:
- CoroAMU’s coroutine-driven AMU architecture (Jiang et al., 19 Nov 2025)
- AMI instruction set and massive parallelism via SPM (Wang et al., 2024)
- Foundational AMU microarchitecture in general-purpose cores (Wang et al., 2021)