H2M2: Hardware-based Heterogeneous Memory Management

Updated 21 January 2026
  • H2M2 is a framework that integrates diverse memory types with hardware primitives and device-side logic to optimize allocation and migration.
  • It employs dynamic profiling and detection mechanisms, such as Count-Min Sketches and bitmaps, to guide adaptive data placement.
  • H2M2 enhances performance and energy efficiency in datacenter, embedded, and AI systems with specialized MMUs and migration engines.

Hardware-based Heterogeneous Memory Management (H2M2) encompasses architectural and algorithmic mechanisms that enable memory systems composed of multiple, physically and technologically diverse memory types to be managed, scheduled, and accessed in a transparent and performant manner. H2M2 frameworks span device-side logic, MMU organization, address mapping, OS/hypervisor interfaces, and hardware/firmware for page migration, delivering abstracted, adaptive, and efficient allocation and migration in datacenter, embedded, and accelerator-rich environments. Representative technologies include CXL tiered memory, DRAM/NVM hybrid systems, asymmetric bandwidth/capacity memory for ML inference, and in-memory compute substrates.

1. Architectural Principles and Domain-Specific Designs

H2M2 combines multiple hardware primitives and abstractions to effectively leverage the heterogeneity of modern and emerging memory systems. Architecturally, H2M2 often interposes new hardware below, or alongside, the traditional CPU-controlled MMU, supporting combinations such as capacity/bandwidth-asymmetric DRAM+NVM, compute-attached and off-host accelerator memory, and hierarchical or parallel memory organizations.

Examples include:

  • Device-side HeteroMem logic for CXL-based tiered memory, comprising metadata remap units, hot/cold profiling, and hardware migration engines, fully abstracting heterogeneous regions from the CPU (Chen et al., 26 Feb 2025).
  • Processing-in-memory (PIM) H2M2 architectures with a hardware Data Copy Engine (DCE), memory-mapping units, and PIM-specific scheduling logic for autonomous DRAM↔PIM transfers, eliminating CPU bottlenecks (Lee et al., 2024).
  • Accelerator-attached H2M2 with distributed MMU/TLB components (e.g., LL/2-level TLB + multithreaded PTW) for transparent address translation and efficient data migration in GPU and AI chips (Kim et al., 2017, Hwang et al., 21 Apr 2025).
  • A hardware memory management unit (HMMU) in mobile/embedded hybrid memory platforms, with internal page tables, sub-page "caches," background DMA migration, and adaptive granularity management (Wen et al., 2020).

These designs converge on critical themes: device- or memory-side control; hardware support for hot/cold/tier-aware detection and migration; programmable or dual-mode address space mapping; and operating system decoupling, i.e., full transparency for host software or minimal, policy-driven OS involvement.

2. Hardware Building Blocks and Address Mapping

A central motif in H2M2 is the hardware realization of memory mapping and migration logic. For instance, in HeteroMem, the FPGA-resident abstraction layer holds a page-level remap table, ping-pong bitmaps, and Count-Min Sketches for identifying hot and cold data. It services host requests through a multi-stage pipeline: remap lookup, translation, access, and (if necessary) migration. The migration unit exchanges designated hot/cold page pairs between memory tiers, rate-limited to cap hardware utilization overhead (Chen et al., 26 Feb 2025).
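
As a software analogue of this hotness-tracking hardware, the following is a minimal Count-Min Sketch; the width, depth, and hash function are illustrative choices, not parameters taken from HeteroMem:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch for page-hotness estimation.
    Hardware versions use a few rows of small saturating counters;
    the sizes here are illustrative only."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, page):
        # One hashed column per row; different salts decorrelate rows.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{page}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def touch(self, page):
        for row, col in self._buckets(page):
            self.rows[row][col] += 1

    def estimate(self, page):
        # The minimum over rows bounds the collision-induced overestimate.
        return min(self.rows[row][col] for row, col in self._buckets(page))
```

Because the sketch only ever overestimates, a page whose estimate stays below the hot threshold is guaranteed cold, which is exactly the property a migration trigger needs.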

PIM-MMU/H2M2 provides a data copy engine (DCE) with direct memory access capabilities, supported by a heterogeneity-aware memory mapping unit (HetMap) that selects memory mappings based on region and access type (MLP-centric for DRAM, locality-centric for PIM). The PIM-aware memory scheduler (PIM-MS) interleaves transfers to exploit maximum DRAM parallelism and avoid conflicts, maintaining high throughput without CPU mediation (Lee et al., 2024). In AI inference systems, H2M2 abstractions install MMU+TLB components locally at both HBM and LPDDR chips, allowing dynamic page remapping in support of load-balanced kernel scheduling (Hwang et al., 21 Apr 2025).
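
The channel interleaving such a scheduler exploits can be illustrated with a toy round-robin issue function; real scheduling logic also tracks bank state and conflicts, which this sketch omits:

```python
def interleave_transfers(transfers, num_channels):
    """Spread a list of DRAM<->PIM copy descriptors across memory channels
    round-robin, so consecutive in-flight transfers never target the same
    channel. A toy model of channel-aware issue, not the PIM-MS algorithm."""
    queues = [[] for _ in range(num_channels)]
    for i, transfer in enumerate(transfers):
        queues[i % num_channels].append(transfer)
    return queues
```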

Hybrid memory managers (e.g., HMMU) maintain an internal page table (PTT), metadata on sub-page access, and rely on finite-state logic to determine whether accesses should hit a DRAM cache, trigger on-demand migration, or directly access slower NVM backends (Wen et al., 2020).
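
The per-access decision can be sketched as a small state machine; the threshold value and action names below are hypothetical, not HMMU's actual encoding:

```python
from enum import Enum, auto

class Action(Enum):
    DRAM_HIT = auto()    # serve from the DRAM cache
    MIGRATE = auto()     # promote a hot block on demand
    NVM_DIRECT = auto()  # bypass and access the slow tier directly

def decide_access(in_dram_cache, access_count, migrate_threshold=4):
    """One step of an HMMU-style controller: hit the fast tier if cached,
    promote a block that has crossed a (hypothetical) hotness threshold,
    otherwise serve the request from NVM without migrating."""
    if in_dram_cache:
        return Action.DRAM_HIT
    if access_count >= migrate_threshold:
        return Action.MIGRATE
    return Action.NVM_DIRECT
```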

3. Dynamic Profiling, Hot/Cold Page Identification, and Migration

H2M2 frameworks universally rely on dynamic, run-time profiling to guide migration and data placement. Notable hardware mechanisms include:

  • PEBS-based online access tracking, which samples memory loads at hardware events (e.g., L2 misses), generating per-page access histograms at <10% overhead even at 100K+ core scales, suitable for driving policy decisions for migration between memory types (Nonell et al., 2020).
  • HeteroMem’s hardware profiling with Count-Min Sketches (for hotness) and ping-pong bitmaps (for identifying coldness), triggering pairwise migration with bounded hardware overhead (Chen et al., 26 Feb 2025).
  • In HMMU, sub-page utilization bitmaps and counters adapt the granularity and aggressiveness of migrations, with both full-page and mini-block (128 B) management supported in hardware (Wen et al., 2020).
  • Policy- and access-pattern-aware page allocation, as seen in vertical memory management, uses hardware-exposed address bits for bank/LLC "coloring" and software counters in the allocation path (Liu, 2017).

Profiling and migration actions are either device-initiated or are exposed through a thin runtime interface, often bypassing OS-level page faults entirely. Policies can be tuned dynamically; for example, in LLM inference, migration is re-computed on application-visible events (e.g., batch change), supporting adaptive load balancing in parallel HBM/LPDDR memory mappings (Hwang et al., 21 Apr 2025).
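
The per-page histograms that sampled profiling produces can be modeled directly; the page size and top-k interface here are assumptions for illustration:

```python
from collections import Counter

PAGE_SHIFT = 12  # assume 4 KiB pages

def page_histogram(sampled_addresses):
    """Fold hardware-sampled load addresses (e.g., one sample every N
    misses) into a per-page access histogram, the data structure a
    PEBS-style profiler hands to the placement policy."""
    return Counter(addr >> PAGE_SHIFT for addr in sampled_addresses)

def hottest_pages(histogram, k):
    """Return the k most frequently sampled page numbers."""
    return [page for page, _ in histogram.most_common(k)]
```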

4. Scheduling, Placement Optimization, and Policy Frameworks

H2M2 leverages both heuristic and analytical models to optimize page placement, memory-level parallelism, and migration schedules:

  • PIM-MS’s Algorithm 1 issues DRAM/PIM transfers in a bank- and channel-aware nested loop, maximizing parallelism while guaranteeing mutual exclusion and conflict minimization (Lee et al., 2024).
  • Placement objectives are commonly formalized as constrained optimization problems that balance throughput and energy, e.g., maximizing Σₚ (T_DRAM(p)·xₚ + T_PIM(p)·(1–xₚ)), where xₚ ∈ {0, 1} indicates whether page p is placed in DRAM, subject to capacity constraints (Lee et al., 2024).
  • Accelerator MMUs adapt TLB/PTW allocation to the workload, balancing translation latency, hit rates, and area; two-level TLBs and multi-threaded PTWs are sized according to observed access patterns, with per-domain API support for resource allocation (Kim et al., 2017).
  • Vertical/horizontal partitioning frameworks utilize coloring and dynamic decision trees, classifying workloads by cache affinity or bank-level behavior and switching policies on-the-fly (e.g., A-VP, B-VP, random) (Liu, 2017).
  • In hybrid memories, migration thresholds such as T_block and u_high/u_low are adjusted dynamically to balance migration aggressiveness against overhead, yielding favorable trade-offs among energy, performance, and NVM wear (Wen et al., 2020).
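
With unit page sizes and throughput as the only objective, the placement optimization above has a simple greedy optimum: rank pages by their DRAM benefit and fill the fast tier first. The function below is a sketch under those assumptions; all identifiers are illustrative:

```python
def place_pages(pages, dram_capacity):
    """Maximize sum(T_DRAM(p)*x_p + T_PIM(p)*(1 - x_p)) with at most
    dram_capacity pages in DRAM. pages maps page id -> (t_dram, t_pim),
    throughput in arbitrary units. With equal page sizes this greedy
    choice is exactly optimal."""
    # Default everything to PIM; promote the pages whose DRAM benefit
    # (t_dram - t_pim) is largest and positive.
    by_benefit = sorted(pages.items(),
                        key=lambda kv: kv[1][0] - kv[1][1], reverse=True)
    in_dram = {p for p, (t_dram, t_pim)
               in by_benefit[:dram_capacity] if t_dram > t_pim}
    return {p: ("DRAM" if p in in_dram else "PIM") for p in pages}
```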

5. Performance, Energy, and Scalability Evaluation

H2M2 architectures consistently demonstrate substantial throughput, energy, and QoS benefits relative to software- or CPU-managed baselines:

  • Device-side management in HeteroMem yields a 5.1%–16.2% geomean speedup over host-driven tiering, with <2% added access latency, <5% cycle overhead for migration, and >95% hot page placement within 50ms post-launch (Chen et al., 26 Feb 2025).
  • DRAM↔PIM transfer bandwidth and energy efficiency improve 4.1× (average) and up to 6.9× (peak) over CPU-driven memcpy, leading to end-to-end PrIM speedups of 2.2× (Lee et al., 2024).
  • In LLM inference, asymmetric two-chip H2M2 designs achieve 1.46–2.94× speedup across benchmark LLMs with <5% mapping/migration overhead, using high-bandwidth memory judiciously and approaching 97% of the oracle ideal (Hwang et al., 21 Apr 2025).
  • HMMU hardware mechanisms reduce energy by ~40% vs all-DRAM and cut NVM writes by 20–86% compared to static placement, with only a 12% performance loss relative to an unattainable all-DRAM baseline (Wen et al., 2020).
  • Vertical partitioning policies in x86 DRAM/LLC systems yield up to 11% weighted IPC improvement, particularly for workloads dominated by cache thrashing or memory contention (Liu, 2017).
  • Hardware-based profiling incurs mean overheads of 1–2% (up to 10% worst-case), scaling robustly to over 128K cores with lock-free, per-core design (Nonell et al., 2020).

Hardware cost analysis indicates moderate resource utilization: for example, DCE and scheduling logic constitute ≈0.4% additional CPU die area in PIM-MMU/H2M2 (Lee et al., 2024), and moderate FPGA resource footprints in CXL-based systems (Chen et al., 26 Feb 2025).

6. Limitations, Challenges, and Prospects

H2M2 systems face specific limitations and open challenges:

  • Fixed or coarse-grained split points (e.g., only two distinctly mapped regions between DRAM and PIM); multi-tier or more complex hierarchies require richer mapping mechanisms (Lee et al., 2024).
  • Device-side management typically presumes latency/bandwidth uniformity across banks; heterogeneity (asymmetric banks, CXL cascades, etc.) necessitates per-bank or weighted scheduling (Lee et al., 2024, Chen et al., 26 Feb 2025).
  • OS-level policy is often heuristic or static; integration of online profiling, dynamic runtime feedback, and hardware-accelerated placement accelerators remains an area of active work (Nonell et al., 2020, Lee et al., 2024).
  • The granularity of allocation or migration may be coarse, necessitating deeper buffers or finer-grained metadata to accommodate small page or sub-page operations (Wen et al., 2020, Lee et al., 2024).
  • Scalability to systems with thousands of banks or channels could stress single-DMA or copy engines, motivating distributed or tiled H2M2 logic (Lee et al., 2024).
  • Placement and migration policies must accommodate wear-leveling and endurance for NVM technologies, as well as application phase changes (Wen et al., 2020, Liu, 2017).
  • Many frameworks remain at the prototype or simulation stage; widespread commodity adoption will likely depend on standardized architectural interfaces and further reductions in hardware overhead.

H2M2 continues to evolve with the adoption of CXL-based tiered memory systems, data-centric acceleration (LLM/AI/graph applications), and system architectures where the CPU is no longer the single point of resource orchestration, demanding flexible, scalable, and high-performance hardware-managed memory subsystems.
