Horizon-LM: Memory-Centric LLM Training
- Horizon-LM is a memory-centric architecture that redistributes persistent and transient state between CPU and GPU to optimize large language model training.
- It uses a CPU-master ∥ GPU-template paradigm with asynchronous parameter streaming over dedicated CUDA streams, eliminating persistent per-layer GPU state and ensuring strictly linear host-memory growth.
- Horizon-LM enables efficient single-node training of LLMs up to 120B parameters, outperforming systems like ZeRO-3 with predictable memory usage and superior TFLOPS performance.
Horizon-LM is a memory-centric system architecture for large-language-model (LLM) training that redefines the distribution of persistent and transient state between CPUs and GPUs. Unlike conventional GPU-centric paradigms—which retain complete model replicas and autograd graphs within GPU memory—Horizon-LM treats host memory (RAM) as the authoritative parameter store, relegating the GPU to a stateless compute engine. This design enables single-node training of LLMs with parameter scales up to 120B on commodity hardware by bounding GPU memory usage to the per-layer (rather than total-model) footprint and by ensuring predictable, strictly linear memory growth in host RAM (Yuan et al., 4 Feb 2026).
1. System Architecture and Execution Model
At the core of Horizon-LM is the CPU-master ∥ GPU-template paradigm. All persistent state—model parameters (BF16), gradients (BF16), and Adam optimizer moments $m$, $v$ (FP32)—resides in host RAM within a layer-contiguous “flat-tensor” layout. The GPU maintains only two fixed-size staging buffers used for parameter streaming and a small workspace for activations and checkpoints.
The training loop is orchestrated by a CPU scheduler using four core primitives:
| Primitive | Operation | Data Motion |
|---|---|---|
| StreamIn(i) | Asynchronous DMA transfer of layer $i$'s parameters to the GPU | Host → GPU |
| Bind(i) | Attach layer $i$'s parameters to the reusable operator template | Pointer update |
| Compute(i) | Execute forward or local backward on layer $i$ | GPU compute only |
| Offload(i) | Asynchronous DMA of layer $i$'s gradients back to host RAM | GPU → Host |
Persistent per-layer GPU state is eliminated; no global autograd graph is stored on device. Activations are checkpointed every $k$ layers or recomputed as needed. The forward pass streams parameters layerwise, releasing buffers after compute; the backward pass proceeds in blockwise chunks, loading checkpoints and explicitly recomputing intermediate activations as required.
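The layerwise schedule above can be sketched in pure Python. The primitive names follow the table; the bodies are logging stand-ins (not the real CUDA calls), and the block structure assumes one checkpoint per $k$-layer block:

```python
# Minimal sketch of the layerwise streaming schedule. The primitives here
# only record the schedule; real StreamIn/Offload would issue async DMA.

def train_step(layers, k=4):
    """One forward/backward pass over `layers`, checkpointing every k layers."""
    log = []  # record the schedule for inspection

    def StreamIn(i):  log.append(f"H2D:{i}")     # async DMA of layer i's params
    def Bind(i):      log.append(f"bind:{i}")    # pointer swap into template
    def Compute(i, phase): log.append(f"{phase}:{i}")
    def Offload(i):   log.append(f"D2H:{i}")     # async DMA of grads to host

    n = len(layers)
    checkpoints = range(0, n, k)

    # Forward: stream each layer in, compute, release its buffer.
    for i in range(n):
        StreamIn(i); Bind(i); Compute(i, "fwd")

    # Backward: walk checkpointed blocks in reverse, recomputing inside each.
    for start in sorted(checkpoints, reverse=True):
        block = range(start, min(start + k, n))
        for i in block:                       # recompute activations in the block
            StreamIn(i); Bind(i); Compute(i, "recompute")
        for i in reversed(block):             # local backward + gradient offload
            Compute(i, "bwd"); Offload(i)
    return log

schedule = train_step(list(range(6)), k=3)
```

Note that each layer's parameters are streamed twice (forward and recompute), while gradients leave the device exactly once per layer.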
A pipelined, double-buffered execution engine leverages three dedicated CUDA streams:
- Compute (forward, recompute, local backward)
- H2D (asynchronous parameter prefetch)
- D2H (background gradient offload)
Buffer state is managed by lightweight CUDA event handshakes, ensuring that parameter streaming and gradient offload are overlapped with compute to maximize device utilization.
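The event handshake between the H2D and compute streams can be illustrated with a pure-Python simulation, using `threading.Event` as a stand-in for CUDA events and two rotating staging slots (a sketch of the handshake logic only, not of real stream semantics):

```python
import threading

# Two fixed staging buffers: the H2D "stream" fills slot (i % 2) while the
# compute "stream" consumes the other. threading.Event stands in for the
# cudaEvent handshakes that guard buffer reuse.

def run_pipeline(n_layers):
    ready = [threading.Event() for _ in range(n_layers)]  # H2D done for layer i
    freed = [threading.Event() for _ in range(n_layers)]  # staging slot released
    order, lock = [], threading.Lock()

    def h2d():
        for i in range(n_layers):
            if i >= 2:
                freed[i - 2].wait()            # slot's previous occupant must be done
            with lock: order.append(("prefetch", i))
            ready[i].set()

    def compute():
        for i in range(n_layers):
            ready[i].wait()                    # layer i must be resident first
            with lock: order.append(("compute", i))
            freed[i].set()                     # release the staging slot

    threads = [threading.Thread(target=h2d), threading.Thread(target=compute)]
    for t in threads: t.start()
    for t in threads: t.join()
    return order

trace = run_pipeline(4)
```

The invariants enforced by the events are exactly the double-buffering contract: layer $i$ computes only after its prefetch, and prefetch of layer $i{+}2$ waits until layer $i$'s slot is freed.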
2. Memory Footprint Characterization
Host memory usage in Horizon-LM is bounded by the total parameter, gradient, and optimizer-state footprint:

$$M_{\text{host}}^{\text{ideal}} = \underbrace{2P}_{\text{BF16 params}} + \underbrace{2P}_{\text{BF16 grads}} + \underbrace{8P}_{\text{FP32 } m,\,v} = 12P \ \text{bytes},$$

with a practical implementation overhead for slab management and page-pinning:

$$M_{\text{host}} \approx 12P + \varepsilon + P_{\max},$$

where $P$ is the total parameter count (in elements), $\varepsilon$ is the overhead for pinned slabs, and $P_{\max}$ is the largest per-layer parameter allocation.
GPU memory consumption is strictly bounded by:

$$M_{\text{GPU}} \le B \cdot P_{\max}^{\text{layer}} + c + k \cdot A_{\max} + W,$$

where $B = 2$ (double buffering), $c$ is a small constant, $k$ is the checkpoint interval, $A_{\max}$ denotes the maximal per-layer activation footprint, and $W$ is miscellaneous workspace. Notably, this bound is depth-independent; model depth only proportionally impacts the host memory footprint.
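The accounting above can be turned into a quick estimator. This sketch assumes the 12-bytes-per-parameter breakdown (2 B BF16 params, 2 B BF16 grads, 8 B FP32 Adam moments); slab overhead, constants, and workspace are illustrative inputs, not values from the paper:

```python
def host_bytes(P, eps=0.0, p_max_bytes=0.0):
    """Host footprint: 2B params + 2B grads + 8B Adam moments per element,
    plus pinned-slab overhead eps and the largest per-layer allocation."""
    return 12 * P + eps + p_max_bytes

def gpu_bytes(p_max_bytes, a_max_bytes, k, c=0.0, workspace=0.0):
    """Depth-independent GPU bound: double-buffered parameter staging plus
    k checkpointed activations, a small constant, and misc. workspace."""
    return 2 * p_max_bytes + k * a_max_bytes + c + workspace

# 120B parameters -> 1.44e12 bytes (~1.44 TB) of host RAM, matching the
# reported single-H200 figure before slab overhead.
host_tb = host_bytes(120e9) / 1e12
```

Because `gpu_bytes` never sees the layer count, deepening the model leaves the device bound untouched; only `host_bytes` grows, and it grows linearly in $P$.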
In contrast, offloading systems such as ZeRO-3 must repeatedly assemble large parameter shards and replicate optimizer state within host RAM, resulting in superlinear host usage under fragmentation that substantially exceeds the $12P$ lower bound.
3. Empirical Performance and Scaling Results
Horizon-LM demonstrates its efficacy across diverse single-node configurations. On a single NVIDIA H200 (1.5 TB host RAM), it trains 120B-parameter models at sustained 250–270 TFLOPS without exhausting device memory. Competing systems (ZeRO-3, native PyTorch) are unable to proceed beyond 32B parameters. Host memory usage remains strictly linear—approximately 1.44 TB at 120B—whereas offloading baselines exceed 2.5 TB.
On a standard A100 (80 GB PCIe Gen4, 600 GB RAM):
| Model Scale | Horizon-LM (TFLOPS) | Gemini (TFLOPS) | ZeRO-3 (TFLOPS) | Relative Speedup (Horizon-LM vs. ZeRO-3) |
|---|---|---|---|---|
| 7B | 128 | 53 | 36 | 3.6× |
| 14B | 122 | 15 | 10 | 12.2× |
| 32B | 114 | OOM | OOM | — |
Numerical correctness is maintained: MetaMathQA evaluation for both 7B and 14B models shows exact-match accuracies of 89% and 92.5%, marginally exceeding offloading baselines. Scaling in width (up to 5×) and depth (up to 6×) results in no more than 35% throughput loss, a range where baselines fail due to memory exhaustion.
4. Implementation Details and Bottlenecks
Horizon-LM’s host RAM is managed via layer-contiguous tiling: per-layer parameter triplets are packed into page-aligned memory blocks. Only two page-locked slabs are allocated, so pinned memory cannot grow with model size. Back-pressure from the gradient slabs (with a configurable default depth) bounds in-flight memory use.
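One possible realization of the layer-contiguous layout is sketched below. The field set (BF16 params/grads, FP32 Adam moments $m$, $v$), field order, and 4096-byte page alignment are assumptions for illustration; the paper's exact packing is not reproduced here:

```python
PAGE = 4096  # assumed page size for alignment of pinned slabs

def align(n, a=PAGE):
    """Round n up to the next multiple of a."""
    return (n + a - 1) // a * a

def flat_layout(layer_sizes):
    """Return ({layer: {field: (offset, nbytes)}}, total_bytes) for a
    layer-contiguous pack of BF16 params (2B), BF16 grads (2B), and
    FP32 Adam moments m, v (4B each); each layer starts page-aligned."""
    fields = [("param", 2), ("grad", 2), ("m", 4), ("v", 4)]
    offset, layout = 0, {}
    for li, n in enumerate(layer_sizes):
        entry = {}
        for name, width in fields:
            nbytes = n * width
            entry[name] = (offset, nbytes)
            offset += nbytes
        layout[li] = entry
        offset = align(offset)  # next layer begins on a page boundary
    return layout, offset

layout, total = flat_layout([1024, 2048])
```

Keeping each layer's tensors contiguous is what makes `StreamIn(i)` a single DMA over one host range rather than a gather over scattered allocations.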
Communication is fully asynchronous, using cudaMemcpyAsync over the three dedicated streams with torch.cuda.Event-based synchronization. Optimization runs on the CPU using DeepSpeed’s SIMD-accelerated CPUAdam, with OpenMP threads affinitized to NUMA domains. A C++/CUDA extension enables batch parameter binding, reducing Python dispatch overhead by approximately 10%.
Two limitations are observed:
- For each layer $i$, the parameter transfer time must not exceed the previous layer’s compute time ($t_{\text{transfer}}(i) \le t_{\text{compute}}(i-1)$); extremely wide, shallow layers with a low compute-to-transfer ratio may become transfer-bound.
- The host CPU remains on the critical path for parameter updates. Insufficiently parallel CPUs or large models can create an “optimizer wall,” limiting effective scaling.
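The first limitation reduces to a simple ratio test when layers are roughly uniform. The bandwidth and throughput numbers below are placeholders, not measured values from the paper:

```python
def is_transfer_bound(layer_bytes, layer_flops, pcie_gbps=25.0, gpu_tflops=150.0):
    """True if streaming a layer over PCIe takes longer than computing it,
    i.e. the H2D prefetch cannot hide behind the previous layer's compute
    (assuming uniform layers, so t_compute(i-1) ~= t_compute(i))."""
    t_transfer = layer_bytes / (pcie_gbps * 1e9)   # seconds on the H2D stream
    t_compute = layer_flops / (gpu_tflops * 1e12)  # seconds on the compute stream
    return t_transfer > t_compute

# A wide, shallow layer with little compute per byte is transfer-bound:
wide = is_transfer_bound(layer_bytes=2e9, layer_flops=1e10)
# A compute-dense block easily hides its own transfer:
dense = is_transfer_bound(layer_bytes=2e8, layer_flops=5e12)
```

Transformer blocks, whose FLOPs grow faster than their parameter bytes with width, generally fall on the compute-bound side of this test; pure embedding or projection layers may not.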
5. Redefining LLM Training Feasibility Boundaries
Horizon-LM decouples maximum scalable model size from GPU local memory—only per-layer data need fit on device. As a result, single-node, single-GPU training becomes viable for LLMs comprising hundreds of billions of parameters, provided host RAM is sufficient and per-layer width is within GPU capacity. This enables post-training workloads—such as instruction tuning, alignment, and domain adaptation—previously hindered by distributed system complexity and memory unpredictability.
The boundary for large-model training on commodity and specialist hardware now shifts from GPU local memory to aggregate host RAM and PCIe transfer characteristics. Realizing fully predictable, linear memory growth and eliminating the need for complex distributed execution, Horizon-LM establishes a new paradigm for resource-efficient, node-scale LLM optimization (Yuan et al., 4 Feb 2026).