
Horizon-LM: Memory-Centric LLM Training

Updated 8 February 2026
  • Horizon-LM is a memory-centric architecture that redistributes persistent and transient state between CPU and GPU to optimize large language model training.
  • It utilizes a CPU-master ∥ GPU-template paradigm with asynchronous parameter streaming via CUDA, eliminating per-layer GPU state and ensuring linear host memory growth.
  • Horizon-LM enables efficient single-node training of LLMs up to 120B parameters, outperforming systems like ZeRO-3 with predictable memory usage and superior TFLOPS performance.

Horizon-LM is a memory-centric system architecture for large-language-model (LLM) training that redefines the distribution of persistent and transient state between CPUs and GPUs. Unlike conventional GPU-centric paradigms—which retain complete model replicas and autograd graphs within GPU memory—Horizon-LM treats host memory (RAM) as the authoritative parameter store, relegating the GPU to a stateless compute engine. This design enables single-node training of LLMs with parameter scales up to 120B on commodity hardware by bounding GPU memory usage to the per-layer (rather than total-model) footprint and by ensuring predictable, strictly linear memory growth in host RAM (Yuan et al., 4 Feb 2026).

1. System Architecture and Execution Model

At the core of Horizon-LM is the CPU-master ∥ GPU-template paradigm. All persistent state—model parameters θ (BF16), gradients ∇θ (BF16), and Adam optimizer moments m, v (FP32)—resides in host RAM within a layer-contiguous “flat-tensor” layout. The GPU maintains only two fixed-size staging buffers used for parameter streaming and a small workspace for activations and checkpoints.

The training loop is orchestrated by a CPU scheduler using four core primitives:

| Primitive | Operation | Data Motion |
|---|---|---|
| StreamIn(i) | Asynchronous DMA transfer of θ_i to the GPU | Host → GPU |
| Bind(i) | Attach θ_i to a reusable operator template | Pointer update |
| Compute(i) | Execute forward or local backward on layer i | GPU compute only |
| Offload(i) | Asynchronous DMA of ∇θ_i back to host RAM | GPU → Host |

Persistent per-layer GPU state is eliminated; no global autograd graph is stored on device. Activations are checkpointed every K layers and recomputed as needed. The forward pass streams parameters layer by layer, releasing each buffer after compute; the backward pass proceeds in blockwise chunks, loading checkpoints and explicitly recomputing intermediate activations as required.
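The four primitives can be sketched as a sequential loop; this is a minimal illustration, not the paper's actual API, and the names (`host_params`, `gpu_buffers`, `forward_fn`) are assumptions. Transfers are simulated with plain Python assignments, and the overlap between streams is omitted here.

```python
# Minimal sketch of the CPU-scheduler loop over the four primitives.
# In the real system StreamIn/Offload are asynchronous DMA transfers;
# here they are simulated with plain Python list assignments.

def train_step(host_params, forward_fn, num_buffers=2):
    """Stream each layer's parameters through a fixed pool of staging
    buffers: StreamIn -> Bind -> Compute -> Offload."""
    gpu_buffers = [None] * num_buffers      # fixed-size staging slots
    host_grads = [None] * len(host_params)  # gradients land back in host RAM
    x = 1.0                                 # stand-in activation
    for i, theta in enumerate(host_params):
        slot = i % num_buffers              # double buffering: reuse 2 slots
        gpu_buffers[slot] = theta           # StreamIn(i): host -> GPU copy
        bound = gpu_buffers[slot]           # Bind(i): pointer update only
        x, grad = forward_fn(bound, x)      # Compute(i): layer compute
        host_grads[i] = grad                # Offload(i): GPU -> host copy
        gpu_buffers[slot] = None            # release slot for the next layer
    return x, host_grads
```

Because only `num_buffers` slots ever exist, device-side parameter memory stays fixed regardless of model depth, which is the essence of the bound in Section 2.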

A pipelined, double-buffered execution engine leverages three dedicated CUDA streams:

  • Compute (forward, recompute, local backward)
  • H2D (asynchronous parameter prefetch)
  • D2H (background gradient offload)

Buffer state is managed by lightweight CUDA event handshakes, ensuring that parameter streaming and gradient offload are overlapped with compute to maximize device utilization.
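The steady-state overlap across the three streams can be written out as a schedule: while layer i computes, layer i+1 prefetches on the H2D stream and layer i-1's gradients drain on the D2H stream. The builder below is an illustrative sketch, not the paper's scheduler.

```python
# Hypothetical schedule builder illustrating three-stream overlap:
# at each step, compute(i) runs concurrently with h2d(i+1) prefetch
# and d2h(i-1) gradient offload.

def pipeline_schedule(num_layers):
    steps = []
    for i in range(num_layers):
        step = {"compute": i}
        if i + 1 < num_layers:
            step["h2d"] = i + 1            # prefetch next layer's parameters
        if i > 0:
            step["d2h"] = i - 1            # offload previous layer's gradients
        steps.append(step)
    steps.append({"d2h": num_layers - 1})  # drain the final gradient
    return steps
```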

2. Memory Footprint Characterization

Host memory usage in Horizon-LM is bounded by the total parameter, gradient, and optimizer state footprint:

M_CPU^min ≥ P · (B_θ + B_g + B_m) = P · (2 + 2 + 8) = 12P bytes

with a practical implementation overhead for slab management and page-pinning:

M_CPU ≈ 12P + S_slab + O(P_max)

where P is the total parameter count (in elements), S_slab is the overhead for pinned slabs, and P_max is the largest per-layer parameter allocation.
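The 12P lower bound follows directly from the per-element byte counts (2 B BF16 parameters, 2 B BF16 gradients, 8 B for the two FP32 Adam moments); a quick calculator makes the scale concrete:

```python
# Back-of-envelope host-memory lower bound from the 12P rule:
# BF16 params (2 B) + BF16 grads (2 B) + FP32 Adam moments m, v (4 + 4 B).

def host_memory_lower_bound_bytes(num_params):
    B_theta, B_g, B_m = 2, 2, 8   # bytes per element for theta, grad, (m, v)
    return num_params * (B_theta + B_g + B_m)
```

For a 120B-parameter model this gives 12 × 120e9 = 1.44e12 bytes, consistent with the ~1.44 TB host footprint reported for the H200 run in Section 3.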

GPU memory consumption is strictly bounded by:

M_GPU ≤ c_p · P_max + c_a · K · A_max + W_GPU

where c_p ≈ 2 (double buffering), c_a is a small constant, K is the checkpoint interval, A_max denotes the maximal per-layer activation footprint, and W_GPU is miscellaneous workspace. Notably, this bound is depth-independent; model depth affects only the host memory footprint, and only proportionally.
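The GPU bound can be evaluated the same way; the coefficients and byte values below are illustrative assumptions chosen for the example, not measurements from the paper:

```python
# Illustrative evaluation of M_GPU <= c_p*P_max + c_a*K*A_max + W_GPU.
# All arguments are in bytes; defaults (c_p=2, c_a=1, K=12, 1 GiB
# workspace) are assumptions for the sketch.

def gpu_memory_bound_bytes(p_max, a_max, k=12, c_p=2, c_a=1, workspace=2**30):
    return c_p * p_max + c_a * k * a_max + workspace
```

The key property is that `gpu_memory_bound_bytes` takes no depth argument at all: adding layers changes the host footprint but leaves this device bound untouched.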

In contrast, offloading systems such as ZeRO-3 must repeatedly assemble large parameter shards and replicate optimizer state within host RAM, resulting in superlinear host usage that scales approximately as O(P log P) under fragmentation, substantially exceeding the 12P lower bound.

3. Empirical Performance and Scaling Results

Horizon-LM demonstrates its efficacy across diverse single-node configurations. On a single NVIDIA H200 (1.5 TB host RAM), it trains 120B-parameter models at sustained 250–270 TFLOPS without exhausting device memory. Competing systems (ZeRO-3, native PyTorch) are unable to proceed beyond 32B parameters. Host memory usage remains strictly linear—approximately 1.44 TB at 120B—whereas offloading baselines exceed 2.5 TB.

On a standard A100 (80 GB PCIe Gen4, 600 GB RAM):

| Model Scale | Horizon-LM (TFLOPS) | Gemini (TFLOPS) | ZeRO-3 (TFLOPS) | Speedup vs. ZeRO-3 |
|---|---|---|---|---|
| 7B | 128 | 53 | 36 | 3.6× |
| 14B | 122 | 15 | 10 | 12.2× |
| 32B | 114 | OOM | OOM | n/a |
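The reported speedups are simply the TFLOPS ratios; a quick check (rounding to one decimal is an assumption on the article's convention):

```python
# Verify the relative-speedup column from the throughput columns:
# Horizon-LM TFLOPS divided by ZeRO-3 TFLOPS, rounded to one decimal.

def relative_speedup(horizon_tflops, baseline_tflops):
    return round(horizon_tflops / baseline_tflops, 1)
```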

Numerical correctness is maintained: MetaMathQA evaluation for both 7B and 14B models shows exact-match accuracies of ≈89% and ≈92.5%, marginally exceeding offloading baselines. Scaling in width (up to 5×) and depth (up to 6×) results in no more than 35% throughput loss, a range in which baselines fail due to memory exhaustion.

4. Implementation Details and Bottlenecks

Horizon-LM’s host RAM is managed via layer-contiguous tiling: per-layer tuples (θ_i, ∇θ_i, m_i, v_i) are packed into page-aligned memory blocks. Only two page-locked slabs are allocated, so pinned memory cannot grow. Back-pressure from gradient slabs (with a default checkpoint interval K = 12) bounds memory use.

Communication is fully asynchronous, using cudaMemcpyAsync over the three streams with torch.cuda.Event for synchronization. The optimizer step runs on the CPU using DeepSpeed’s SIMD-accelerated CPUAdam, with OpenMP threads affinitized to NUMA domains. A C++/CUDA extension batches parameter binding, reducing Python dispatch overhead by approximately 10%.
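The CPU-side update is standard Adam over the FP32 moments held in host RAM. The sketch below is textbook Adam standing in for DeepSpeed's SIMD-accelerated CPUAdam; it operates on flat Python lists and all names are illustrative.

```python
import math

# Minimal sketch of the CPU-side optimizer step: FP32 Adam moments m, v
# update parameters held in host RAM. Textbook Adam, standing in for
# DeepSpeed's CPUAdam; not the library's actual implementation.

def cpu_adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """In-place Adam update over flat lists (one element per weight)."""
    for j in range(len(theta)):
        m[j] = b1 * m[j] + (1 - b1) * grad[j]          # first moment
        v[j] = b2 * v[j] + (1 - b2) * grad[j] ** 2     # second moment
        m_hat = m[j] / (1 - b1 ** t)                   # bias correction
        v_hat = v[j] / (1 - b2 ** t)
        theta[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Because this loop touches every parameter each step, it is exactly the work that lands on the critical path when the "optimizer wall" described below appears.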

Two limitations are observed:

  • For each layer, the streaming time P_i / B_PCIe must not exceed the previous layer’s compute time T_{i-1}^comp; extremely wide, shallow layers with a low compute-to-transfer ratio can become transfer-bound.
  • The host CPU remains on the critical path for parameter updates. Insufficiently parallel CPUs or very large models can create an “optimizer wall” that limits effective scaling.
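The first condition is easy to evaluate numerically; the helper below is a sketch, and the PCIe bandwidth used in the usage note is an assumed example value, not a figure from the paper:

```python
# Sketch of the transfer-bound condition: a layer's parameter upload is
# hidden only if streaming time (bytes / PCIe bandwidth) fits inside the
# previous layer's compute time.

def is_transfer_bound(layer_bytes, pcie_bw_bytes_per_s, prev_compute_s):
    stream_time = layer_bytes / pcie_bw_bytes_per_s
    return stream_time > prev_compute_s
```

For example, assuming ~25 GB/s of effective PCIe Gen4 x16 bandwidth, a 2.5 GB layer needs 100 ms to stream and is transfer-bound whenever the preceding layer computes in less than that.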

5. Redefining LLM Training Feasibility Boundaries

Horizon-LM decouples maximum scalable model size from GPU local memory—only PmaxP_{max} per-layer data need fit on device. As a result, single-node, single-GPU training becomes viable for LLMs comprising hundreds of billions of parameters, provided host RAM is sufficient and per-layer width is within GPU capacity. This enables post-training workloads—such as instruction tuning, alignment, and domain adaptation—previously hindered by distributed system complexity and memory unpredictability.

The boundary for large-model training on commodity and specialist hardware now shifts from GPU local memory to aggregate host RAM and PCIe transfer characteristics. Realizing fully predictable, linear memory growth and eliminating the need for complex distributed execution, Horizon-LM establishes a new paradigm for resource-efficient, node-scale LLM optimization (Yuan et al., 4 Feb 2026).
