
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models

Published 17 Jun 2025 in cs.AR, cs.AI, cs.DC, cs.ET, and cs.PF | (2506.15006v3)

Abstract: The explosive growth of LLMs, such as GPT-4 with 1.8 trillion parameters, demands a fundamental rethinking of data center architecture to ensure scalability, efficiency, and cost-effectiveness. Our work provides a comprehensive co-design framework that jointly explores FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), the size of the scale-out domain, and popular parallelism/optimization strategies used in LLMs. We introduce and evaluate FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, and demonstrate their transformative impact on performance and scalability. Through detailed sensitivity analyses, we quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, widening the scale-out domain, and increasing memory capacity. Our study spans both sparse (mixture of experts) and dense transformer-based LLMs, revealing how system design choices affect Model FLOPS Utilization (MFU = Model FLOPS per token * Observed tokens per second / Peak FLOPS of the hardware) and overall throughput. For the co-design study, we utilized an analytical performance modeling tool capable of predicting LLM runtime within 10% of real-world measurements. Our findings offer actionable insights and a practical roadmap for designing AI data centers that can efficiently support trillion-parameter models, reduce optimization complexity, and sustain the rapid evolution of AI capabilities.

Summary

  • The paper introduces a comprehensive co-design framework that integrates compute, memory, network topology, and algorithmic parallelism to efficiently train trillion-parameter LLMs.
  • The study demonstrates that FullFlat optical architectures can achieve up to 70× throughput improvements and reduce TCO by 20–30% compared to traditional two-tier networks.
  • The extended Calculon tool accurately predicts LLM performance within a 10% margin, enabling optimal system configurations across thousands of design points.

Scaling Intelligence: Data Center Co-Design for Next-Gen LLMs

Introduction

The emergence of multi-trillion parameter LLMs, such as GPT-4 (1.8T parameters), demands a radical re-examination of AI data center design. The escalating computational, memory, and networking requirements place pressure on current infrastructure, where Model FLOPS Utilization (MFU) often falls below 50%. The paper "Scaling Intelligence: Designing Data Centers for Next-Gen LLMs" (2506.15006) delivers a comprehensive co-design framework and analytical evaluation—jointly considering compute, bandwidth, memory, network topology, and algorithmic parallelism—targeted at enabling efficient, cost-optimal training of state-of-the-art and future LLMs.
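As a concrete illustration of the MFU definition used throughout the paper, it can be written as a one-line function (the function name and the sample numbers below are ours, not the paper's):

```python
def model_flops_utilization(flops_per_token: float,
                            tokens_per_second: float,
                            peak_flops: float) -> float:
    """MFU = (model FLOPS per token * observed tokens/s) / peak hardware FLOPS."""
    return (flops_per_token * tokens_per_second) / peak_flops

# Hypothetical numbers: 2 TFLOPS/token, 5,000 tokens/s, 20 PFLOPS peak.
print(f"MFU = {model_flops_utilization(2e12, 5_000, 20e15):.0%}")  # prints "MFU = 50%"
```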

Analytical Framework and Methodology

The study introduces an enhanced analytical performance model, extending the Calculon tool to accurately project runtimes and resource sensitivities for both dense (e.g., GPT-3) and sparse Mixture of Experts (MoE, e.g., GPT-4) architectures. The tool models:

  • Application parameters: embedding, sequence size, hidden/attention dimensions, batch sizes.
  • System parameters: GPU compute (FP4–FP16), HBM/Tier-2 memory capacity and bandwidth, network link BW/latency.
  • Implementation/optimizations: parallel strategies (DP, PP, TP, EP, SP, ES), recompute/offloading, kernel fusion, collective operations (all-reduce/all-to-all), overlapping compute and communication.

This simulator achieves a prediction accuracy within 10% of real-world measurements on modern LLM clusters, and allows exhaustive exploration of thousands of hardware-software design points.
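A drastically simplified sketch of the kind of per-step estimate such an analytical model produces (Calculon's actual model is far more detailed; all names and the structure here are illustrative only):

```python
def step_time_estimate(flops: float, peak_flops: float,
                       hbm_bytes: float, hbm_bw: float,
                       comm_bytes: float, net_bw: float) -> float:
    """Roofline-style estimate (seconds): on-GPU work is bounded by either
    compute throughput or HBM bandwidth; network time is added serially here
    (compute/communication overlap is a separate knob in the real tool)."""
    gpu_time = max(flops / peak_flops, hbm_bytes / hbm_bw)
    net_time = comm_bytes / net_bw
    return gpu_time + net_time

# A compute-bound step: 2 PFLOPs of work on a 1 PFLOPS GPU dominates.
print(step_time_estimate(2e15, 1e15, 1e12, 2e12, 5e11, 1e12))  # 2.5
```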

LLM Parallelism and Optimization Landscape

  • The explosive growth in LLM size necessitates intricate, multi-axis parallelism strategies (data, pipeline, tensor, sequence, and expert parallelism/sharding), each mapped to different bandwidth/memory requirements and collective communication patterns.
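These axes compose multiplicatively: the product of the data-, tensor-, pipeline-, and expert-parallel degrees must equal the GPU count, which is part of why the design space is so large. A toy enumeration of the combinatorics (the limits and constraints here are illustrative, not the paper's):

```python
from itertools import product

def parallelism_mappings(n_gpus: int, max_tp: int = 8, max_pp: int = 16):
    """Yield (dp, tp, pp, ep) tuples with dp * tp * pp * ep == n_gpus.
    Real mappings add many more constraints (memory fit, sequence/expert
    sharding, HBD boundaries); this only shows the raw combinatorics."""
    divisors = [d for d in range(1, n_gpus + 1) if n_gpus % d == 0]
    for tp, pp, ep in product(divisors, repeat=3):
        if tp > max_tp or pp > max_pp:
            continue
        rest = tp * pp * ep
        if n_gpus % rest == 0:
            yield (n_gpus // rest, tp, pp, ep)

print(len(list(parallelism_mappings(64))))
```

Even at 64 GPUs with tight caps on tensor- and pipeline-parallel degree, dozens of valid factorizations remain, and each interacts differently with network topology.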

Key findings:

  • MoE models induce significant all-to-all communication overhead for expert routing (Figure 1), contrasting with the higher arithmetic intensity of dense models.
  • The choice and interaction of optimization parameters yield a vast, non-trivial design space: suboptimal configurations may result in >80% performance degradation compared to the optimal point (Figure 2).

    Figure 2: Running LLMs with suboptimal parameter configurations on different system configurations; FullFlat enables substantially higher MFU and is less sensitive to misconfiguration.

This parameter sensitivity underscores the vital importance of analytical tools for early-stage exploration and system design.
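The all-to-all overhead behind the MoE finding above can be approximated with the standard alpha-beta (latency-bandwidth) cost model; this is a textbook first-order estimate, not the paper's simulator:

```python
def all_to_all_time(p: int, total_bytes: float,
                    alpha_s: float, bw_bytes_per_s: float) -> float:
    """First-order all-to-all estimate across p ranks: each rank exchanges
    total_bytes / p with each of its p - 1 peers, paying per-message
    latency alpha_s plus serialization time on each exchange."""
    per_peer = total_bytes / p
    return (p - 1) * (alpha_s + per_peer / bw_bytes_per_s)
```

Because the cost grows with the rank count when per-peer message sizes stay fixed, keeping expert-parallel traffic inside a high-bandwidth domain (or making the whole fabric flat) pays off quickly.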

Network Topology: Two-Tier vs. FullFlat Optical

The study systematically compares conventional two-tier networks (high-bandwidth domains, HBDs, such as NVLink, XeLink, and Infinity Fabric, combined with low-bandwidth scale-out) to advanced all-optical FullFlat architectures (Figure 3). FullFlat leverages high-radix, low-diameter, co-packaged-optics meshes (HyperX, PolarFly), delivering uniform high throughput and low latency across all nodes.

Figure 3: Network topologies: 2D HyperX, 2D HyperX with attached GPUs, and PolarFly, underlying future FullFlat architectures.

Core claims:

  • FullFlat topologies consistently deliver higher throughput and better scaling, especially when expert parallel communication fits fully within HBD.
  • FullFlat is less sensitive to missing software optimizations (e.g., compute/communication overlap, hardware collectives), with only a 5% gap between top parameter configurations versus up to 80% in two-tier systems.
  • FullFlat reduces switch count, hop overhead, and can decrease TCO by 20–30%, with anticipated additional gains in reliability, resiliency, and serviceability.
  • Two-tier enhancements (such as HBDs expanding from 8 to 576 GPUs with high BWs in future NVLink generations) can delay, but not fundamentally resolve, scale-out communication bottlenecks for large MoE models.

Empirical Sensitivity Results

Strong Scaling and Throughput

The analysis reveals substantial gains in throughput and scaling with increasing system size and advanced architectures:

  • Scaling GPT-4 (1.8T parameters) from today's TwoTier-HBD8 to FullFlat raises throughput 50–70× at 4K GPUs (Figure 5), with optimal scaling observed to 16K–32K GPUs for large models.
  • Communication and collective overheads (especially for MoE all-to-all) dominate when HBD sizes are insufficient or scale-out bandwidth is limited, at both strong and weak scaling points.

Figure 5: Strong scaling of GPT-4 on different system configurations; FullFlat and advanced two-tier architectures unlock significantly higher throughput.

Hardware-Accelerated Collectives & Comp-Comm Overlap

  • Replacing hardware-accelerated collectives with software implementations imposes 10–16% slowdowns at scale, while missing compute/communication overlap drives further degradation (up to 15% for MoEs; >40% for dense models at large node counts).
  • FullFlat architectures are substantially more robust to these software omissions, decoupling system utilization from the need for intricate low-level tuning.
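The gap between overlapped and serialized execution quoted above can be bounded with a simple calculation (illustrative, not the paper's model):

```python
def overlap_penalty(compute_s: float, comm_s: float) -> float:
    """Fractional slowdown when compute and communication are serialized
    instead of perfectly overlapped (the idealized lower bound)."""
    return (compute_s + comm_s) / max(compute_s, comm_s) - 1.0

# Communication at 40% of compute time -> up to 40% slowdown without overlap.
print(f"{overlap_penalty(1.0, 0.4):.0%}")  # prints "40%"
```

This is an upper bound on the benefit of overlap; real kernels rarely hide communication perfectly, which is why the measured penalties in the paper sit below it.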

Memory Bandwidth and Capacity

  • HBM bandwidth increases (up to 30–48TB/s per GPU) yield 3–4.5× throughput gains for MoEs and 2.6× for dense LLMs.
  • Larger HBM capacity (toward 1.3TB/GPU, Figure 6) directly lowers the minimal degree of parallelism, eliminates recomputation/offloading, and boosts achievable MFU, especially for trillion-parameter models.

Figure 6: Performance impact of increasing HBM memory capacity; throughput gain plateaus beyond the point where the model fits fully resident in GPU memory.
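To see why capacity on the order of 1.3 TB/GPU matters, consider a back-of-envelope sharding bound (the byte counts and the 4× optimizer-state multiplier are our assumptions, not the paper's):

```python
import math

def min_model_shards(param_bytes: float, hbm_bytes: float,
                     state_multiplier: float = 4.0) -> int:
    """Minimum GPUs the model state must be sharded across to fit in HBM
    (weights plus optimizer state; activations ignored for brevity)."""
    return math.ceil(param_bytes * state_multiplier / hbm_bytes)

# 1.8T parameters at 2 bytes each, with 1.3 TB of HBM per GPU:
print(min_model_shards(1.8e12 * 2, 1.3e12))  # 12
```

Smaller HBM forces a larger minimum model-parallel degree, which in turn pushes more traffic onto the network and shrinks the feasible configuration space.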

Impact Factor Ranking

Prioritization for ROI in data center investment must balance:

  • Compute FLOPS, Scale-up bandwidth, and HBM bandwidth as primary drivers.
  • Adequate HBD to absorb the expert/tensor parallel domains, especially in MoEs.
  • Scale-out BW remains necessary as system size and EP increase, but with diminishing marginal returns past a threshold where key communications remain within HBD.
  • Software optimizations (collective overlap, hardware collectives) are less critical with FullFlat, but must be addressed in two-tier systems.

FullFlat Optical: Systemic Implications

FullFlat optical networks represent a paradigm shift in data center architecture for AI:

  • They enable previously impractical optimization regimes, such as large-scale tensor/expert parallelism across the entire cluster.
  • Achieve high MFU (Figure 7) and dramatically reduce the sensitivity of performance to configuration error or software inefficiencies.
  • Design simplification: removal of strict scale-up/out boundaries and increased flexibility/resiliency in workload placement and hardware maintenance.

    Figure 7: Scaling of compute efficiency (MFU) with the number of GPUs for the FullFlat configuration; utilization remains high even at scale.

Dense (GPT-3) vs. MoE Workloads

  • Dense models (e.g., GPT-3-175B) exhibit higher arithmetic intensity and lower networking sensitivity, but become significantly less tolerant to missing compute/comm overlap and hardware collectives (>40% drop at scale).
  • MoEs, while more network-intensive, are comparatively buffered from these software inefficiencies within FullFlat or high-HBD systems; their main bottleneck is the availability of sufficient intra-expert BW.

Practical Guidelines and Future Data Center Design

According to model/application projections:

  • Next-gen systems should target ≥1.6TB/s scale-up and ≥200GB/s scale-out BW, 20 PF16 per GPU, 1.3TB HBM, and 256GB/s tier-2 memory BW.
  • FullFlat (CPO-based) topologies with 64–1,024 node HBDs and optical mesh interconnects are essential for holistic utilization at million-GPU scale.
  • Software frameworks must be equipped for rapid, static, ridgeline/impact factor analysis, and support all relevant parallelism axes and overlap strategies out-of-the-box.
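The guideline targets above can be captured as a small checklist; the dictionary keys and helper below are our naming, with values taken from the paper's recommendations:

```python
# Guideline targets from the paper (per GPU unless noted).
TARGETS = {
    "scale_up_bw_tb_s": 1.6,     # >= 1.6 TB/s scale-up bandwidth
    "scale_out_bw_gb_s": 200,    # >= 200 GB/s scale-out bandwidth
    "peak_pflops_fp16": 20,      # 20 PF16 compute
    "hbm_capacity_tb": 1.3,      # 1.3 TB HBM
    "tier2_bw_gb_s": 256,        # 256 GB/s tier-2 memory bandwidth
}

def shortfalls(system: dict) -> list:
    """Return the guideline keys on which a candidate system falls short."""
    return [k for k, target in TARGETS.items() if system.get(k, 0) < target]
```

A candidate configuration that returns an empty list meets every stated target; any returned keys point at the investment areas to prioritize.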

Conclusion

The architectural requirements for next-generation LLMs effectively mandate a systemic co-design approach—one that coordinates compute, memory, network, and software tactics. The analysis in (2506.15006) makes clear that:

  • FullFlat optical data centers fundamentally improve throughput, utilization, operational simplicity, and TCO for trillion-parameter LLMs.
  • As model size and algorithmic diversity grow, the resiliency, programmability, and MFU delivered by FullFlat and related network topologies will become indispensable.
  • The co-design and sensitivity methodology empowers early-phase investment and design decisions, revealing not only critical bottlenecks but also identifying clear stop-points for incremental resource additions, maximizing ROI.

Thus, the practical, theoretical, and operational roadmap to support future LLMs rests on balanced investments in compute, bandwidth, optical connectivity, and robust co-design tooling, ensuring data centers remain scalable, efficient, and future-proof.
