
Joint Cooling & Computing Modeling

Updated 20 January 2026
  • Joint cooling and computing modeling is an interdisciplinary approach that integrates high-fidelity thermal simulations, workload-aware control, and optimization to enhance data center performance.
  • It couples physical thermal dynamics with computational resource management by using surrogate models, MILP, MPC, and reinforcement learning to balance energy use and cooling efficiency.
  • Applications range from hyperscale data centers to HPC clusters and 3DIC design, achieving significant energy savings, reduced hotspots, and improved system reliability.

Joint cooling and computing modeling approaches address the coupled interactions between information technology (IT) workloads and facility thermal management, capturing both the physics of heat generation and transfer and the control and optimization of computation and cooling resources. This interdisciplinary domain integrates high-fidelity thermal models, workload-aware control, data-driven surrogate models, optimization theory, and cyber-physical system implementation. It serves both research and day-to-day operation in hyperscale data centers, AI supercomputing clusters, edge infrastructures, and 3DIC platforms.

1. Physical Principles and Modeling Paradigms

Joint cooling–computing modeling fundamentally couples three physical domains: (i) heat generation in active compute hardware, (ii) thermal transport via conduction, convection (air or liquid), and, where relevant, microfluidics, and (iii) heat rejection/absorption into the environment. Representative frameworks instantiate these couplings at diverse spatial and temporal scales.

  • Component- and System-Level Coupling: Systems such as the robust bottom-up framework by Lei et al. explicitly model server, storage, and network power as a function of workload and utilization, forwarding instantaneous power to a coupled thermodynamics-based cooling model. Cooling power is then determined by plant COP, climate conditions, and system design (CRAC units, economizers), integrated via $P_{cool}(t) = (\mathrm{PUE}(t) - 1)\,P_{IT}(t)$ or more precise COP and heat-rejection relationships (Lei, 2020).
  • Detailed Heat Flow Network Models: For HPC liquid cooling, high-fidelity digital twins (LC-Opt) model blade-, rack-, and facility-level flows, with per-blade lumped capacitance heat equations ($C_{ij} \frac{d T_{ij}}{dt}$), coolant mass/momentum/energy balances, and empirical or first-principles heat exchanger/tower models. All thermal domains are coupled through fluid and mechanical subloops, supporting multi-point and site-scale interaction (Naug et al., 31 Oct 2025).
  • Microscale and Stack-Level (3DIC): In chip co-design, Cool-3D integrates cycle-accurate microarchitectural simulation (gem5) with time-synchronized power traces (McPAT/CACTI-3DD), then directly injects these into HotSpot’s 3D RC network, including microfluidic channel thermal resistances and Nusselt/Reynolds-based convective transfer (Wang et al., 10 Mar 2025).
  • Surrogate and Vision-Based Models: Where real-time operation precludes CFD, deep-learning surrogates voxelize the physical environment (with channels for power, cooling, and airflow) and learn mappings $f:\mathbb{R}^{3 \times D \times H \times W} \rightarrow \mathbb{R}^{1 \times D \times H \times W}$ to instantaneous temperature fields (Sarkar et al., 13 Nov 2025).
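The PUE relation in the first bullet can be evaluated directly. The following is a minimal sketch, not the implementation from Lei (2020): the function names and numeric values are hypothetical, and the COP variant assumes that all IT power ends up as heat the plant must remove.

```python
def cooling_power_from_pue(p_it_kw: float, pue: float) -> float:
    """Overhead power implied by P_cool = (PUE - 1) * P_IT.

    This attributes all non-IT facility overhead to cooling, as the
    simplified relation in the text does.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1")
    return (pue - 1.0) * p_it_kw


def cooling_power_from_cop(p_it_kw: float, cop: float) -> float:
    """Electrical power needed to remove p_it_kw of heat at a given plant COP.

    Assumption: every watt of IT power becomes heat (COP = heat removed / work).
    """
    if cop <= 0.0:
        raise ValueError("COP must be positive")
    return p_it_kw / cop


# Illustrative values only: a 1 MW IT load.
p_cool_pue = cooling_power_from_pue(1000.0, 1.25)
p_cool_cop = cooling_power_from_cop(1000.0, 4.0)
```

With these illustrative inputs the two routes agree, because a PUE of 1.25 and a COP of 4 both imply a cooling overhead of one quarter of the IT load. In practice they differ whenever the facility has overhead other than cooling.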

2. Mathematical Foundations and Control Formulations

The frameworks surveyed here formulate explicit joint optimization problems, often subject to both hard performance constraints and multi-objective trade-offs.

  • General Formulations: The canonical problem minimizes combined IT and cooling energy, $\min_{x}\, E_{IT}(x) + E_{cooling}(x)$, over workload assignment, server utilization, and cooling-system control settings, subject to per-server and system-level thermal and QoS/SLA constraints (Rostami et al., 2023, Arroba et al., 2023, Lei, 2020). For instance,

$$
\begin{aligned}
\min_{v,\,\rho}\quad & F(v) + \sum_{i=1}^{n} P_{i}(\rho_{i}) \\
\text{subject to}\quad & \sum_{i=1}^{n} \rho_i \ge D, \\
& M_{l}(v, \rho) \le T_{\mathrm{idle}} - (T_{\mathrm{idle}} - T_{\mathrm{busy}})\,\rho_l, \\
& V_{LB} \le v \le V_{UB}, \quad \rho_i \in \{0, 1\}
\end{aligned}
$$

where $F(v)$ parameterizes cooling power and $M_l$ captures inlet temperatures that account for heat recirculation (Rostami et al., 2023).
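For intuition, a tiny instance of this binary program can be solved by exhaustive search. The sketch below is illustrative only and is not the method of Rostami et al. (2023): it reads $T_{\mathrm{idle}}$ and $T_{\mathrm{busy}}$ as inlet temperature limits for idle and busy servers, discretizes $v$ on a grid, and uses made-up forms for $F(v)$, $P_i(\rho_i)$, and $M_l(v, \rho)$ (a cubic fan law, a two-level server power, and a recirculation matrix scaled by $1/v$). All names and numbers are hypothetical.

```python
from itertools import product

# Hypothetical toy data: 4 servers, scalar airflow v discretized on a grid.
N = 4
DEMAND = 2                       # D: minimum number of busy servers
T_IDLE, T_BUSY = 35.0, 27.0      # assumed inlet limits (deg C) for idle / busy servers
T_SUPPLY = 18.0                  # assumed supply air temperature (deg C)
P_IDLE, P_BUSY = 100.0, 300.0    # assumed per-server power (W)
V_GRID = [1.0 + 0.25 * k for k in range(13)]   # V_LB = 1.0 ... V_UB = 4.0
# RECIRC[l][j]: assumed inlet temperature rise at server l per watt of server j at v = 1.
RECIRC = [
    [0.030, 0.012, 0.006, 0.003],
    [0.012, 0.036, 0.012, 0.006],
    [0.006, 0.012, 0.045, 0.015],
    [0.003, 0.006, 0.015, 0.060],
]


def server_power(rho_i):
    """P_i(rho_i): idle power plus the busy increment."""
    return P_IDLE + (P_BUSY - P_IDLE) * rho_i


def cooling_power(v):
    """F(v): assumed cubic fan law."""
    return 20.0 * v ** 3


def inlet_temp(l, v, rho):
    """M_l(v, rho): supply temperature plus recirculated heat, diluted by airflow."""
    rise = sum(RECIRC[l][j] * server_power(rho[j]) for j in range(N))
    return T_SUPPLY + rise / v


def is_feasible(v, rho):
    """Check the demand constraint and every per-server thermal constraint."""
    if sum(rho) < DEMAND:
        return False
    return all(
        inlet_temp(l, v, rho) <= T_IDLE - (T_IDLE - T_BUSY) * rho[l]
        for l in range(N)
    )


def solve():
    """Enumerate all binary assignments and grid airflows; return (cost, v, rho)."""
    best = None
    for rho in product((0, 1), repeat=N):
        for v in V_GRID:
            if not is_feasible(v, rho):
                continue
            cost = cooling_power(v) + sum(server_power(r) for r in rho)
            if best is None or cost < best[0]:
                best = (cost, v, rho)
    return best


best_cost, best_v, best_rho = solve()
```

Enumeration scales as $2^n$ times the grid size, which is why realistic instances rely on the relaxations and heuristics discussed in this section.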

  • Hierarchical and Multi-Timescale Optimization: Advanced approaches structure joint control across multiple horizons: cluster-level provisioning (tens of minutes), pool scheduling (a few minutes), per-job frequency selection (seconds), and minute-scale cooling MPC. The levels are coupled through explicit ODE models for rack air/heat transfer,

$$
\dot{\theta}_{s, i} = \frac{1}{C_{th}} \left( \Phi_{s,i}\, \rho\, c_p\, (\theta_{c,i} - \theta_{s,i}) + m\, P_{GPU}(f, u) \right)
$$

with computational resource allocation entering as an exogenous disturbance into the thermal dynamics (Abera et al., 13 Jan 2026).
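The rack ODE above can be integrated numerically to see how a workload setting drives temperature. The following is a rough sketch using forward Euler. It assumes $\Phi$ is a volumetric airflow, $\rho c_p$ are air properties, $\theta_c$ is the cold supply temperature, and $m$ is the number of GPUs; the GPU power model and all parameter values are hypothetical and not taken from Abera et al.

```python
RHO_AIR = 1.2      # air density, kg/m^3
CP_AIR = 1005.0    # specific heat of air, J/(kg K)


def gpu_power(freq_ghz, util, p_idle=60.0, k_dyn=80.0):
    """Hypothetical GPU power model (W): idle floor plus utilization-scaled cubic term."""
    return p_idle + k_dyn * util * freq_ghz ** 3


def simulate_rack(theta0, theta_cold, flow_m3s, n_gpus, freq_ghz, util,
                  c_th=5.0e4, dt=1.0, steps=3600):
    """Forward-Euler integration of the rack temperature ODE; returns the final temperature."""
    theta = theta0
    conductance = flow_m3s * RHO_AIR * CP_AIR      # Phi * rho * c_p, W/K
    heat = n_gpus * gpu_power(freq_ghz, util)      # m * P_GPU(f, u), W
    for _ in range(steps):
        dtheta = (conductance * (theta_cold - theta) + heat) / c_th
        theta += dt * dtheta
    return theta


# Illustrative run: 8 GPUs at 1.5 GHz and 90% utilization, 0.5 m^3/s of 18 C air.
theta_final = simulate_rack(25.0, 18.0, 0.5, 8, 1.5, 0.9)
```

For constant inputs the trajectory settles at $\theta_c + m P_{GPU} / (\Phi \rho c_p)$, which gives a quick sanity check on any integrator. The step size must stay well below the time constant $C_{th} / (\Phi \rho c_p)$ for forward Euler to remain stable.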

  • Convexity and Linearization: Linearization and ILP relaxation improve tractability (for instance, by regressing nonlinear cooling and temperature constraints onto approximate affine mappings in the workload and cooling variables) and support scalable heuristics (Rostami et al., 2023).
  • Metaheuristics and RL: For high-dimensional action spaces or facilities with complex constraints, simulated annealing, genetic algorithms, or reinforcement learning agents are deployed to optimize workload-to-server placements, frequency/voltage scaling, and multi-variable cooling setpoints (beyond what is feasible with direct analytic optimization) (Naug et al., 31 Oct 2025, Arroba et al., 2023, Sarkar et al., 13 Nov 2025).
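The linearization idea in the list above can be illustrated with a one-variable least-squares fit. The sketch below regresses a nonlinear cooling term onto an affine function of the cooling variable over a narrow operating range. The cubic cooling model and the sample range are assumptions for illustration, not values from Rostami et al. (2023).

```python
def fit_affine(xs, ys):
    """Ordinary least-squares fit y ~ a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def cooling_power(v):
    """Hypothetical nonlinear cooling term (cubic fan law)."""
    return 20.0 * v ** 3


# Sample the nonlinear term over the assumed operating range 2.0 <= v <= 3.0.
v_samples = [2.0 + 0.1 * k for k in range(11)]
a, b = fit_affine(v_samples, [cooling_power(v) for v in v_samples])

# Worst-case approximation error over the sampled range.
max_abs_err = max(abs(cooling_power(v) - (a + b * v)) for v in v_samples)
```

The affine surrogate `a + b * v` can then replace the cubic term inside an ILP. Its error grows quickly outside the sampled range, so the fit should cover only the range the solver is allowed to explore.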

3. Implementation Strategies and Surrogate Modeling

Practical implementation of joint models requires careful system abstraction, high-throughput simulation, and surrogates for physical solvers that are too slow for online use.

  • Surrogate Modeling for Real-Time Use: 3D surrogate models trained on CFD data (e.g., using 3D CNN U-Nets, Fourier Neural Operators, and vision transformer variants) offer up to $20{,}000\times$ speedup over direct CFD, delivering $<2\,^\circ\mathrm{C}$ MAE on held-out data, supporting real-time, closed-loop thermal optimization (Sarkar et al., 13 Nov 2025).
  • Digital Twin and Gymnasium Integration: Modelica-based digital twins are compiled into FMUs and wrapped with RL-compatible Gymnasium interfaces, exposing observable state vectors (blade temps, rack flows) and multi-modal action spaces (continuous/discrete) for algorithmic exploration (Naug et al., 31 Oct 2025).
  • Multi-Tool Coupling: Early-phase 3DIC co-design requires orchestrating gem5 (performance), McPAT (power), and HotSpot (temperature), with automatic I/O matching, time-trace synchronization, and bidirectional power–temperature feedback to account for temperature-dependent leakage (Wang et al., 10 Mar 2025).
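The bidirectional power-temperature feedback in the last bullet amounts to a fixed-point iteration: leakage depends on temperature, and temperature depends on total power. The sketch below shows that loop in isolation. It is not the Cool-3D toolchain: a single thermal resistance stands in for the HotSpot RC network, the exponential leakage model is a common simplification, and all names and parameter values are hypothetical.

```python
import math


def leakage_power(temp_c, p_leak_ref=10.0, t_ref=45.0, beta=0.02):
    """Hypothetical leakage model (W) that grows exponentially with temperature."""
    return p_leak_ref * math.exp(beta * (temp_c - t_ref))


def steady_temperature(power_w, t_ambient=25.0, r_th=0.4):
    """Single-node stand-in for a thermal solve: T = T_amb + R_th * P."""
    return t_ambient + r_th * power_w


def couple_power_temperature(p_dynamic, tol=1e-6, max_iter=100):
    """Iterate power -> temperature -> leakage until the temperature stops changing.

    Returns (temperature, total_power, iterations).
    """
    temp = steady_temperature(p_dynamic)          # first pass ignores leakage
    for iteration in range(1, max_iter + 1):
        total = p_dynamic + leakage_power(temp)   # update power at current temperature
        new_temp = steady_temperature(total)      # re-solve temperature at updated power
        if abs(new_temp - temp) < tol:
            return new_temp, total, iteration
        temp = new_temp
    raise RuntimeError("no convergence; the operating point may be in thermal runaway")


temp_c, total_w, n_iter = couple_power_temperature(50.0)
```

The loop converges when the leakage sensitivity times the thermal resistance is below one. When that product exceeds one the iteration diverges, which corresponds physically to thermal runaway.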

4. Optimization Algorithms and Control Policies

Diverse algorithmic strategies are adopted across scales, reflecting the complexity of joint optimization landscapes.

  • MILP/ILP and Heuristics: Integer programming governs capacity- and temperature-constrained workload distribution; tractability for large N is achieved by relaxation and specially engineered rounding heuristics (three-phase rounding, greedy extraction, and local-improvement swaps) (Rostami et al., 2023).
  • Multi-Objective Metaheuristics: Power and temperature-aware VM allocation employs best-fit decreasing combined with Pareto-dominance filtering or SA-inspired regression-driven predictors for near-optimal runtime placement (Arroba et al., 2023).
  • MPC and Cascade Control: Cooling-side control is often formulated as receding-horizon model-predictive control, using workload forecasts to preemptively allocate setpoints for supply temperatures and flows, integrating constraints on rack inlet, GPU, and exhaust temperatures (Abera et al., 13 Jan 2026).
  • Reinforcement Learning: Multi-agent deep RL learns policies for liquid cooling towers, CDU loops, and IT cabinet setpoints, balancing multiple objectives (energy, thermal compliance) under stochastically varying workloads and weather patterns (Naug et al., 31 Oct 2025).
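The receding-horizon idea in the MPC bullet above can be sketched with a brute-force controller. At each control interval it enumerates candidate airflow sequences over a short workload forecast, discards any that violate the temperature limit, and applies only the first move of the cheapest one. This is an illustrative toy and not the controller of Abera et al.: the single-rack model, the cubic fan law, the setpoint set, and every parameter value are assumptions.

```python
from itertools import product

RHO_CP = 1.2 * 1005.0          # air density times specific heat, J/(m^3 K)
C_TH = 5.0e4                   # assumed rack thermal capacitance, J/K
DT = 60.0                      # control interval, s
THETA_COLD = 18.0              # supply air temperature, deg C
THETA_MAX = 30.0               # rack temperature limit, deg C
FLOWS = (0.2, 0.4, 0.6, 0.8)   # candidate airflow setpoints, m^3/s


def fan_power(flow):
    """Assumed cubic fan law, W."""
    return 500.0 * flow ** 3


def step(theta, flow, heat_w):
    """Advance the rack temperature by one control interval using 1 s Euler sub-steps."""
    for _ in range(int(DT)):
        theta += (flow * RHO_CP * (THETA_COLD - theta) + heat_w) / C_TH
    return theta


def mpc_action(theta, heat_forecast):
    """Return the first flow of the cheapest sequence that keeps theta within the limit."""
    best_cost, best_flow = None, max(FLOWS)   # fall back to maximum cooling
    for seq in product(FLOWS, repeat=len(heat_forecast)):
        t, cost, ok = theta, 0.0, True
        for flow, heat in zip(seq, heat_forecast):
            t = step(t, flow, heat)
            cost += fan_power(flow) * DT      # fan energy over the interval, J
            if t > THETA_MAX:
                ok = False
                break
        if ok and (best_cost is None or cost < best_cost):
            best_cost, best_flow = cost, seq[0]
    return best_flow


def run_closed_loop(theta0, heat_trace, horizon=3):
    """Apply the MPC move at every interval; returns a list of (flow, temperature)."""
    theta, log = theta0, []
    for k in range(len(heat_trace)):
        flow = mpc_action(theta, heat_trace[k:k + horizon])
        theta = step(theta, flow, heat_trace[k])
        log.append((flow, theta))
    return log


# Illustrative workload: five low-load intervals followed by five high-load intervals.
log = run_closed_loop(22.0, [2000.0] * 5 + [6000.0] * 5)
```

Enumeration is only viable for a handful of setpoints and a short horizon. Practical controllers replace it with a convex or mixed-integer solver and must also handle forecast error, which this sketch ignores by using the true workload as its forecast.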

5. Empirical Validation, Case Studies, and Performance Metrics

Comprehensive empirical validation quantifies joint cooling–computing frameworks' fidelity and operational gains.

| Metric | Surrogate Models (Sarkar et al., 13 Nov 2025) | RL-Controlled Liquid Cooling (Naug et al., 31 Oct 2025) | Metaheuristics (Arroba et al., 2023) |
|---|---|---|---|
| Inference latency | 0.07–0.17 s (surrogate) | ~real-time (FMU simulation speed) | 1.7–3.1 s per 300 s slot (BFD) |
| MSE / MAE (°C) | MSE: ~0.0036–0.0048, MAE: 1.89–2.60 | Thermo-fluidic state error not reported | CPU temp error: ≈0.8% |
| Energy/carbon savings | 7% (cooling/carbon) | Up to 31.2% energy, 17% mean GPU T drop | 21.74% (total energy) |
| SLA / latency | SLO maintained | SLO met, no thermal violations | SLA violation <0.014% vs baseline |

Field validation (Lei, 2020) demonstrates ≤4% PUE mean error vs. reported hyperscale data centers and models global energy footprints, while case studies in 3DIC co-design (Cool-3D) show thermal/architectural co-optimization reducing local peak temperature by up to 30 K via stack reordering and microfluidic pattern selection (Wang et al., 10 Mar 2025).

6. Application Domains and Representative Systems

Joint cooling–computing models span the data center/edge/cloud spectrum:

  • Hyperscale Cloud and AI Datacenters: Hierarchical controllers integrate GPU pool scheduling, dynamic frequency/voltage scaling, and predictive cooling MPC, empirically reducing compute-side and cooling-side energy by 24% and 31%, respectively, on real AI inference traces (Abera et al., 13 Jan 2026).
  • HPC and Liquid Cooling: Modelica-based digital twins with multi-agent RL and centralized/decentralized action spaces address ultra-high-density, fine-grained valve and setpoint management in leading supercomputers (Naug et al., 31 Oct 2025).
  • Cloud Resource Management: Metaheuristic-based workload placement algorithms co-optimize power, cooling, and migration energy, scaling to thousands of servers and VMs, achieving quantifiable PUE improvements (Arroba et al., 2023).
  • Edge Computing with Cooling-Aware Offloading: Mobile edge systems minimize coupled cooling and compute energy through latency/demand-aware dynamic partitioning of local and edge execution (Chen et al., 2020).
  • Waste-to-Energy (WtE) Coupling: Urban data corridors integrate WtE-driven absorption cooling, yielding siting and corridor-length optimization via energy-grade matching, grid relief, and ESG valuation (He et al., 31 Dec 2025).

7. Outlook: Efficiency Levers and Future Directions

The synthesized literature highlights key efficiency drivers and methodological directions:

  • Surrogate Models for Expensive Physics: Deep surrogates make joint physical-computational models fast enough for operational, closed-loop data center control, supporting real-time hot-spot prediction and dynamic workload redistribution.
  • Hierarchical Decomposition: Partitioning the joint optimization over workload, computing, and cooling variables, aligned to practical timescales and computational feasibility, improves tractability and solution quality.
  • Physical Co-Design: Stack ordering, microfluidic layout, and cache sizing can be tuned jointly at design time using co-simulation frameworks, so that thermal and power profiles are optimized together rather than in isolation.
  • Integration of Thermoeconomics and ESG Accounting: High-level input–output frameworks supply quantifiable metrics (electric/system PUE, exergy, net benefit, LCOC, methane-avoidance credits) that align physical co-optimization with economic and sustainability goals.

Overall, joint cooling and computing modeling approaches provide a rigorous, cross-layer methodology for understanding, optimizing, and controlling the coupled energy-performance-thermal landscape of modern compute infrastructures (Sarkar et al., 13 Nov 2025, Naug et al., 31 Oct 2025, Abera et al., 13 Jan 2026, Lei, 2020, Wang et al., 10 Mar 2025, He et al., 31 Dec 2025, Arroba et al., 2023, Chen et al., 2020, Rostami et al., 2023).
