SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling

Published 4 Mar 2025 in cs.DC | (2503.02550v3)

Abstract: Deep Learning (DL), especially with LLMs, brings benefits to various areas. However, DL training systems usually yield prominent idling GPU resources due to many factors, such as resource allocation and collective communication. To improve GPU utilization, we present SpecInF, which adopts a Speculative Inference Filling method to exploit idle GPU resources. It collocates each primary training instance with additional inference instances on the same GPU, detects the training bubbles and adaptively fills with online or offline inference workloads. Our results show that SpecInF can effectively enhance GPU utilization under mainstream parallel training modes, delivering additional up to 14$\times$ offline inference throughputs than TGS and 67\% reduction in online inference p95 latency than MPS, while guaranteeing collocated training throughput.