Papers
Topics
Authors
Recent
Search
2000 character limit reached

DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems

Published 9 Oct 2025 in cs.LG and cs.DC | (2510.08522v1)

Abstract: Existing batch size selection approaches in dis- tributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequen- tial decision-making problem using Proximal Policy Optimiza- tion (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. Our approach eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse work- loads, hardware configurations, and network conditions, DY- NAMIX achieves up to 6.3% improvement in the final model accuracy and 46% reduction in the total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.

Summary

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (3)

Collections

Sign up for free to add this paper to one or more collections.