
Mesh-TensorFlow: Deep Learning for Supercomputers

Published 5 Nov 2018 in cs.LG, cs.DC, and stat.ML | arXiv:1811.02084v1

Abstract: Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-Tensorflow is available at https://github.com/tensorflow/mesh .

Citations (364)

Summary

  • The paper introduces Mesh-TensorFlow, which overcomes the limits of data-parallelism by enabling flexible model-parallelism across multi-dimensional processor meshes.
  • It demonstrates efficient scaling of models up to 5 billion parameters on TPU clusters, achieving state-of-the-art results on translation and language modeling tasks.
  • The framework gives users precise control over tensor splitting, reducing memory bottlenecks and latency while distributing computation efficiently.


The paper "Mesh-TensorFlow: Deep Learning for Supercomputers" introduces a framework that addresses the limitations of data-parallel training systems in deep learning. The authors, affiliated with Google Brain, propose Mesh-TensorFlow as a language for specifying efficient distributed tensor computations, generalizing data-parallelism to the broader class of strategies known as model-parallelism.

Core Contributions

Mesh-TensorFlow extends beyond the limitations of traditional data-parallelism by allowing any tensor dimension to be split across any dimension of a multi-dimensional processor mesh. This capability lets practitioners optimize model training on large-scale clusters, such as TPU meshes with hundreds of cores. By specifying which dimensions are split, users gain fine-grained control over how computation is distributed, expressing data-parallel, model-parallel, and hybrid strategies in a single framework. This flexibility can resolve common issues such as memory bottlenecks, high latency, and inefficiency at small batch sizes.
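Conceptually, a layout maps named tensor dimensions to mesh axes, and the shape of each processor's shard follows directly from that mapping. The sketch below illustrates the idea in plain Python; it is not the actual Mesh-TensorFlow API (which uses `mtf.Dimension`, `mtf.Mesh`, and layout rules), just a minimal model of the shard-shape arithmetic:

```python
def shard_shape(tensor_dims, mesh_dims, layout):
    """Compute the per-processor slice shape of a named tensor.

    tensor_dims: dict mapping tensor dimension name -> size
    mesh_dims:   dict mapping mesh axis name -> number of processors
    layout:      dict mapping tensor dimension name -> mesh axis name;
                 dimensions absent from the layout are replicated.
    """
    shape = {}
    for name, size in tensor_dims.items():
        axis = layout.get(name)
        if axis is None:
            shape[name] = size  # unsplit: every processor holds the full dimension
        else:
            assert size % mesh_dims[axis] == 0, "dimension must divide the mesh axis"
            shape[name] = size // mesh_dims[axis]
    return shape

activations = {"batch": 1024, "hidden": 4096}
mesh = {"rows": 8, "cols": 4}

# Pure data-parallelism: only the batch dimension is split.
print(shard_shape(activations, mesh, {"batch": "rows"}))
# -> {'batch': 128, 'hidden': 4096}

# Combined data- and model-parallelism: batch and hidden split on different axes.
print(shard_shape(activations, mesh, {"batch": "rows", "hidden": "cols"}))
# -> {'batch': 128, 'hidden': 1024}
```

Splitting different tensor dimensions along different mesh axes is what lets one layout express data-parallelism, model-parallelism, or both at once.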

Implementation and Results

The paper details how Mesh-TensorFlow was used to build an efficient distributed version of the Transformer model. The implementation on TPU clusters with up to 512 cores scaled to models with up to 5 billion parameters, yielding state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Because Mesh-TensorFlow can split batch and model dimensions simultaneously, available compute is used efficiently without a proportional increase in computational overhead or per-processor memory.
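To see why simultaneous batch and model splitting keeps per-processor memory flat as the model grows, consider a single feed-forward weight matrix. The numbers and the simple accounting below are illustrative assumptions, not figures from the paper:

```python
def per_core_bytes(d_model, d_ff, batch, mesh_batch, mesh_model, bytes_per_value=4):
    """Approximate bytes held per core for one feed-forward layer on a 2-D mesh.

    The (d_model x d_ff) weight matrix is split along d_ff across the
    model-parallel axis; (batch x d_model) activations are split along
    batch across the data-parallel axis.
    """
    weight_bytes = d_model * (d_ff // mesh_model) * bytes_per_value
    activation_bytes = (batch // mesh_batch) * d_model * bytes_per_value
    return weight_bytes + activation_bytes

# Grow the feed-forward layer 4x while growing the model-parallel axis 4x:
small = per_core_bytes(d_model=1024, d_ff=4096,  batch=256, mesh_batch=8, mesh_model=4)
large = per_core_bytes(d_model=1024, d_ff=16384, batch=256, mesh_batch=8, mesh_model=16)
print(small == large)  # -> True: per-core memory is unchanged
```

Scaling the model and the mesh together in this way is what allowed the authors to reach billions of parameters without exhausting the memory of any single core.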

Implications and Future Directions

The implications for deploying large neural network models on supercomputers are significant, enabling more complex and capable models that were previously constrained by hardware limitations. Practically, this approach offers a path forward for training expansive models across distributed systems, breaking away from the strict confines of batch-splitting.

Theoretically, this suggests that further exploration into flexible tensor dimension manipulation could yield additional optimizations in distributed computing environments. As neural networks become increasingly complex, tools like Mesh-TensorFlow will likely become indispensable, especially for tasks demanding extensive computational resources.

Future research could focus on automating layout optimizations within the framework, ensuring optimal deployment strategies for various models automatically. Expanding Mesh-TensorFlow's applicability to CPU/GPU clusters might also democratize its use beyond specialized hardware like TPUs, enhancing versatility across different computational infrastructures.

Conclusion

Mesh-TensorFlow addresses critical barriers in distributed model training, providing a robust architecture well suited to supercomputing environments. By enabling precise control over tensor distribution, it represents a significant step forward in the efficient training of large-scale neural networks, potentially paving the way for further advances in artificial intelligence and machine learning. Its open-source release invites future research and application development in scalable deep learning.
