
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation

Published 1 Mar 2024 in cs.LG, cs.DC, and cs.IR | (2403.00877v3)

Abstract: We study a mismatch between deep learning recommendation models' flat architecture, the common distributed training paradigm, and the hierarchical data center topology. To address the associated inefficiencies, we propose Disaggregated Multi-Tower (DMT), a modeling technique that consists of (1) Semantic-preserving Tower Transform (SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjoint towers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to each tower to reduce model complexity and communication volume through hierarchical feature interaction; and (3) Tower Partitioner (TP), a feature partitioner that uses learned embeddings to systematically create towers with meaningful feature interactions and load-balanced assignments, preserving model quality and training throughput. We show that DMT can achieve up to 1.9x speedup over state-of-the-art baselines without losing accuracy, across multiple generations of hardware at large data center scale.
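The abstract's tower decomposition can be pictured with a minimal PyTorch sketch (not the authors' implementation): sparse features are partitioned into disjoint towers, each tower looks up only its own embedding tables, and a small dense Tower Module performs local feature interaction so that only a compressed activation needs to cross the network. Table sizes, the 4+4 feature partition, and the single-linear-layer interaction below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One tower: a disjoint subset of embedding tables plus a local dense module."""
    def __init__(self, num_embeddings, embed_dim, num_tables, out_dim):
        super().__init__()
        # Embedding tables assigned to this tower (by the partitioner, in the paper's terms).
        self.tables = nn.ModuleList(
            [nn.Embedding(num_embeddings, embed_dim) for _ in range(num_tables)]
        )
        # Stand-in for the Tower Module: local interaction that compresses the
        # tower's lookups to out_dim, shrinking cross-tower communication volume.
        self.interaction = nn.Sequential(
            nn.Linear(num_tables * embed_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, ids):
        # ids: (batch, num_tables) sparse feature ids belonging to this tower.
        looked_up = [table(ids[:, i]) for i, table in enumerate(self.tables)]
        return self.interaction(torch.cat(looked_up, dim=-1))

# Toy partition of 8 sparse features into two towers of 4 features each.
towers = nn.ModuleList([Tower(1000, 16, 4, 32) for _ in range(2)])
batch = torch.randint(0, 1000, (8, 8))            # (batch, total sparse features)
per_tower = [towers[0](batch[:, :4]), towers[1](batch[:, 4:])]
# Only the compressed (batch, 32) tower outputs would cross the network,
# rather than every raw embedding from a monolithic global lookup.
dense_input = torch.cat(per_tower, dim=-1)
print(dense_input.shape)                          # torch.Size([8, 64])
```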
