Background

Aug 25 - Collective Operations

Aug 27, 28 - Distributed Training Basics

Related Works

Aug 22 - Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Aug 23 - ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

Aug 24 - Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Aug 27, 28 - Our Short Paper: FENNEC

Aug 29, 30 - Impact of RoCE Congestion Control Policies on Distributed Training of DNNs

Sept 2 - Towards a Standardized Representation for Deep Learning Collective Algorithms

Important References

https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/ (Aug 23)

https://www.youtube.com/watch?v=1lhrGRqqPWU (Aug 24)

RDMA over Ethernet for Distributed Training at Meta Scale | Proceedings of the ACM SIGCOMM 2024 Conference

Discusses state-of-the-art trends in cluster architecture, highlighting the importance of hardware-software co-design for distributed systems. To build an efficient training system (one that minimizes power usage and latency while maximizing performance), we must understand the network topology, hardware specifications, and model parameters to make informed decisions in the training design.
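As a concrete illustration of how topology, hardware, and model parameters interact, here is a back-of-envelope sketch of gradient-sync cost using the standard ring all-reduce traffic formula, 2(N-1)/N times the gradient volume per link. The function name and all numeric inputs are illustrative assumptions, not taken from the referenced papers; real systems add latency terms and compute-communication overlap.

```python
def ring_allreduce_time(model_params: int, bytes_per_param: int,
                        num_gpus: int, link_bandwidth_gbps: float) -> float:
    """Rough lower bound on one ring all-reduce, in seconds.

    Ring all-reduce sends 2*(N-1)/N of the gradient volume across
    each link; this ignores latency and pipelining effects.
    """
    msg_bytes = model_params * bytes_per_param
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * msg_bytes
    bandwidth_bytes_per_s = link_bandwidth_gbps * 1e9 / 8
    return traffic_bytes / bandwidth_bytes_per_s

# Hypothetical setup: 7B-parameter model, fp16 gradients,
# 8 GPUs on 400 Gbps links.
t = ring_allreduce_time(7_000_000_000, 2, 8, 400.0)
print(f"~{t:.2f} s per gradient all-reduce")
```

Even this crude estimate shows why the three inputs above matter: doubling link bandwidth or halving gradient precision directly halves the communication floor, which is the kind of trade-off the co-design work in these papers tries to quantify.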

Part 1 - Learning Recent Trends