Aug 25 - Collective Operations
Aug 27, 28 - Distributed Training Basics
Aug 24 - Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks
Aug 27, 28 - Our Short Paper: FENNEC
Aug 29, 30 - Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
Sept 2 - Towards a Standardized Representation for Deep Learning Collective Algorithms
https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/ (Aug 23)
https://www.youtube.com/watch?v=1lhrGRqqPWU (Aug 24)
Covers state-of-the-art trends in cluster architecture and highlights the importance of hardware-software co-design for distributed systems. To build an efficient training system (one that minimizes power usage and latency while maximizing performance), we must understand the network topology, hardware specifications, and model parameters well enough to make accurate training-design decisions.
Part 1 - Learning Recent Trends