Purpose: Datasets and ML models have grown, and training is now distributed across more ranks. As a result, collective communication operations incur significant latency, which creates a need to test different collective algorithms and network topologies on such clusters in order to optimize them.
Problem Statement:
Goal: The framework takes as input an unmodified ML workload written with a library such as PyTorch or TensorFlow and produces traffic patterns in formats compatible with various network simulators. Because the traffic patterns are derived directly from the input workload, they are intended to be highly accurate, which allows fast iteration on network stack design without access to an actual GPU cluster.
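To make "traffic pattern" concrete, here is a minimal sketch of how a single collective call could be expanded into point-to-point flows that a network simulator might replay. It is not the framework's implementation: the `Flow` record, the `ring_allreduce_flows` helper, and the printed line format are all hypothetical names chosen for illustration, and it assumes the textbook ring all-reduce (N-1 reduce-scatter steps followed by N-1 all-gather steps, each moving one chunk of roughly 1/N of the message to the next rank).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical record for one point-to-point transfer."""
    step: int        # ring step index; flows in the same step run concurrently
    src: int         # sending rank
    dst: int         # receiving rank
    size_bytes: int  # payload carried by this flow


def ring_allreduce_flows(num_ranks: int, message_bytes: int) -> list[Flow]:
    """Expand one ring all-reduce into per-step point-to-point flows."""
    if num_ranks < 2:
        return []  # a single rank has nothing to exchange
    # Each step moves one chunk of about 1/N of the message; round up so
    # every byte is covered when the size is not divisible by N.
    chunk = -(-message_bytes // num_ranks)
    flows = []
    # N-1 reduce-scatter steps followed by N-1 all-gather steps.
    for step in range(2 * (num_ranks - 1)):
        for src in range(num_ranks):
            flows.append(Flow(step, src, (src + 1) % num_ranks, chunk))
    return flows


if __name__ == "__main__":
    for flow in ring_allreduce_flows(num_ranks=4, message_bytes=4096):
        # Hypothetical line format; each simulator defines its own input files.
        print(flow.step, flow.src, flow.dst, flow.size_bytes)
```

In this sketch, swapping the algorithm (for example a tree or halving-doubling all-reduce) only changes how the flow list is generated. The workload itself stays untouched, which is the kind of algorithm and topology comparison described in the Purpose above.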
Wiki Notes
Part 1 - Learning Recent Trends
Part 2 - Finding the Problem of Current Frameworks
Part 3 - ns-3 Implementation of Collective Operations
Noteworthy Pages:
3 - Interface of ASTRA-sim+ns-3
4 - Scheduling in ASTRA-sim + NS-3
Responsibilities: