Purpose: Data and ML model sizes have grown, and training is now distributed across more ranks. As a result, collective communication operations incur significant latency, which motivates testing different collective algorithms and network topologies on such clusters for optimization.

Problem Statement:

  1. State-of-the-art (SOTA) distributed training simulators require direct access to such large clusters to obtain execution traces
  2. Implementing new collective algorithms in these simulators involves significant complexity
  3. These simulators do not accurately represent the network events that occur during distributed training, because they rely on extreme simplifications (analytical models) or omit details of the real network; see the analytical-model sketch after this list
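
For context, an analytical model estimates a collective's latency from only a per-message startup cost (alpha) and a per-byte transfer cost (beta), abstracting away congestion, routing, and packet-level behavior entirely. Below is a minimal sketch of such an estimate using the standard ring all-reduce formula; the parameter values are hypothetical and chosen purely for illustration.

```python
# Minimal sketch of an alpha-beta analytical estimate for ring all-reduce.
# Parameter values are hypothetical; congestion, in-network contention, and
# packet-level transport effects are not modeled at all.

def ring_allreduce_time(num_ranks: int, msg_bytes: int, alpha: float, beta: float) -> float:
    """Classic estimate: 2*(p-1) startup terms plus 2*(p-1)/p of the data per rank."""
    p = num_ranks
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * msg_bytes * beta

# Example: 64 ranks, 1 GiB of gradients, 5 us startup cost, 100 Gbps links.
estimate = ring_allreduce_time(num_ranks=64, msg_bytes=1 << 30,
                               alpha=5e-6, beta=8 / 100e9)
print(f"analytical estimate: {estimate * 1e3:.2f} ms")
```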

Goal: The framework will take as input an unmodified ML workload written in a library such as PyTorch or TensorFlow and produce traffic patterns compatible with various network simulators. This allows highly accurate traffic patterns to be generated from the input workload, enabling agile iteration on network stack designs without the need for actual GPU clusters.
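
As a rough illustration of the "unmodified workload in" idea, the sketch below records an execution trace from an ordinary PyTorch training loop using torch.profiler. The toy model, the tensor sizes, and the step of converting the exported trace into simulator traffic patterns are illustrative assumptions, not the framework's actual pipeline.

```python
# Sketch: capture an execution trace from an unmodified PyTorch workload.
# The toy model and the post-processing step that would turn trace events
# into simulator traffic patterns are illustrative assumptions.
import torch
import torch.nn as nn
import torch.profiler as profiler

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(32, 1024)
target = torch.randn(32, 1024)

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU],  # add CUDA when GPUs are available
    record_shapes=True,
) as prof:
    for _ in range(3):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(data), target)
        loss.backward()
        opt.step()

# Chrome-trace JSON; a separate converter would map operator and communication
# events from a distributed run into simulator-readable traffic patterns.
prof.export_chrome_trace("trace.json")
```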


* August ~ September: Motivational Study of ASTRA-sim

Wiki Notes

Part 1 - Learning Recent Trends

Part 2 - Finding the Problem of Current Frameworks

Part 3 - ns-3 Implementation of Collective Operations

Noteworthy Pages:

3 - Interface of ASTRA-sim+ns-3

4 - Scheduling in ASTRA-sim + ns-3

* October ~ December: Motivational Experiments

Responsibilities:

  1. Implement a fat-tree topology in ns-3 as our test bed to validate network-accurate, packet-level flow patterns (a topology-generation sketch follows below)
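
A k-ary fat-tree has k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches and k^3/4 hosts. Before wiring this up in ns-3, a small sketch like the one below can enumerate the links and sanity-check the counts; the node naming and edge-list output are hypothetical and are not ASTRA-sim's or ns-3's own configuration format.

```python
# Sketch: enumerate the links of a k-ary fat-tree to sanity-check the topology
# before building it in ns-3. Node names and the edge-list format are
# hypothetical, not an actual ns-3 or ASTRA-sim configuration format.

def fat_tree_links(k: int):
    """Return (node_a, node_b) pairs for a k-ary fat-tree (k must be even)."""
    assert k % 2 == 0, "k-ary fat-tree requires an even k"
    half = k // 2
    links = []
    for pod in range(k):
        for e in range(half):
            edge = f"edge_{pod}_{e}"
            # k/2 hosts hang off each edge switch.
            for h in range(half):
                links.append((f"host_{pod}_{e}_{h}", edge))
            # Each edge switch connects to every aggregation switch in its pod.
            for a in range(half):
                links.append((edge, f"agg_{pod}_{a}"))
        # Aggregation switch a connects to the a-th group of core switches.
        for a in range(half):
            for c in range(half):
                links.append((f"agg_{pod}_{a}", f"core_{a}_{c}"))
    return links

links = fat_tree_links(4)
print(len(links))  # k=4: 16 host links + 16 edge-agg links + 16 agg-core links = 48
for a, b in links[:4]:
    print(a, "<->", b)
```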