3-1. Basic Structure of ASTRA-sim


Workload Layer: the user defines the target DNN models, the parallelization strategies, and the training loops → essentially where the work to be simulated is specified

System Layer: implements collective communication algorithms, schedules compute and communication operations, and manages compute-communication overlap → essentially where operations are scheduled

Network API: communication times are computed using either an analytical model or an event-driven simulator (NS-3).
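
To make the "analytical" option concrete, here is a minimal sketch of how a communication time could be estimated without an event-driven simulator. It assumes a ring all-reduce and a simple latency + size/bandwidth cost per step; the function name and parameters are hypothetical and are not taken from ASTRA-sim's code.

```python
def ring_all_reduce_time(num_nodes: int, message_bytes: float,
                         latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Estimate ring all-reduce time with a latency + size/bandwidth model.

    Assumption: every step costs latency_s plus the chunk transfer time,
    and all links have the same bandwidth.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes == 1:
        return 0.0  # a single node has nothing to exchange
    # Reduce-scatter and all-gather each take (num_nodes - 1) steps.
    steps = 2 * (num_nodes - 1)
    # In each step, every node sends one chunk of the message to its neighbor.
    chunk_bytes = message_bytes / num_nodes
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


# Example: 8 nodes, 1 GB message, 1 us link latency, 100 GB/s links.
estimate = ring_all_reduce_time(8, 1e9, 1e-6, 100e9)
print(f"estimated all-reduce time: {estimate * 1e3:.3f} ms")
```

An event-driven simulator such as NS-3 replaces this closed-form estimate with packet-level events, which is slower but can capture effects such as congestion that a formula like this ignores.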


3-2. Inside the Code

3-2-1. Basic Flow of ASTRA-sim + NS-3

Workload Layer:

  1. initialize the system and network layers for each node
  2. initialize the NS-3 simulator

Workload to System Layer:

  1. call fire() on the workloads
  2. each workload begins parsing the Chakra ET (execution trace) with ET_Feeder() from the Chakra module
  3. the workload iterates through the operation nodes in the Chakra ET file
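
The iteration in step 3 can be pictured with a small sketch. It assumes that each node is issued only after the nodes it depends on, and it uses a plain dictionary as a stand-in for the trace. The layout (`type`, `deps`) and the function name are hypothetical; this is not the Chakra ET_Feeder API.

```python
from collections import deque


def iterate_trace(nodes):
    """Return node ids in an order that respects their dependencies.

    `nodes` maps a node id to a dict with a 'deps' list (hypothetical layout).
    """
    remaining = {}                      # unmet dependency count per node
    children = {nid: [] for nid in nodes}
    for nid, info in nodes.items():
        deps = set(info["deps"])
        for dep in deps:
            if dep not in nodes:
                raise ValueError(f"node {nid} depends on unknown node {dep}")
            children[dep].append(nid)
        remaining[nid] = len(deps)

    # Nodes without dependencies can be issued immediately.
    ready = deque(nid for nid in nodes if remaining[nid] == 0)
    issued = []
    while ready:
        nid = ready.popleft()
        issued.append(nid)
        # Completing a node may release the nodes that were waiting on it.
        for child in children[nid]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(issued) != len(nodes):
        raise ValueError("trace contains a dependency cycle")
    return issued


# Toy trace: two operations depend on node 0, and node 3 waits for both.
trace = {
    0: {"type": "COMP", "deps": []},
    1: {"type": "COMM", "deps": [0]},
    2: {"type": "COMP", "deps": [0]},
    3: {"type": "COMM", "deps": [1, 2]},
}
print(iterate_trace(trace))
```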

System to Network Layer:

  1. if the node is of COMM type and is not a CPU operation: issue_comm() calls the generate function for the node's collective communication type, which in turn calls generate_collective()
  2. generate_collective():
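
The dispatch decision in step 1 can be sketched roughly as follows. This is a toy model of the control flow only: the node fields (`type`, `is_cpu_op`, `comm_type`, `size_bytes`), the per-type function names, and the stand-in body of generate_collective() are assumptions made for illustration and do not reproduce the ASTRA-sim implementation.

```python
def generate_collective(kind, size_bytes):
    # Stand-in for the shared entry point; here it only records the request.
    return {"collective": kind, "bytes": size_bytes}


def generate_all_reduce(size_bytes):
    # Per-type wrapper that forwards to the shared entry point.
    return generate_collective("ALL_REDUCE", size_bytes)


def generate_all_gather(size_bytes):
    return generate_collective("ALL_GATHER", size_bytes)


# Hypothetical table mapping a collective type to its generate function.
GENERATORS = {
    "ALL_REDUCE": generate_all_reduce,
    "ALL_GATHER": generate_all_gather,
}


def issue_comm(node):
    """Pick the generate function that matches the node's collective type."""
    comm_type = node["comm_type"]
    if comm_type not in GENERATORS:
        raise ValueError(f"unsupported collective type: {comm_type}")
    return GENERATORS[comm_type](node["size_bytes"])


def issue(node):
    """Route a trace node: only non-CPU COMM nodes reach issue_comm()."""
    if node["type"] == "COMM" and not node["is_cpu_op"]:
        return issue_comm(node)
    return None  # other node kinds are outside the scope of this sketch


request = issue({"type": "COMM", "is_cpu_op": False,
                 "comm_type": "ALL_REDUCE", "size_bytes": 1024})
print(request)
```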

Some good resources:  https://docs.google.com/document/d/14T4fAQe4d9dPq7dZEoEQ_dF6kSq0FaGlFlZLZdSSfx0/