
Workload Layer: the user defines target DNN models, target parallelization strategies, and training loops → essentially where the real work is described
System Layer: implements collective communication algorithms, schedules compute/communication operations, and manages compute-communication overlap → essentially where scheduling of operations is done
Network API: communication times are computed using either analytical models or event-driven network simulators (e.g., NS-3).
Within this code, the following init() is called. The EventType::StreamInit event is passed to the simulation instance's configured collective communication algorithm (e.g., HalvingDoubling).
void StreamBaseline::init() {
  initialized = true;
  last_init = Sys::boostedTick();
  // Skip streams whose current phase is disabled.
  if (!my_current_phase.enabled) {
    return;
  }
  // Kick off the collective algorithm for this phase.
  my_current_phase.algorithm->run(EventType::StreamInit, nullptr);
  if (steps_finished == 1) {
    // First phase only: time spent waiting since stream creation.
    queuing_delay.push_back(last_phase_change - creation_time);
  }
  // Time spent between the last phase change and this init.
  queuing_delay.push_back(Sys::boostedTick() - last_phase_change);
  total_packets_sent = 1;
}
After initialization with EventType::StreamInit, the scheduler calls run() again with EventType::General, which prepares the stream->owner->front_end_sim_send() and stream->owner->front_end_sim_recv() functions. The following code within HalvingDoubling starts the calls to sim_send or sim_recv:
void HalvingDoubling::run(EventType event, CallData* data) {
  if (event == EventType::General) {
    // A packet slot has freed up; try to advance the algorithm.
    free_packets += 1;
    ready();
    iteratable();
  } else if (event == EventType::PacketReceived) {
    total_packets_received++;
    insert_packet(nullptr);
  } else if (event == EventType::StreamInit) {
    // Seed the pipeline with parallel_reduce in-flight packets.
    for (int i = 0; i < parallel_reduce; i++) {
      insert_packet(nullptr);
    }
  }
}
Some good resources: https://docs.google.com/document/d/14T4fAQe4d9dPq7dZEoEQ_dF6kSq0FaGlFlZLZdSSfx0/