ASTRA-sim 2.0:

Goal: model the complete SW/HW co-design stack of distributed training systems

  1. Workload layer: implements the training loop (DNN model and parallelization strategy)
  2. System layer: provides implementations of various collective communication algorithms
  3. Network layer: models the HW/SW components of the network and simulates the traffic issued by the system layer
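
The layering above can be illustrated with a minimal sketch. This is not ASTRA-sim's actual code or API: the class names (`WorkloadLayer`, `SystemLayer`, `NetworkLayer`), the uniform latency/bandwidth link model, and the choice of a ring All-Reduce are all assumptions made for illustration. It only shows how each layer hands work to the one below it.

```python
from dataclasses import dataclass


@dataclass
class Message:
    src: int
    dst: int
    size_bytes: float


class NetworkLayer:
    """Toy network model (assumption): every link has the same latency and bandwidth."""

    def __init__(self, latency_s: float, bandwidth_bytes_per_s: float):
        self.latency_s = latency_s
        self.bandwidth = bandwidth_bytes_per_s

    def send_time(self, msg: Message) -> float:
        # Time for one point-to-point message: fixed latency plus serialization time.
        return self.latency_s + msg.size_bytes / self.bandwidth


class SystemLayer:
    """Breaks a collective into point-to-point messages (ring All-Reduce, as an example)."""

    def __init__(self, network: NetworkLayer, num_npus: int):
        self.network = network
        self.num_npus = num_npus

    def all_reduce_time(self, size_bytes: float) -> float:
        n = self.num_npus
        chunk = size_bytes / n
        total = 0.0
        # Ring All-Reduce takes 2 * (n - 1) steps; in each step every NPU
        # sends one chunk to its ring neighbor, and all sends run in parallel.
        for _ in range(2 * (n - 1)):
            msgs = [Message(i, (i + 1) % n, chunk) for i in range(n)]
            total += max(self.network.send_time(m) for m in msgs)
        return total


class WorkloadLayer:
    """Training loop (assumption): compute, then All-Reduce the gradients (data parallelism)."""

    def __init__(self, system: SystemLayer, compute_time_s: float, gradient_bytes: float):
        self.system = system
        self.compute_time_s = compute_time_s
        self.gradient_bytes = gradient_bytes

    def run(self, iterations: int) -> float:
        total = 0.0
        for _ in range(iterations):
            total += self.compute_time_s
            total += self.system.all_reduce_time(self.gradient_bytes)
        return total


network = NetworkLayer(latency_s=0.0, bandwidth_bytes_per_s=1e9)
system = SystemLayer(network, num_npus=4)
workload = WorkloadLayer(system, compute_time_s=1.0, gradient_bytes=4e9)
print(workload.run(iterations=2))
```

In this sketch the workload layer never sees individual messages and the network layer never sees the training loop, which is the separation of concerns the three layers describe.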

[Figure: SW/HW co-design stack of distributed training]

Limitations of the original ASTRA-sim (addressed by ASTRA-sim 2.0):

  1. Software techniques keep evolving
    1. ASTRA-sim cannot express new parallelism strategies
    2. Many new parallelism strategies have emerged (3D parallelism, FSDP, ZeRO, …)
  2. Hardware platforms keep evolving
    1. ASTRA-sim’s gem5 network layer has limitations in modeling certain platforms
    2. New platforms interconnect NPUs with multi-dimensional network topologies that have hierarchical bandwidths
    3. Naive approaches:
      1. Increase the aggregate bandwidth → fundamentally limited by technology
      2. Use NICs to scale out, because of dollar-cost, power, and thermal constraints
  3. Memory disaggregation is becoming important
    1. ASTRA-sim models memory with a single bandwidth number, which cannot capture new architectures
    2. Example: CXL (Compute Express Link) lets GPUs access a larger remote memory pool
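
To make the memory limitation concrete, the sketch below contrasts a single-bandwidth memory model with a two-tier (local plus remote pool) model. It is an illustration only, not ASTRA-sim's memory API: the names `MemoryTier`, `single_bw_access_time`, and `tiered_access_time`, the assumption that local and remote accesses do not overlap, and all latency and bandwidth values are made up.

```python
from dataclasses import dataclass


def single_bw_access_time(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Memory modeled as one bandwidth number: time = size / bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


@dataclass
class MemoryTier:
    latency_s: float
    bandwidth_bytes_per_s: float

    def access_time(self, size_bytes: float) -> float:
        return self.latency_s + size_bytes / self.bandwidth_bytes_per_s


def tiered_access_time(size_bytes: float, local_fraction: float,
                       local: MemoryTier, remote: MemoryTier) -> float:
    """Assumption: the local and remote portions are fetched back to back, with no overlap."""
    local_bytes = size_bytes * local_fraction
    remote_bytes = size_bytes - local_bytes
    total = 0.0
    if local_bytes > 0:
        total += local.access_time(local_bytes)
    if remote_bytes > 0:
        total += remote.access_time(remote_bytes)
    return total


# Placeholder numbers, chosen only to show the gap between the two models.
local_mem = MemoryTier(latency_s=0.0, bandwidth_bytes_per_s=100.0)
remote_pool = MemoryTier(latency_s=1.0, bandwidth_bytes_per_s=10.0)

print(single_bw_access_time(100.0, 100.0))
print(tiered_access_time(100.0, 0.5, local_mem, remote_pool))
```

With these placeholder values, the single-number model reports the same time no matter where the data lives, while the tiered model grows as more of the data sits in the remote pool. That difference is what a single bandwidth number cannot express.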

Background (Distributed Training)

Training Types:

  1. Asynchronous training: NPUs exchange updates asynchronously, without waiting for each other
    1. Suffers from convergence problems
  2. Synchronous training: each NPU works independently, then all NPUs synchronize before proceeding to the next iteration
    1. Synchronization is done with collective communication
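
As an illustration of the synchronization step in synchronous training, here is a minimal single-process sketch of a ring All-Reduce (reduce-scatter followed by all-gather) used to average gradients in data-parallel training. The function names `ring_all_reduce` and `synchronous_step` are hypothetical, the workers are plain Python lists rather than real NPUs, and the ring algorithm is just one common choice of collective, not necessarily the one a given system uses.

```python
def ring_all_reduce(buffers):
    """Sum equal-length vectors held by n workers; returns one result list per worker."""
    n = len(buffers)
    length = len(buffers[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the number of workers")
    chunk = length // n
    data = [list(b) for b in buffers]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: in step s, worker i sends chunk (i - s) % n to its
    # neighbor (i + 1) % n, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, data[i][span((i - s) % n)]) for i in range(n)]
        for i, c, payload in sends:
            dst = (i + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Worker i now holds the fully summed chunk (i + 1) % n.
    # All-gather: circulate the finished chunks around the ring, overwriting.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, data[i][span((i + 1 - s) % n)]) for i in range(n)]
        for i, c, payload in sends:
            data[(i + 1) % n][span(c)] = payload

    return data


def synchronous_step(weights, local_grads, lr):
    """One synchronous data-parallel update: average gradients, then apply them."""
    n = len(local_grads)
    summed = ring_all_reduce(local_grads)
    # Every worker ends up with the same summed gradient, so worker 0's copy suffices here.
    avg = [g / n for g in summed[0]]
    return [w - lr * g for w, g in zip(weights, avg)]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(ring_all_reduce(grads))
print(synchronous_step([1.0, 1.0], [[2.0, 4.0], [4.0, 8.0]], lr=0.5))
```

Because every worker finishes the All-Reduce with an identical gradient, all replicas apply the same update and stay in lockstep, which is the property that distinguishes synchronous from asynchronous training.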