ASTRA-sim 2.0:
Goal: model the complete SW/HW co-design stack of distributed training systems
- Workload layer implements the training loop (DNN model; parallelization strategy)
- System layer provides implementations of various collective communication algorithms
- Network layer models the HW/SW components of the network and simulates the traffic issued by the system layer
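The layering above can be sketched as three cooperating models: the workload drives iterations, the system turns gradient synchronization into collective traffic, and the network prices each transfer. This is an illustrative analytical sketch, not the actual ASTRA-sim API; all class names, the ring all-reduce cost formula, and the example numbers are assumptions for exposition.

```python
# Hypothetical sketch (NOT the real ASTRA-sim API) of how the three
# layers could hand work to each other in an analytical cost model.

class NetworkLayer:
    """Models link bandwidth/latency; prices traffic from the system layer."""
    def __init__(self, bandwidth_gbps, latency_us):
        self.bandwidth_gbps = bandwidth_gbps  # GB/s
        self.latency_us = latency_us

    def transfer_time_us(self, num_bytes):
        # time = latency + size / bandwidth; 1 GB/s = 1e3 bytes/us
        return self.latency_us + num_bytes / (self.bandwidth_gbps * 1e3)

class SystemLayer:
    """Chooses a collective algorithm and issues point-to-point transfers."""
    def __init__(self, network, num_npus):
        self.network = network
        self.num_npus = num_npus

    def all_reduce_time_us(self, num_bytes):
        # Textbook ring all-reduce: 2*(p-1) steps, each moving num_bytes/p.
        p = self.num_npus
        chunk = num_bytes / p
        return 2 * (p - 1) * self.network.transfer_time_us(chunk)

class WorkloadLayer:
    """Drives the training loop: compute, then synchronize gradients."""
    def __init__(self, system, compute_time_us, grad_bytes):
        self.system = system
        self.compute_time_us = compute_time_us
        self.grad_bytes = grad_bytes

    def iteration_time_us(self):
        return self.compute_time_us + self.system.all_reduce_time_us(self.grad_bytes)

# Illustrative numbers: 8 NPUs, 100 GB/s links, 100 MB of gradients.
net = NetworkLayer(bandwidth_gbps=100, latency_us=1.0)
sys_layer = SystemLayer(net, num_npus=8)
wl = WorkloadLayer(sys_layer, compute_time_us=5000, grad_bytes=100e6)
print(f"iteration time: {wl.iteration_time_us():.0f} us")  # prints 6764 us
```

The point of the separation is that each layer can be swapped independently: a different collective algorithm changes only `SystemLayer`, a different topology changes only `NetworkLayer`.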

Current Limitations of ASTRA-sim 2.0:
- Software technology is evolving
  - Many new parallelism strategies have emerged (3D parallelism, FSDP, ZeRO, …)
  - ASTRA-sim cannot support these new parallelism strategies
- Hardware technology is evolving
  - ASTRA-sim's gem5 network layer has limitations in modeling certain platforms,
    e.g., multi-dimensional network topologies with hierarchical bandwidths interconnecting NPUs
  - Naive scaling approaches fall short:
    - Increasing aggregate bandwidth is fundamentally limited by technology
    - Scaling out with NICs runs into dollar-cost, power, and thermal problems
- Memory disaggregation is becoming important
  - ASTRA-sim models memory with a single bandwidth number, which cannot capture new architectures
  - e.g., CXL (Compute Express Link) allows GPUs to access a larger remote memory pool
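The single-bandwidth-number limitation can be made concrete with a toy model: a flat model assigns one bandwidth to all memory traffic, while a disaggregated (CXL-style) setup has a fast local tier and a slower, higher-latency remote pool. The numbers below are illustrative assumptions, not measurements from any real platform.

```python
# Toy model showing why one bandwidth number can't capture disaggregated
# memory. All bandwidths/latencies below are illustrative assumptions.

def access_time_us(num_bytes, bandwidth_gb_s, latency_us=0.0):
    # 1 GB/s = 1e3 bytes/us
    return latency_us + num_bytes / (bandwidth_gb_s * 1e3)

size = 64e6  # a 64 MB tensor

# Flat model: one bandwidth number for every access (the ASTRA-sim approach).
flat = access_time_us(size, bandwidth_gb_s=900)

# Tiered model: local HBM is fast; a remote CXL pool adds latency and has less BW.
local_hbm = access_time_us(size, bandwidth_gb_s=900)
remote_cxl = access_time_us(size, bandwidth_gb_s=64, latency_us=2.0)

print(f"flat model:  {flat:.1f} us")
print(f"local HBM:   {local_hbm:.1f} us")
print(f"remote CXL:  {remote_cxl:.1f} us")  # an order of magnitude slower;
                                            # invisible to the flat model
```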
Background (Distributed Training)
Training Types:
- Asynchronous Training: each NPU updates the shared parameters independently, without waiting for the others
  - Suffers from convergence problems
- Synchronous Training: each NPU computes independently, then all NPUs synchronize before proceeding to the next iteration
  - Synchronization is done with collective communications
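The synchronization step above is typically an All-Reduce: every NPU contributes its local gradient and ends the collective holding the same averaged result. A minimal sketch of the logical effect (ignoring how the collective is actually routed over the network; function and variable names are illustrative):

```python
# Logical effect of synchronous data-parallel training: each NPU computes a
# local gradient, then an all-reduce leaves every NPU with the same average.

def all_reduce_mean(local_grads):
    """Return what each NPU holds after an averaging all-reduce."""
    n = len(local_grads)
    num_params = len(local_grads[0])
    avg = [sum(g[i] for g in local_grads) / n for i in range(num_params)]
    return [avg[:] for _ in local_grads]  # every NPU gets an identical copy

# 4 NPUs, each with a local gradient over 3 parameters
grads = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0],
         [0.0, 0.0, 0.0],
         [4.0, 4.0, 4.0]]
synced = all_reduce_mean(grads)
print(synced[0])  # [2.0, 2.0, 2.0] -- identical on every NPU
```

Because every NPU blocks until the collective completes, the slowest NPU and the network gate each iteration; that is exactly the traffic the system and network layers of the simulator are built to model.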