Problem: Collective communication algorithms and ML workloads have so far been optimized in isolation; distributed ML workloads need to be co-designed with the collective algorithms they rely on.


Solution: Use a common representation (Chakra ET) for both ML workloads and collective algorithms, which can then be ingested either by simulators or by execution runtimes.

Benefits of Using Chakra ET:

  1. Co-optimize collective communication with other workload-related operations
  2. Interoperability across different tools (e.g., ASTRA-sim for simulation, MSCCL-Runtime for execution/validation of algorithms)
  3. No per-simulator knowledge required to implement collective algorithms
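To make the common-representation idea concrete, here is a minimal sketch of what a shared execution-trace graph could look like, with one toy "simulator" consuming it. This is an illustrative assumption, not the actual Chakra ET schema: the `Node` fields, node kinds, and `simulate` cost model are all hypothetical, chosen only to show how a single trace format can serve different downstream tools.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical trace node; field names are illustrative, not Chakra's schema.
    node_id: int
    kind: str              # "compute" or "comm"
    duration_us: int = 0   # compute cost (compute nodes)
    comm_size: int = 0     # bytes moved (comm nodes)
    deps: list = field(default_factory=list)  # ids of dependency nodes

def toposort(nodes):
    """Order nodes so every node follows all of its dependencies."""
    by_id = {n.node_id: n for n in nodes}
    order, seen = [], set()
    def visit(n):
        if n.node_id in seen:
            return
        for d in n.deps:
            visit(by_id[d])
        seen.add(n.node_id)
        order.append(n)
    for n in nodes:
        visit(n)
    return order

def simulate(nodes, bw_bytes_per_us=100):
    """Toy simulator consumer: serial makespan estimate over the trace."""
    total = 0
    for n in toposort(nodes):
        if n.kind == "compute":
            total += n.duration_us
        else:  # comm node: time = size / assumed bandwidth
            total += n.comm_size // bw_bytes_per_us
    return total

trace = [
    Node(0, "compute", duration_us=50),
    Node(1, "comm", comm_size=4000, deps=[0]),   # e.g. an all-reduce
    Node(2, "compute", duration_us=30, deps=[1]),
]
print(simulate(trace))  # 50 + 4000/100 + 30 = 120
```

The point of the sketch is the separation of concerns: the trace encodes *what* happens (compute and collective operations plus dependencies), while each consumer, whether a simulator or an execution runtime, decides *how* to interpret it.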

Background

Upstream Collective Algorithm Producers

Downstream Distributed Machine Learning Tools

Takeaways