Problem: Collective-communication algorithms have so far been optimized in isolation, and there is a need to co-design distributed ML workloads with collective algorithms.
- lack of standardization forces downstream tools (simulation/execution) to rely on their own internal implementations
- ML workload operations and collective operations are optimized independently

Solution: Use a common representation (Chakra ET) for both ML workloads and collective algorithms that can be ingested by simulators or execution runtimes
Benefits of Using Chakra ET:
- Co-optimize collective communication with other workload-related operations (see the overlap sketch after this list)
- Interoperability across different tools (ASTRA-sim for simulations and MSCCL-Runtime for execution/validation of algorithms)
- No need for per-simulator knowledge to implement collective algorithms
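
Chakra ET represents a workload as a graph of dependent operations. The toy Python sketch below is not Chakra's actual schema; the Node class, the microsecond costs, and the list scheduler are invented for illustration. It only shows why putting compute and communication nodes into one dependency graph enables co-optimization, e.g. overlapping a gradient all-reduce with the next layer's backward compute.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Node:
    """One operation in a unified workload + communication dependency graph."""
    name: str
    kind: str                      # "compute" or "comm"
    duration_us: float             # cost estimate a scheduler/simulator would use
    deps: List["Node"] = field(default_factory=list)


# Gradient compute (backward) and gradient all-reduce for two layers: the
# all-reduce of layer 3's gradients depends only on layer 3's backward pass,
# so it can overlap with layer 2's backward compute.
bwd3 = Node("backward_layer3", "compute", 120.0)
ar3  = Node("allreduce_grad3", "comm",    300.0, deps=[bwd3])
bwd2 = Node("backward_layer2", "compute", 110.0, deps=[bwd3])
ar2  = Node("allreduce_grad2", "comm",    280.0, deps=[bwd2])
nodes = [bwd3, ar3, bwd2, ar2]


def ready(done):
    """Nodes whose dependencies have all finished."""
    return [n for n in nodes if n not in done and all(d in done for d in n.deps)]


# Trivial list scheduler: everything that is ready runs concurrently, so
# overlapping comm with compute shortens the makespan versus running the
# four nodes back to back (700 us vs. 810 us with these made-up costs).
done, makespan_us = set(), 0.0
while len(done) < len(nodes):
    batch = ready(done)
    makespan_us += max(n.duration_us for n in batch)
    done.update(batch)

print(f"overlapped makespan ~ {makespan_us:.0f} us")
```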
Background
Upstream Collective Algorithm Producers
- MSCCLang: a domain-specific language that lets users construct NCCL-based collective algorithms; programs are compiled to an XML representation (MSCCL-IR) and run on the MSCCL-Runtime
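
As a concrete example of what such an algorithm encodes, the snippet below is plain Python, not MSCCLang syntax (ring_allgather_schedule is a made-up helper). It produces the per-step, per-rank transfer plan of a ring allgather, i.e. the kind of fine-grained send/receive schedule that an MSCCLang program expresses and the compiler lowers to MSCCL-IR.

```python
# Plain Python, not MSCCLang: emit the per-step transfer schedule of a ring
# allgather, where in step s each rank forwards the chunk it received in the
# previous step (initially its own chunk) to its right neighbor.
def ring_allgather_schedule(num_ranks):
    steps = []
    for s in range(num_ranks - 1):
        transfers = []
        for r in range(num_ranks):
            chunk = (r - s) % num_ranks          # chunk rank r currently holds
            transfers.append({"src": r, "dst": (r + 1) % num_ranks, "chunk": chunk})
        steps.append(transfers)
    return steps


for s, transfers in enumerate(ring_allgather_schedule(4)):
    print(f"step {s}: {transfers}")
```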
Downstream Distributed Machine Learning Tools
- ASTRA-sim: a simulator for distributed machine learning
- MSCCL-Runtime: a runtime for executing collective algorithms
Takeaways
- Because downstream tools can easily ingest any collective algorithm expressed in Chakra ET, they can expand a workload's collective communication using the provided algorithm (see the sketch below)
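
A minimal sketch of that expansion, assuming an invented dict-based node format and a made-up expand_collective helper (this is not ASTRA-sim's or MSCCL-Runtime's API): a single collective node in the workload graph is replaced by the fine-grained transfer steps of whatever algorithm schedule is supplied.

```python
# Hypothetical sketch (invented node format, not a real tool API): expand one
# collective node into the fine-grained transfers of a user-provided algorithm.
def expand_collective(node, algorithm_schedule):
    """Replace a single collective node with per-step send/recv nodes."""
    expanded = []
    for step, transfers in enumerate(algorithm_schedule):
        for t in transfers:
            expanded.append({
                "name": f'{node["name"]}_step{step}_r{t["src"]}to{t["dst"]}',
                "kind": "send_recv",
                "src": t["src"],
                "dst": t["dst"],
                "chunk": t["chunk"],
            })
    return expanded


# A two-rank ring allgather has a single step: each rank sends its chunk to the other.
schedule = [[{"src": 0, "dst": 1, "chunk": 0}, {"src": 1, "dst": 0, "chunk": 1}]]
allgather_node = {"name": "allgather_weights", "kind": "comm"}
for n in expand_collective(allgather_node, schedule):
    print(n)
```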