The cloned GitHub repository does not contain example Chakra execution traces (ETs) for ASTRA-sim + NS-3, as shown in the error below:

image-20240920-165525.png

We must generate them with the provided Chakra ET generator.


  1. Go to the directory that holds the Chakra ET generator:
cd ~/astra-sim/extern/graph_frontend/chakra
  1. Run the following command to generate example traces. More information is available in the generator source: chakra/src/generator/generator.py in the mlcommons/chakra repository (main branch)

python3 -m chakra.et_generator.et_generator --num_npus 8 --num_dims 1

The generator accepts the following options:

usage: et_generator.py [-h] [--num_npus NUM_NPUS] [--num_dims NUM_DIMS]
                      [--default_runtime DEFAULT_RUNTIME]
                      [--default_tensor_size DEFAULT_TENSOR_SIZE]
                      [--default_comm_size DEFAULT_COMM_SIZE]
  --num_npus NUM_NPUS   Number of NPUs
  --num_dims NUM_DIMS   Number of dimensions in the network topology
  --default_runtime DEFAULT_RUNTIME
                        Default runtime of compute nodes
  --default_tensor_size DEFAULT_TENSOR_SIZE
                        Default tensor size of memory nodes
  --default_comm_size DEFAULT_COMM_SIZE
                        Default communication size of communication nodes
  1. This will generate simple traces for the following examples:
one_metadata_node_all_types(args.num_npus)
one_remote_mem_load_node(args.num_npus, args.default_tensor_size)
one_remote_mem_store_node(args.num_npus, args.default_tensor_size)
one_comp_node(args.num_npus, args.default_runtime)
two_comp_nodes_independent(args.num_npus, args.default_runtime)
two_comp_nodes_dependent(args.num_npus, args.default_runtime)
one_comm_coll_node_allreduce(args.num_npus, args.default_comm_size)
one_comm_coll_node_alltoall(args.num_npus, args.default_comm_size)
one_comm_coll_node_allgather(args.num_npus, args.default_comm_size)
one_comm_coll_node_reducescatter(args.num_npus, args.default_comm_size)
one_comm_coll_node_broadcast(args.num_npus, args.default_comm_size)
one_comm_coll_node_barrier(args.num_npus)
one_comm_send_node(args.num_npus, args.default_tensor_size)
one_comm_recv_node(args.num_npus, args.default_tensor_size)
  1. Let’s take a look at an example – one_comm_coll_node_allgather
    1. This is a very simple ET with no dependencies. It has only a few Chakra attributes: communication type (ALL_GATHER), is_cpu_op, communication size, and involved dimensions.

      image-20240918-035243.png

      image-20240918-035339.png
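To inspect this example for every NPU, the generator run and the resulting file names can be scripted. The following is a minimal sketch, not part of Chakra: the helper names `build_generator_command` and `expected_trace_files` are hypothetical, and the per-NPU file naming pattern `<example>.<npu_id>.et` is an assumption that should be checked against the files the generator actually writes.

```python
import shlex


def build_generator_command(num_npus, num_dims, default_comm_size=None):
    """Build the argument list for the generator command used in this guide.

    Only flags listed in the generator's help output are used.
    """
    cmd = [
        "python3", "-m", "chakra.et_generator.et_generator",
        "--num_npus", str(num_npus),
        "--num_dims", str(num_dims),
    ]
    if default_comm_size is not None:
        cmd += ["--default_comm_size", str(default_comm_size)]
    return cmd


def expected_trace_files(example_name, num_npus):
    """List the trace files expected for one example.

    Assumption: the generator writes one file per NPU, named
    "<example>.<npu_id>.et". Verify this against the real output.
    """
    return [f"{example_name}.{npu_id}.et" for npu_id in range(num_npus)]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(build_generator_command(num_npus=8, num_dims=1)))
    for name in expected_trace_files("one_comm_coll_node_allgather", 8):
        print(name)
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the Chakra directory would execute the same command as above. Building the list separately makes it easy to sweep over different NPU counts.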

This is the ALL_GATHER node from the one_comm_coll_node_allgather trace, converted to JSON:

{
  "id": "80",
  "name": "ALL_GATHER",
  "type": "COMM_COLL_NODE",
  "attr": [
    {
      "name": "is_cpu_op",
      "boolVal": false
    },
    {
      "name": "comm_type",
      "int64Val": "2"
    },
    {
      "name": "comm_size",
      "uint64Val": "65536"
    },
    {
      "name": "involved_dim",
      "boolList": {
        "values": [
          true
        ]
      }
    }
  ]
}
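The attribute list in this JSON is easier to read once flattened into a name-to-value mapping. The sketch below uses only the Python standard library. The helper name `flatten_attrs` is hypothetical, and the code assumes each attribute carries exactly one value field besides `name`, as in the node above. The 64-bit integer fields appear as strings because that is how the protobuf JSON mapping serializes them, so they are converted back to `int`.

```python
import json

# The ALL_GATHER node from this guide, copied verbatim.
NODE_JSON = """
{
  "id": "80",
  "name": "ALL_GATHER",
  "type": "COMM_COLL_NODE",
  "attr": [
    {"name": "is_cpu_op", "boolVal": false},
    {"name": "comm_type", "int64Val": "2"},
    {"name": "comm_size", "uint64Val": "65536"},
    {"name": "involved_dim", "boolList": {"values": [true]}}
  ]
}
"""


def flatten_attrs(node):
    """Return {attribute name: Python value} for one ET node dictionary."""
    attrs = {}
    for attr in node.get("attr", []):
        # Assumption: exactly one value field accompanies "name".
        value_key = next(key for key in attr if key != "name")
        value = attr[value_key]
        if value_key in ("int64Val", "uint64Val"):
            value = int(value)  # 64-bit integers arrive as JSON strings
        elif value_key == "boolList":
            value = list(value["values"])
        attrs[attr["name"]] = value
    return attrs


node = json.loads(NODE_JSON)
attrs = flatten_attrs(node)
print(node["name"], node["type"], attrs)
```

Only the value field types that appear in this node are handled. Other attribute types would pass through unchanged.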