https://pytorch.org/tutorials/intermediate/dist_tuto.html#collective-communication
https://pdc-support.github.io/introduction-to-mpi/07-collective/index.html
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#reducescatter
- Broadcast: the same data is sent from the root rank to all ranks

- Scatter: the data in the root rank's send buffer is split into chunks, and each chunk is sent to a different rank

- Gather: each rank sends the data in its send buffer to the root rank, which collects all the chunks

- AllGather: each rank sends the data in its send buffer to all ranks, so every rank ends up with the full gathered result

- Reduce: each rank contributes a piece of data, and the pieces are combined element-wise (e.g. summed) on their way to the root rank into a single result

- AllReduce: same as Reduce, but the result is delivered to all ranks

- ReduceScatter: same combination step as Reduce, but the reduced result is split into chunks and each rank receives the chunk matching its rank index (equivalent to a Reduce followed by a Scatter); a runnable sketch of all of these follows the list

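A minimal sketch of these collectives using PyTorch's `torch.distributed` package (the API covered in the first link above). The script name `collectives_demo.py` and the 4-process launch are illustrative assumptions; it uses the CPU-friendly `gloo` backend, and `reduce_scatter` is guarded because backend support for it varies (NCCL always provides it, as in the third link).

```python
# Run with: torchrun --nproc_per_node=4 collectives_demo.py
# (file name and process count are illustrative assumptions)
import torch
import torch.distributed as dist


def main():
    # torchrun sets MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE for us.
    dist.init_process_group(backend="gloo")  # CPU-friendly backend
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Broadcast: rank 0's tensor is copied to every rank.
    x = torch.tensor([float(rank)])
    dist.broadcast(x, src=0)  # afterwards x == [0.] everywhere

    # Scatter: rank 0 splits a list of tensors, one chunk per rank.
    chunks = [torch.tensor([float(i)]) for i in range(world_size)] if rank == 0 else None
    recv = torch.zeros(1)
    dist.scatter(recv, scatter_list=chunks, src=0)  # recv == [rank]

    # Gather: every rank sends its tensor to rank 0.
    gathered = [torch.zeros(1) for _ in range(world_size)] if rank == 0 else None
    dist.gather(recv, gather_list=gathered, dst=0)

    # AllGather: every rank ends up with every rank's tensor.
    all_gathered = [torch.zeros(1) for _ in range(world_size)]
    dist.all_gather(all_gathered, recv)

    # Reduce: element-wise sum of all contributions; the result lands on rank 0 only.
    y = torch.tensor([float(rank)])
    dist.reduce(y, dst=0, op=dist.ReduceOp.SUM)

    # AllReduce: same sum, but every rank receives the result.
    z = torch.tensor([float(rank)])
    dist.all_reduce(z, op=dist.ReduceOp.SUM)  # z == [0 + 1 + ... + (world_size - 1)]

    # ReduceScatter: sum across ranks, then each rank keeps the chunk matching
    # its rank index. Guarded because some backends (e.g. gloo) may lack it.
    try:
        inputs = [torch.tensor([float(rank + i)]) for i in range(world_size)]
        out = torch.zeros(1)
        dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
    except RuntimeError:
        out = None  # backend without reduce_scatter support

    print(f"rank {rank}: broadcast={x.item()}, allreduce={z.item()}, reduce_scatter={out}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Switching to `backend="nccl"` with CUDA tensors gives the GPU versions of the same collectives described in the NCCL documentation linked above.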