Intra-Operator Parallelism (= Partitioning)

- Data Parallelism (a minimal sketch follows this list)
  - Data is split into mini-batches → each device gets a different mini-batch to process
  - Each worker computes gradients on its own, independent data split
  - Gradients are synchronised with the other workers (all-reduce) before the weight update
  - Note: not recommended on its own for large models, since the full model is replicated on every device (memory cost)
- Operator Parallelism (see the second sketch after this list)
  - Approaches that partition the computation of a single operator
  - Each part of the operator is computed in parallel across multiple devices
  - Note: Communication
    - Communication is necessary because the input tensors are jointly partitioned along different axes
    - Since the parts run in parallel, a device may depend on tensor shards or partial results that live on other devices
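A minimal single-process sketch of the data-parallel update, simulating the devices with NumPy. The linear model, shard sizes, and the "all-reduce as an average" step are assumptions for illustration; the pattern itself (replicate the weights, split the batch, average the gradients, apply the same update everywhere) is the one described above.

```python
import numpy as np

# Hypothetical setup: 4 simulated "devices", each holding a full replica of
# the weights of a linear model y = x @ w (this is the memory cost noted above).
num_devices = 4
rng = np.random.default_rng(0)
w_replicas = [np.ones((8, 1)) for _ in range(num_devices)]  # identical replicas

# One global mini-batch, split into per-device shards along the batch axis.
x = rng.normal(size=(32, 8))
y = rng.normal(size=(32, 1))
x_shards = np.split(x, num_devices)
y_shards = np.split(y, num_devices)

def local_gradient(w, xb, yb):
    """Mean-squared-error gradient computed on this device's shard only."""
    pred = xb @ w
    return 2.0 * xb.T @ (pred - yb) / len(xb)

# Each device computes a gradient on its own shard (independently, in parallel).
local_grads = [local_gradient(w_replicas[d], x_shards[d], y_shards[d])
               for d in range(num_devices)]

# All-reduce: average the gradients so every device sees the same value.
global_grad = sum(local_grads) / num_devices

# Every replica applies the same update, so the replicas stay in sync.
lr = 0.1
w_replicas = [w - lr * global_grad for w in w_replicas]

assert all(np.allclose(w_replicas[0], w) for w in w_replicas)
```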
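A minimal sketch of operator parallelism for a single matrix multiply, again simulating two devices with NumPy. Splitting the weight along the contraction (inner) axis is an assumption chosen to make the communication step explicit: each device produces only a partial result, and the partials must be summed across devices (an all-reduce) before the operator's output is complete.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))   # activations
w = rng.normal(size=(8, 16))  # weight of a single matmul operator

# Partition one operator (x @ w) across 2 devices by splitting the contraction
# axis: device 0 holds x[:, :4] and w[:4, :], device 1 holds x[:, 4:] and w[4:, :].
x0, x1 = np.split(x, 2, axis=1)
w0, w1 = np.split(w, 2, axis=0)

# Each device computes its partial result in parallel.
partial0 = x0 @ w0
partial1 = x1 @ w1

# Communication: the partial results live on different devices, so a sum
# across devices (all-reduce) is needed to assemble the operator's output.
y = partial0 + partial1

assert np.allclose(y, x @ w)
```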
Inter-Operator Parallelism

- Pipeline Parallelism
  - Places different groups (stages) of the model's operators on different workers
  - Each mini-batch is split into micro-batches, and the forward and backward passes are pipelined across the micro-batches on the distributed workers (a schedule sketch follows this list)
  - Communication:
    - Devices communicate only between adjacent pipeline stages (point-to-point communication)
- Inter-operator parallelism usually leaves some devices idle ("pipeline bubbles") due to the data dependencies between stages
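A minimal sketch of a GPipe-style forward-only schedule, assuming 3 stages and 4 micro-batches (both numbers are arbitrary). It only prints which micro-batch each stage works on at each step, which makes the idle slots (bubbles) at the start and end of the pipeline visible.

```python
num_stages = 3        # devices, one pipeline stage each
num_microbatches = 4  # micro-batches cut from one mini-batch

# GPipe-style forward schedule: at step t, stage s works on micro-batch t - s
# (if that micro-batch exists). Everything else is an idle slot (a "bubble").
total_steps = num_stages + num_microbatches - 1
for t in range(total_steps):
    row = []
    for s in range(num_stages):
        m = t - s
        row.append(f"mb{m}" if 0 <= m < num_microbatches else " . ")
    print(f"step {t}: " + "  ".join(f"stage{s}:{cell}" for s, cell in enumerate(row)))
```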