References

distributed_basics.pdf

https://www.youtube.com/watch?v=JA1l96tjrs4


Intra-Operator Parallelism (= Partitioning)

  1. Data Parallelism
    1. Data is split into mini-batches → each device processes a different mini-batch
    2. Each worker computes gradients independently on its own data split
    3. Gradients are synchronised across workers with an all-reduce before the weight update, so all replicas stay identical (see the sketch after this list)
    4. Note: not well suited to very large models on its own, since the full model is replicated on every device (memory cost)
  2. Operator Parallelism
    1. Partitions the computation of an individual operator (e.g. a large matmul) along its tensor axes
    2. Each device computes its part of the operator in parallel
    3. Note: Communication
      1. Communication is necessary because the input tensors are jointly partitioned along different axes
      2. Because each device only holds a shard, the next operator may need tensor pieces that live on other devices, so collective communication (e.g. all-reduce / all-gather) is required between partitioned operators (see the second sketch after this list)
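
A minimal sketch of the data-parallel flow, simulated with NumPy: the toy linear model, the two simulated "devices", and the helper grad are illustrative assumptions rather than anything from the references above.

```python
import numpy as np

# Toy linear model: loss = mean((x @ w - y)**2), simulated on two "devices".
rng = np.random.default_rng(0)
w = rng.normal(size=(4,))                      # weights, replicated on every device
x, y = rng.normal(size=(8, 4)), rng.normal(size=(8,))

def grad(w, xb, yb):
    # Gradient of the MSE loss for one device's shard of the mini-batch.
    err = xb @ w - yb
    return 2.0 * xb.T @ err / len(yb)

# 1. Split the mini-batch: each device gets a different shard.
shards = np.split(np.arange(8), 2)
# 2. Each device computes gradients on its own shard (in parallel in practice).
local_grads = [grad(w, x[idx], y[idx]) for idx in shards]
# 3. All-reduce: average the gradients so every device holds the same result.
synced_grad = np.mean(local_grads, axis=0)
# 4. Every device applies the identical update, keeping the replicas in sync.
w = w - 0.1 * synced_grad
```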

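A minimal sketch of operator (tensor) parallelism for two chained matmuls, again simulated with NumPy: the column/row weight split and the final sum standing in for an all-reduce are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))            # activations, replicated on both devices
w1 = rng.normal(size=(16, 32))          # first matmul weight
w2 = rng.normal(size=(32, 16))          # second matmul weight

# Partition the operators across two simulated devices:
# w1 is split along its output (column) axis, w2 along its input (row) axis.
w1_parts = np.split(w1, 2, axis=1)      # each device holds a (16, 16) shard
w2_parts = np.split(w2, 2, axis=0)      # each device holds a (16, 16) shard

# Each device computes only its shard of both matmuls; no communication is needed
# yet, because the column split of w1 lines up with the row split of w2.
partial_outputs = [(x @ w1_parts[d]) @ w2_parts[d] for d in range(2)]

# All-reduce: sum the partial results so every device sees the full output.
y_parallel = partial_outputs[0] + partial_outputs[1]

# Reference computation on a single "device" matches the partitioned one.
y_single = (x @ w1) @ w2
print(np.allclose(y_parallel, y_single))   # True
```

Splitting w1 by columns and w2 by rows is chosen here so the two matmuls can run back-to-back on each device with a single all-reduce at the end.
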
Inter-Operator Parallelism

  1. Pipeline Parallelism
    1. Places different groups of the model's operators (pipeline stages) on different workers
    2. Each mini-batch is split into micro-batches, and the forward and backward passes are pipelined across the micro-batches on the distributed workers
    3. Communication:
      1. Devices communicate only between adjacent pipeline stages (point-to-point communication)
      2. Inter-operator parallelism usually leaves some devices idle at a given time (pipeline bubbles) because of the data dependencies between stages (see the sketch after this list)
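
A minimal forward-only sketch of pipeline parallelism with two simulated stages and four micro-batches (the schedule, stage sizes, and names are illustrative assumptions): at step t, stage s processes micro-batch t - s, and steps where that index is out of range are the idle slots (bubbles).

```python
import numpy as np

rng = np.random.default_rng(0)
# Two pipeline stages, each a simple linear layer placed on its own "device".
stage_weights = [rng.normal(size=(8, 8)), rng.normal(size=(8, 4))]

def stage_forward(stage, h):
    # Forward pass of one stage; in a real system this runs on that stage's device.
    return np.tanh(h @ stage_weights[stage])

# Split the mini-batch into micro-batches that flow through the pipeline.
minibatch = rng.normal(size=(16, 8))
micro_batches = np.split(minibatch, 4)          # 4 micro-batches of 4 samples each

n_micro, n_stages = len(micro_batches), len(stage_weights)
activations = {}                                # (stage, micro-batch id) -> output
outputs = [None] * n_micro

# Forward-only schedule: at step t, stage s works on micro-batch m = t - s.
# Steps where m is out of range are pipeline bubbles (that device sits idle).
for t in range(n_micro + n_stages - 1):
    for s in range(n_stages):
        m = t - s
        if 0 <= m < n_micro:
            # Stage 0 reads the raw micro-batch; later stages receive the previous
            # stage's activations via point-to-point communication.
            inp = micro_batches[m] if s == 0 else activations[(s - 1, m)]
            activations[(s, m)] = stage_forward(s, inp)
            if s == n_stages - 1:
                outputs[m] = activations[(s, m)]

print(np.concatenate(outputs).shape)            # (16, 4): full mini-batch processed
```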