Jun 2 2026 ~
https://www.youtube.com/watch?v=xkH8shGffRU
https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html
https://huggingface.co/docs/transformers/v4.15.0/en/parallelism
Notes on “Efficient Training of Large Language Models on Distributed Infrastructures: A Survey”
LLM Training Steps
Distributed Training Parallelism:
Data Parallelism (DP): Split the data into batches and copy the model onto each participating GPU; an AllReduce/AllGather operation for each layer keeps the replicas' gradients in sync (see the DP sketch after this list)
Model Parallelism (MP): Split the model horizontally or vertically
Tensor Parallelism (TP): Split the model horizontally (mainly to compute matrix multiplications faster); each GPU holds a portion of the split parameter matrix (see the TP sketch after this list)
Pipeline Parallelism (PP): Split the model vertically (a group of layers is distributed across GPUs)
Split each batch into smaller chunks (micro-batches) and feed them one after another to the first stage of the model (first GPU), so downstream GPUs are not left idle waiting for the whole batch (see the PP sketch after this list)
Interleaved Pipeline: each GPU holds several non-contiguous chunks of layers rather than one contiguous stage, which further reduces the pipeline bubble
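
A minimal sketch of DP gradient synchronization, assuming PyTorch with torch.distributed initialized via torchrun (one process per GPU); the file name dp_sketch.py, the gloo backend, and the toy Linear model are illustrative assumptions, not from the survey:

```python
# dp_sketch.py -- run with: torchrun --nproc_per_node=2 dp_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")   # use "nccl" on real GPUs
    world = dist.get_world_size()

    model = torch.nn.Linear(8, 2)             # every replica holds a full copy of the model
    batch = torch.randn(4, 8)                 # each rank would load its own shard of the data
    model(batch).pow(2).mean().backward()     # local forward/backward on the local batch

    # DP synchronization: sum the per-replica gradients, then average them,
    # so the optimizer step matches what one large batch on one GPU would produce
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world

    if dist.get_rank() == 0:
        print(f"gradients averaged across {world} replicas")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

PyTorch's DistributedDataParallel does the same averaging, but overlaps the AllReduce with the backward pass bucket by bucket.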
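
A single-process sketch of column-wise tensor parallelism, assuming PyTorch; the two weight shards stand in for two GPUs, and the final concatenation plays the role of the AllGather that a real TP layer would issue:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 16)        # activations, replicated on every TP rank
W = torch.randn(32, 16)       # full weight of a linear layer: y = x @ W.T

# 2-way tensor parallelism: split W along the output dimension,
# so each "GPU" computes only its slice of the matmul
w0, w1 = W.chunk(2, dim=0)            # two (16, 16) shards
y0, y1 = x @ w0.T, x @ w1.T           # partial results, computed in parallel in a real setup
y_tp = torch.cat([y0, y1], dim=-1)    # AllGather of the output slices

assert torch.allclose(y_tp, x @ W.T, atol=1e-5)   # identical to the unsharded layer
```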
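
A single-process sketch of the micro-batching idea behind PP (GPipe-style forward pass), assuming PyTorch; the two nn.Sequential stages stand in for layer groups that would sit on separate GPUs, and the loop runs sequentially here whereas real stages overlap in time:

```python
import torch
import torch.nn as nn

# Two pipeline stages (groups of layers); in a real run each lives on its own GPU
stage0 = nn.Sequential(nn.Linear(16, 64), nn.ReLU())
stage1 = nn.Sequential(nn.Linear(64, 8))

batch = torch.randn(32, 16)
micro_batches = batch.chunk(4)   # split the batch into 4 micro-batches

# Feed micro-batches through the pipeline one after another: as soon as
# stage0 finishes micro-batch 0, stage1 can start on it while stage0 already
# works on micro-batch 1, which is what shrinks the idle "bubble"
outputs = []
for mb in micro_batches:
    act = stage0(mb)     # runs on GPU 0 in a real setup
    out = stage1(act)    # runs on GPU 1 (activations sent over the interconnect)
    outputs.append(out)

y = torch.cat(outputs)   # same result as running the full batch through both stages
print(y.shape)           # torch.Size([32, 8])
```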