Jun 2 2026 ~

Introductory Background

https://www.youtube.com/watch?v=xkH8shGffRU

https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html

https://huggingface.co/docs/transformers/v4.15.0/en/parallelism

Papers

Notes on “Efficient Training of Large Language Models on Distributed Infrastructures: A Survey”


LLM Training Steps

  1. Forward Pass
  2. Backward Pass (Gradient Calculation)
  3. Optimizer Step - update the parameters using the calculated gradients
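
These three steps map directly onto a standard training loop. A minimal PyTorch sketch (the model, loss, optimizer, and batch below are arbitrary placeholders, not tied to any particular setup):

```python
import torch
import torch.nn as nn

# Arbitrary placeholder model, data, loss, and optimizer for illustration.
model = nn.Linear(128, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = torch.randn(32, 128)           # one mini-batch of inputs
targets = torch.randint(0, 10, (32,))   # matching labels

# 1. Forward pass: compute predictions and the loss.
loss = criterion(model(inputs), targets)

# 2. Backward pass: compute gradients of the loss w.r.t. every parameter.
optimizer.zero_grad()
loss.backward()

# 3. Optimizer step: update the parameters using the calculated gradients.
optimizer.step()
```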

Distributed Training Parallelism:

  1. Data Parallelism (DP): Split the data into batches (shards) and replicate the full model on every participating GPU; after the backward pass the gradients are averaged (all-reduced) across GPUs so each replica applies the same optimizer step (see the DP sketch after this list)

  2. Model Parallelism (MP): Split the model horizontally or vertically

    1. Tensor Parallelism (TP): Split the model horizontally (mainly to compute large matrix multiplications faster); each GPU holds a shard of the split parameter matrix (see the TP sketch after this list)

      • Synchronization: required after consecutive matrix multiplications, to combine the partial results computed by each GPU
      • Limitation: ..
    2. Pipeline Parallelism (PP): Split the model vertically; consecutive layers are grouped into stages and the stages are distributed across GPUs (see the PP sketch after this list)
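
A minimal sketch of DP using PyTorch's DistributedDataParallel wrapper, assuming one process per GPU launched via torchrun (the model, batch, and hyperparameters are placeholders):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun starts one process per GPU and sets RANK / LOCAL_RANK / WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Every rank holds a full replica of the model.
    model = DDP(nn.Linear(128, 10).to(device), device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Each rank would normally receive a different shard of the data
    # (e.g. via DistributedSampler); a random batch stands in for that here.
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()    # DDP averages (all-reduces) gradients across ranks here
    optimizer.step()   # so every replica applies the identical update

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=4 dp_sketch.py` (the filename is hypothetical).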
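
A minimal sketch of the TP idea: the weight matrix of one linear layer is split column-wise and each shard computes a partial matmul. This is a single-process simulation (tensor chunks stand in for GPUs) just to show where the synchronization point lives:

```python
import torch

torch.manual_seed(0)
batch, d_in, d_out, n_shards = 4, 8, 6, 2

x = torch.randn(batch, d_in)   # activations, replicated on every shard
w = torch.randn(d_in, d_out)   # full weight matrix of one linear layer

# Column parallelism: each "GPU" keeps a vertical slice of the weight matrix.
w_shards = torch.chunk(w, n_shards, dim=1)       # shape [d_in, d_out / n_shards] each

# Each shard computes its partial output independently (no communication needed yet).
partials = [x @ w_i for w_i in w_shards]         # shape [batch, d_out / n_shards] each

# Synchronization point: the partial outputs are gathered and concatenated before the
# next operation (an all-gather across the tensor-parallel group in a real setup).
y_tp = torch.cat(partials, dim=1)

assert torch.allclose(y_tp, x @ w, atol=1e-6)    # matches the unsplit matmul
```

Splitting the weight row-wise instead gives each shard a partial sum of the full output, so the synchronization becomes an all-reduce rather than an all-gather.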
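
A minimal sketch of PP with GPipe-style micro-batching: a toy four-layer model is cut vertically into two stages, and the batch is cut into micro-batches that flow through the stages in sequence. Both stages live on CPU here so the sketch runs anywhere; on real hardware each stage would be moved to its own GPU (e.g. `stage0.to("cuda:0")`, `stage1.to("cuda:1")`):

```python
import torch
import torch.nn as nn

# Stage = a group of consecutive layers; the model is split vertically into two stages.
stage0 = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
stage1 = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))

batch = torch.randn(32, 128)

# The batch is split into micro-batches so that, on real hardware, stage 1 can work on
# micro-batch i while stage 0 already processes micro-batch i+1 (pipelining).
outputs = []
for micro_batch in torch.chunk(batch, 4, dim=0):
    activations = stage0(micro_batch)     # stage 0 runs its layers, then hands off
    outputs.append(stage1(activations))   # stage 1 consumes stage 0's activations

logits = torch.cat(outputs, dim=0)
print(logits.shape)  # torch.Size([32, 10])
```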