Jun 2 2026 ~

Introductory Background

https://www.youtube.com/watch?v=xkH8shGffRU

https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html

https://huggingface.co/docs/transformers/v4.15.0/en/parallelism

Papers

Notes on “Efficient Training of Large Language Models on Distributed Infrastructures: A Survey”


LLM Training Steps

  1. Forward Pass
  2. Backward Pass (Gradient Calculation)
  3. Optimizer Step - update the parameters using the calculated gradients
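
These three steps map directly onto a standard training loop. A minimal PyTorch sketch (the model, loss, optimizer, and batch below are arbitrary placeholders, not tied to any particular setup):

```python
import torch
import torch.nn as nn

# Arbitrary placeholder model, data, loss, and optimizer for illustration.
model = nn.Linear(128, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = torch.randn(32, 128)           # one mini-batch of inputs
targets = torch.randint(0, 10, (32,))   # matching labels

# 1. Forward pass: compute predictions and the loss.
loss = criterion(model(inputs), targets)

# 2. Backward pass: compute gradients of the loss w.r.t. every parameter.
optimizer.zero_grad()
loss.backward()

# 3. Optimizer step: update the parameters using the calculated gradients.
optimizer.step()
```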

Distributed Training Parallelism:

  1. Data Parallelism (DP): Split the data into batches (shards) and replicate the full model on every participating GPU; after the backward pass the gradients are averaged (all-reduced) across GPUs so each replica applies the same optimizer step (see the DP sketch after this list)

  2. Model Parallelism (MP): Split the model horizontally or vertically

    1. Tensor Parallelism (TP): Split the model horizontally (mainly to compute large matrix multiplications faster); each GPU holds a shard of the split parameter matrix (see the TP sketch after this list)

      • Synchronization: required after consecutive matrix multiplications, to combine the partial results computed by each GPU
      • Limitation: ..
    2. Pipeline Parallelism (PP): Split the model vertically; consecutive layers are grouped into stages and the stages are distributed across GPUs (see the PP sketch after this list)
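
A minimal sketch of DP using PyTorch's DistributedDataParallel wrapper, assuming one process per GPU launched via torchrun (the model, batch, and hyperparameters are placeholders):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun starts one process per GPU and sets RANK / LOCAL_RANK / WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Every rank holds a full replica of the model.
    model = DDP(nn.Linear(128, 10).to(device), device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Each rank would normally receive a different shard of the data
    # (e.g. via DistributedSampler); a random batch stands in for that here.
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()    # DDP averages (all-reduces) gradients across ranks here
    optimizer.step()   # so every replica applies the identical update

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=4 dp_sketch.py` (the filename is hypothetical).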
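
A minimal sketch of the TP idea: the weight matrix of one linear layer is split column-wise and each shard computes a partial matmul. This is a single-process simulation (tensor chunks stand in for GPUs) just to show where the synchronization point lives:

```python
import torch

torch.manual_seed(0)
batch, d_in, d_out, n_shards = 4, 8, 6, 2

x = torch.randn(batch, d_in)   # activations, replicated on every shard
w = torch.randn(d_in, d_out)   # full weight matrix of one linear layer

# Column parallelism: each "GPU" keeps a vertical slice of the weight matrix.
w_shards = torch.chunk(w, n_shards, dim=1)       # shape [d_in, d_out / n_shards] each

# Each shard computes its partial output independently (no communication needed yet).
partials = [x @ w_i for w_i in w_shards]         # shape [batch, d_out / n_shards] each

# Synchronization point: the partial outputs are gathered and concatenated before the
# next operation (an all-gather across the tensor-parallel group in a real setup).
y_tp = torch.cat(partials, dim=1)

assert torch.allclose(y_tp, x @ w, atol=1e-6)    # matches the unsplit matmul
```

Splitting the weight row-wise instead gives each shard a partial sum of the full output, so the synchronization becomes an all-reduce rather than an all-gather.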
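
A minimal sketch of PP with GPipe-style micro-batching: a toy four-layer model is cut vertically into two stages, and the batch is cut into micro-batches that flow through the stages in sequence. Both stages live on CPU here so the sketch runs anywhere; on real hardware each stage would be moved to its own GPU (e.g. `stage0.to("cuda:0")`, `stage1.to("cuda:1")`):

```python
import torch
import torch.nn as nn

# Stage = a group of consecutive layers; the model is split vertically into two stages.
stage0 = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
stage1 = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))

batch = torch.randn(32, 128)

# The batch is split into micro-batches so that, on real hardware, stage 1 can work on
# micro-batch i while stage 0 already processes micro-batch i+1 (pipelining).
outputs = []
for micro_batch in torch.chunk(batch, 4, dim=0):
    activations = stage0(micro_batch)     # stage 0 runs its layers, then hands off
    outputs.append(stage1(activations))   # stage 1 consumes stage 0's activations

logits = torch.cat(outputs, dim=0)
print(logits.shape)  # torch.Size([32, 10])
```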