Paper:
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey (Duan et al.)
Quick Links:
Chip-to-Chip Communication: data transfer between AI accelerators within a node
Node-to-Node Communication: data transfer between nodes
Transformer-Based LLMs
Background:
- Token embeddings are combined with positional information to preserve the sequential nature of the text
- Input token vectors are transformed into query (Q), key (K), and value (V) tensors via linear transformations; these are used to compute a weighted representation of the V tensor:
  Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
- Similarity between the queries and keys is used to calculate the weights
- This allows the LLM to focus on relevant parts of the input sequence
- Cost: Multi-Head Attention has high memory consumption (key-value cache)
- Addressed by Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-head Latent Attention (MLA), which share or compress K/V heads to shrink the cache (see the sketch below)
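A minimal NumPy sketch of the attention computation and the GQA-style fix above: queries keep the full number of heads, while several query heads share one key/value head, so the K/V cache shrinks by the grouping factor. Head counts, shapes, and the function name grouped_query_attention are illustrative assumptions, not from the survey; the output projection is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """x: (seq_len, d_model). Q gets n_q_heads heads; K/V get only n_kv_heads
    heads (n_kv_heads <= n_q_heads), so the KV cache is n_q_heads / n_kv_heads
    times smaller than in standard multi-head attention."""
    seq_len, d_model = x.shape
    d_head = d_model // n_q_heads

    # Linear projections, then split into heads.
    Q = (x @ Wq).reshape(seq_len, n_q_heads, d_head)
    K = (x @ Wk).reshape(seq_len, n_kv_heads, d_head)
    V = (x @ Wv).reshape(seq_len, n_kv_heads, d_head)

    group = n_q_heads // n_kv_heads  # query heads sharing one KV head
    outputs = []
    for h in range(n_q_heads):
        kv = h // group  # which shared KV head this query head uses
        # Similarity between queries and keys gives the attention weights.
        scores = Q[:, h, :] @ K[:, kv, :].T / np.sqrt(d_head)
        weights = softmax(scores, axis=-1)
        # Weighted representation of the V tensor.
        outputs.append(weights @ V[:, kv, :])
    return np.concatenate(outputs, axis=-1)  # (seq_len, d_model)

# Toy usage: 8 query heads share 2 KV heads (plain MHA would use 8 KV heads).
rng = np.random.default_rng(0)
d_model, n_q, n_kv, seq = 64, 8, 2, 16
x = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_model, d_model)) * 0.02
Wk = rng.standard_normal((d_model, (d_model // n_q) * n_kv)) * 0.02
Wv = rng.standard_normal((d_model, (d_model // n_q) * n_kv)) * 0.02
print(grouped_query_attention(x, Wq, Wk, Wv, n_q, n_kv).shape)  # (16, 64)
```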
LLM Training Workloads
- GPT, LLaMA, MOSS → Transformer architecture (the uniform architecture lets the same optimizations apply across models)
- Megatron, Alpa → accelerate training through hybrid parallelism
- DeepSpeed → reduces memory consumption by sharding optimizer states across data-parallel workers (ZeRO; see the sketch after this list)
- Generally adopt self-supervised training on extensive datasets to create foundation models, which are then adapted to various downstream tasks
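A minimal sketch of the optimizer-state-sharding idea behind DeepSpeed's ZeRO, not its actual API: each data-parallel rank keeps Adam moments for only its slice of the parameters, so per-rank optimizer memory drops roughly by the number of ranks. The AdamShard class, shard layout, and sizes below are illustrative assumptions.

```python
import numpy as np

N_RANKS = 4
params = np.zeros(1_000_000, dtype=np.float32)   # full parameters (replicated on every rank)
grads = np.ones_like(params)                     # gradients after all-reduce

# Partition parameter indices across ranks; each rank owns one contiguous shard.
bounds = np.linspace(0, params.size, N_RANKS + 1, dtype=int)

class AdamShard:
    """Adam state (m, v) held only for the owned slice of parameters."""
    def __init__(self, size, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.m = np.zeros(size, dtype=np.float32)
        self.v = np.zeros(size, dtype=np.float32)
        self.t, self.lr, self.b1, self.b2, self.eps = 0, lr, b1, b2, eps

    def step(self, p, g):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return p - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Each rank updates only its own shard; in a real system the updated shards are
# then all-gathered so every rank again holds the full parameter vector.
shard_opts = [AdamShard(bounds[r + 1] - bounds[r]) for r in range(N_RANKS)]
for r in range(N_RANKS):
    lo, hi = bounds[r], bounds[r + 1]
    params[lo:hi] = shard_opts[r].step(params[lo:hi], grads[lo:hi])
```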