Paper:
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey (Duan et al.)
Quick Links:
Chip-to-Chip Communication: data transfer between AI accelerators within a node
Node-to-Node Communication: data transfer between nodes
Transformer-Based LLMs
Background:
- Token embeddings are combined with positional information to preserve the sequential nature of the text
- Input token vectors are transformed into query (Q), key (K), and value (V) tensors via linear transformations; these are used to compute a weighted representation of the V tensor:
  Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
- Similarity between the queries and keys is used to calculate the weights
- This allows the LLM to focus on relevant parts of the input sequence
- Cost: Multi-Head Attention has high memory consumption (key-value cache)
- Addressed by Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-head Latent Attention (MLA), which share or compress K/V heads to shrink the cache (see the sketch below)
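A minimal NumPy sketch of the attention computation and the GQA-style fix above: queries keep the full number of heads, while several query heads share one key/value head, so the K/V cache shrinks by the grouping factor. Head counts, shapes, and the function name grouped_query_attention are illustrative assumptions, not from the survey; the output projection is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """x: (seq_len, d_model). Q gets n_q_heads heads; K/V get only n_kv_heads
    heads (n_kv_heads <= n_q_heads), so the KV cache is n_q_heads / n_kv_heads
    times smaller than in standard multi-head attention."""
    seq_len, d_model = x.shape
    d_head = d_model // n_q_heads

    # Linear projections, then split into heads.
    Q = (x @ Wq).reshape(seq_len, n_q_heads, d_head)
    K = (x @ Wk).reshape(seq_len, n_kv_heads, d_head)
    V = (x @ Wv).reshape(seq_len, n_kv_heads, d_head)

    group = n_q_heads // n_kv_heads  # query heads sharing one KV head
    outputs = []
    for h in range(n_q_heads):
        kv = h // group  # which shared KV head this query head uses
        # Similarity between queries and keys gives the attention weights.
        scores = Q[:, h, :] @ K[:, kv, :].T / np.sqrt(d_head)
        weights = softmax(scores, axis=-1)
        # Weighted representation of the V tensor.
        outputs.append(weights @ V[:, kv, :])
    return np.concatenate(outputs, axis=-1)  # (seq_len, d_model)

# Toy usage: 8 query heads share 2 KV heads (plain MHA would use 8 KV heads).
rng = np.random.default_rng(0)
d_model, n_q, n_kv, seq = 64, 8, 2, 16
x = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_model, d_model)) * 0.02
Wk = rng.standard_normal((d_model, (d_model // n_q) * n_kv)) * 0.02
Wv = rng.standard_normal((d_model, (d_model // n_q) * n_kv)) * 0.02
print(grouped_query_attention(x, Wq, Wk, Wv, n_q, n_kv).shape)  # (16, 64)
```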
LLM Training Workloads
- GPT, LLaMA, MOSS → Transformer architecture (the uniform architecture lets the same optimizations apply across models)
- Megatron, Alpa → accelerate training through hybrid parallelism
- DeepSpeed → reduces memory consumption by sharding optimizer states across data-parallel workers (ZeRO; see the sketch after this list)
- Generally adopt self-supervised training on extensive datasets to create foundation models, which are then adapted to various downstream tasks
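A minimal sketch of the optimizer-state-sharding idea behind DeepSpeed's ZeRO, not its actual API: each data-parallel rank keeps Adam moments for only its slice of the parameters, so per-rank optimizer memory drops roughly by the number of ranks. The AdamShard class, shard layout, and sizes below are illustrative assumptions.

```python
import numpy as np

N_RANKS = 4
params = np.zeros(1_000_000, dtype=np.float32)   # full parameters (replicated on every rank)
grads = np.ones_like(params)                     # gradients after all-reduce

# Partition parameter indices across ranks; each rank owns one contiguous shard.
bounds = np.linspace(0, params.size, N_RANKS + 1, dtype=int)

class AdamShard:
    """Adam state (m, v) held only for the owned slice of parameters."""
    def __init__(self, size, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.m = np.zeros(size, dtype=np.float32)
        self.v = np.zeros(size, dtype=np.float32)
        self.t, self.lr, self.b1, self.b2, self.eps = 0, lr, b1, b2, eps

    def step(self, p, g):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return p - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Each rank updates only its own shard; in a real system the updated shards are
# then all-gathered so every rank again holds the full parameter vector.
shard_opts = [AdamShard(bounds[r + 1] - bounds[r]) for r in range(N_RANKS)]
for r in range(N_RANKS):
    lo, hi = bounds[r], bounds[r + 1]
    params[lo:hi] = shard_opts[r].step(params[lo:hi], grads[lo:hi])
```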