Paper:

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey (Duan et al.)

Quick Links:

Chip-to-Chip Communication: data transfer between AI accelerators within a single node (e.g., over NVLink or PCIe)

Node-to-Node Communication: data transfer between nodes over the cluster network (e.g., InfiniBand or Ethernet); see the sketch after this list

Transformer-Based LLMs
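To make the chip-to-chip vs. node-to-node distinction concrete, here is a minimal sketch assuming a PyTorch + NCCL setup launched with torchrun (the script name and launch sizes are illustrative, not from the survey). The same all-reduce call is routed by NCCL over intra-node links between GPUs in one machine and over the network between machines:

```python
# Minimal all-reduce sketch; assumes PyTorch with the NCCL backend,
# launched via torchrun, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 allreduce_sketch.py
# NCCL transparently uses chip-to-chip links (NVLink/PCIe) between GPUs
# in the same node and node-to-node links (InfiniBand/Ethernet) across nodes.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, etc. in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One gradient-sized tensor per rank; all-reduce sums it across all GPUs,
    # exercising both communication paths in a multi-node launch.
    x = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: x[0] = {x[0].item()} (== world size)")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```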

Background:


LLM Training Workloads