- Traditionally: PCI Express (PCIe) ⇒ tree topology with ~1 GB/s per lane (PCIe 3.0)
- limited in bandwidth, latency, and scalability
- Trend: NVLink → higher bandwidth and lower latency
- uses various topologies (cube-mesh, fully-connected, 3D-torus)
- employs shared memory models, specialized communication protocols, and synchronization mechanisms
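
As a quick way to see which direct GPU-to-GPU paths a given machine actually exposes, here is a minimal sketch using the CUDA runtime API that checks and enables peer access between device pairs. Whether a supported pair goes over NVLink or PCIe depends on the box's topology (e.g. `nvidia-smi topo -m` reports it); this sketch only queries capability.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    // Check which GPU pairs can talk directly (over NVLink or PCIe P2P)
    // and enable peer access where supported.
    for (int src = 0; src < n; ++src) {
        cudaSetDevice(src);
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            printf("GPU %d -> GPU %d: P2P %s\n", src, dst,
                   canAccess ? "supported" : "not supported");
            if (canAccess) {
                // Maps the peer's memory into this device's address space;
                // later copies between the two GPUs can bypass host memory.
                cudaDeviceEnablePeerAccess(dst, 0);
            }
        }
    }
    return 0;
}
```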


- Cube-Mesh Topology: NVLink-1.0 gives each GPU 4 links at 40 GB/s each, i.e. 160 GB/s aggregate BW
- planar-mesh for 4 GPUs
- cube-mesh topology for 8 GPUs
- Fully-Connected Topology: Switch-based / P2P-based
- Switch-based: NVSwitch 1.0~3.0 (provides 300, 600, and 900 GB/s per GPU, respectively)
- P2P-based: each GPU pair is served only by its direct link(s), so the bottleneck is the bandwidth of the directly connected link (see the first sketch after this list)
- 2D/3D Torus: Used by Google’s TPU system
- 2D Torus: each chip connected to its 4 nearest neighbors (with wraparound links at the grid edges)
- 3D Torus: 6 connections per chip, forming a 3D cube with wraparound (see the torus sketch after this list)
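
To make the P2P bottleneck concrete, a back-of-the-envelope sketch: in a fully-connected P2P topology each GPU must split its links across the other N-1 GPUs, so a single pair only sees a fraction of the aggregate bandwidth. The GPU count, link count, and per-link bandwidth below are illustrative assumptions, not figures from the text.

```cpp
#include <cstdio>

// Back-of-the-envelope pairwise bandwidth in a P2P fully-connected topology.
// Illustrative assumption: 4 GPUs, 6 links per GPU, 50 GB/s per link.
int main() {
    const int    gpus        = 4;
    const int    linksPerGpu = 6;
    const double gbPerLink   = 50.0;   // GB/s per link (assumed)

    // Fully connected: each GPU splits its links across the other N-1 GPUs,
    // so a single pair is served by only linksPerGpu / (gpus - 1) links.
    int linksPerPair = linksPerGpu / (gpus - 1);
    printf("links per GPU pair: %d\n", linksPerPair);
    printf("pairwise bandwidth: %.0f GB/s (vs. %.0f GB/s aggregate per GPU)\n",
           linksPerPair * gbPerLink, linksPerGpu * gbPerLink);
    return 0;
}
```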
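
And a small sketch of the torus wiring, assuming a square/cubic grid with wraparound: enumerating ±1 steps along each axis yields 4 neighbors per chip in 2D and 6 in 3D.

```cpp
#include <cstdio>

// Print the neighbors of one chip in a k-dimensional torus (wraparound grid).
// 2D -> 4 neighbors per chip, 3D -> 6 neighbors per chip.
void torusNeighbors(int dims, int side, const int *coord) {
    int count = 0;
    for (int d = 0; d < dims; ++d) {
        for (int step = -1; step <= 1; step += 2) {
            // Copy the coordinate and move +/-1 along axis d with wraparound.
            int nbr[3] = {0, 0, 0};
            for (int i = 0; i < dims; ++i) nbr[i] = coord[i];
            nbr[d] = (coord[d] + step + side) % side;
            printf("  neighbor: (%d, %d, %d)\n", nbr[0], nbr[1], nbr[2]);
            ++count;
        }
    }
    printf("%dD torus: %d neighbors per chip\n", dims, count);
}

int main() {
    int origin[3] = {0, 0, 0};
    torusNeighbors(2, 4, origin);   // 2D, 4x4 torus: 4 neighbors
    torusNeighbors(3, 4, origin);   // 3D, 4x4x4 torus: 6 neighbors
    return 0;
}
```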