- Traditionally: PCI Express (PCIe) ⇒ tree topology with ~1 GB/s per lane (PCIe 3.0)
- limited in bandwidth, latency, and scalability
- Trend: NVLink → higher bandwidth and lower latency
- uses various topologies (cube-mesh, fully-connected, 3D-torus)
- employs shared memory models, specialized communication protocols, and synchronization mechanisms
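
As a quick way to see which direct GPU-to-GPU paths a given machine actually exposes, here is a minimal sketch using the CUDA runtime API that checks and enables peer access between device pairs. Whether a supported pair goes over NVLink or PCIe depends on the box's topology (e.g. `nvidia-smi topo -m` reports it); this sketch only queries capability.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    // Check which GPU pairs can talk directly (over NVLink or PCIe P2P)
    // and enable peer access where supported.
    for (int src = 0; src < n; ++src) {
        cudaSetDevice(src);
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            printf("GPU %d -> GPU %d: P2P %s\n", src, dst,
                   canAccess ? "supported" : "not supported");
            if (canAccess) {
                // Maps the peer's memory into this device's address space;
                // later copies between the two GPUs can bypass host memory.
                cudaDeviceEnablePeerAccess(dst, 0);
            }
        }
    }
    return 0;
}
```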


- Cube-Mesh Topology: NVLink-1.0 gives each GPU 4 links at 40 GB/s each, i.e. 160 GB/s aggregate BW
- planar-mesh for 4 GPUs
- cube-mesh topology for 8 GPUs
- Fully-Connected Topology: Switch-based / P2P-based
- Switch-based: NVSwitch 1.0~3.0 (provides 300, 600, and 900 GB/s per GPU, respectively)
- P2P-based: each GPU pair is served only by its direct link(s), so the bottleneck is the bandwidth of the directly connected link (see the first sketch after this list)
- 2D/3D Torus: Used by Google’s TPU system
- 2D Torus: each chip connected to its 4 nearest neighbors (with wraparound links at the grid edges)
- 3D Torus: 6 connections per chip, forming a 3D cube with wraparound (see the torus sketch after this list)
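
To make the P2P bottleneck concrete, a back-of-the-envelope sketch: in a fully-connected P2P topology each GPU must split its links across the other N-1 GPUs, so a single pair only sees a fraction of the aggregate bandwidth. The GPU count, link count, and per-link bandwidth below are illustrative assumptions, not figures from the text.

```cpp
#include <cstdio>

// Back-of-the-envelope pairwise bandwidth in a P2P fully-connected topology.
// Illustrative assumption: 4 GPUs, 6 links per GPU, 50 GB/s per link.
int main() {
    const int    gpus        = 4;
    const int    linksPerGpu = 6;
    const double gbPerLink   = 50.0;   // GB/s per link (assumed)

    // Fully connected: each GPU splits its links across the other N-1 GPUs,
    // so a single pair is served by only linksPerGpu / (gpus - 1) links.
    int linksPerPair = linksPerGpu / (gpus - 1);
    printf("links per GPU pair: %d\n", linksPerPair);
    printf("pairwise bandwidth: %.0f GB/s (vs. %.0f GB/s aggregate per GPU)\n",
           linksPerPair * gbPerLink, linksPerGpu * gbPerLink);
    return 0;
}
```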
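
And a small sketch of the torus wiring, assuming a square/cubic grid with wraparound: enumerating ±1 steps along each axis yields 4 neighbors per chip in 2D and 6 in 3D.

```cpp
#include <cstdio>

// Print the neighbors of one chip in a k-dimensional torus (wraparound grid).
// 2D -> 4 neighbors per chip, 3D -> 6 neighbors per chip.
void torusNeighbors(int dims, int side, const int *coord) {
    int count = 0;
    for (int d = 0; d < dims; ++d) {
        for (int step = -1; step <= 1; step += 2) {
            // Copy the coordinate and move +/-1 along axis d with wraparound.
            int nbr[3] = {0, 0, 0};
            for (int i = 0; i < dims; ++i) nbr[i] = coord[i];
            nbr[d] = (coord[d] + step + side) % side;
            printf("  neighbor: (%d, %d, %d)\n", nbr[0], nbr[1], nbr[2]);
            ++count;
        }
    }
    printf("%dD torus: %d neighbors per chip\n", dims, count);
}

int main() {
    int origin[3] = {0, 0, 0};
    torusNeighbors(2, 4, origin);   // 2D, 4x4 torus: 4 neighbors
    torusNeighbors(3, 4, origin);   // 3D, 4x4x4 torus: 6 neighbors
    return 0;
}
```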