congestionCTL_Astrasim.pdf
Main Point: As DNN models and datasets keep growing, distributed training has become a common way to train DNN models. RDMA over Converged Ethernet (RoCE), which is commonly deployed in datacenters, is therefore widely used for distributed training. Because Ethernet is not inherently lossless, congestion control techniques have been proposed to improve RoCE performance in distributed training scenarios. This paper simulates and compares several of these congestion control techniques.


Methodology: ASTRA-sim simulates the high-level DNN operations on top of NS3, which captures the network characteristics of a two-level Clos topology.
- Assumptions:
- There is a constant need to train the network with newly generated training data
→ Consequence: each node performs a single job to maximize performance
New Concepts:
- PFC (Priority Flow Control): congestion control mechanism in which the receiver sends a PAUSE frame to the sender once its receive-side buffer fills up. The sender stops sending until the receiver notifies it to resume
- DCQCN: congestion control mechanism that uses the ECN mark in the packet header to tell the sender to reduce its sending rate
- DCTCP: congestion control mechanism that also relies on ECN; switches set the ECN flag when queue occupancy exceeds a threshold
- Note: the sender changes its window size by a reduction factor determined by the proportion of marked packets
- TIMELY: measures RTT and adjusts the sending rate by comparing it against the T_low and T_high thresholds
- High Precision Congestion Control (HPCC): uses in-band network telemetry (INT) to adjust the window size
- e.g. timestamp, queue length, etc. are carried in the packet header
- HPCC-PINT: improves upon HPCC, which adds 48 bytes of overhead, by using only 8 bits; how the bits are used depends on a parameter that determines the need for congestion control
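The DCTCP and TIMELY notes above can be made concrete with a small sketch. This is not code from the paper or from ASTRA-sim/NS3; it is a minimal illustration based on the published DCTCP update rule (running estimate of the marked fraction, window cut by half of that estimate) and on the threshold part of TIMELY only. The names `DctcpState`, `dctcp_update`, and `timely_update`, and the default values for `g`, `delta`, and `beta`, are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DctcpState:
    cwnd: float         # congestion window, in packets
    alpha: float = 0.0  # running estimate of the fraction of ECN-marked packets


def dctcp_update(state: DctcpState, acked: int, marked: int,
                 g: float = 1 / 16) -> DctcpState:
    """Apply one per-window DCTCP update (simplified sketch).

    acked  : packets acknowledged in the last window
    marked : how many of those carried an ECN mark
    g      : EWMA gain for alpha (illustrative default)
    """
    frac = marked / acked if acked else 0.0
    # alpha tracks the proportion of marked packets over time.
    alpha = (1 - g) * state.alpha + g * frac
    if marked > 0:
        # Reduction factor is alpha / 2: few marks give a small cut,
        # all packets marked gives a halving.
        cwnd = max(1.0, state.cwnd * (1 - alpha / 2))
    else:
        # No congestion signal: additive increase by one packet.
        cwnd = state.cwnd + 1.0
    return DctcpState(cwnd=cwnd, alpha=alpha)


def timely_update(rate: float, rtt: float, t_low: float, t_high: float,
                  delta: float = 10.0, beta: float = 0.8) -> float:
    """Threshold part of TIMELY only (simplified sketch).

    The RTT-gradient step that TIMELY applies between T_low and T_high
    is omitted here; in that band this sketch leaves the rate unchanged.
    delta and beta are illustrative values.
    """
    if rtt < t_low:
        # RTT is low: additive increase.
        return rate + delta
    if rtt > t_high:
        # RTT is too high: multiplicative decrease scaled by the overshoot.
        return rate * (1 - beta * (1 - t_high / rtt))
    return rate
```

For example, `dctcp_update(DctcpState(cwnd=100.0, alpha=1.0), acked=10, marked=10)` halves the window, while a window with no marks grows it by one packet. The two functions also show the main difference between the schemes in these notes: DCTCP reacts to switch-set ECN marks, whereas TIMELY reacts to end-to-end RTT measurements.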
Most of the paper discusses results. Check the paper for details.