Questions:

  1. In the design proposal, we use small-scale clusters to define the computations and operations before scaling them up to large-scale clusters. I understand that this ‘scale-up’ is a way to avoid using large clusters for simulations, but would this solution still have the ‘agility’ the paper claims to provide? If we run large models intended for large-scale clusters on small-scale clusters instead, wouldn’t that be very time-consuming, memory-hungry, and power-hungry?
  2. I need an explanation of where the network variables (e.g., network topology, parallelism strategies, collective algorithms) come into play in the diagram. I am not sure how the profiling happens in the Pattern Recorder step or how it relates to those network variables.

Problem with Existing Solutions: New network stacks are being tailored for distributed ML training, including advances in congestion control algorithms, load balancing, and network topologies. Modern frameworks that aim to generate traces for evaluating them are limited in the following ways:

  1. Benchmark suites that focus only on collective operations

    → do not accurately model ML workloads, which also perform computation (see the sketch after this list)

    → require a physical cluster

  2. Simulators that require manual modeling of distributed ML workloads to generate realistic traffic patterns

    → replaying execution traces (ETs) avoids the manual modeling, but still requires a physical cluster
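
To make the compute/communication gap concrete, here is a minimal sketch of what a replayed execution trace captures that a collectives-only benchmark misses. The node structure and every field name are hypothetical, not taken from the paper or any real ET format:

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    COMPUTE = "compute"  # e.g., a GPU matmul kernel
    COMM = "comm"        # e.g., an all-reduce on a gradient bucket

@dataclass
class ETNode:
    node_id: int
    node_type: NodeType
    duration_us: float                             # measured kernel time for compute nodes
    comm_bytes: int = 0                            # payload size; 0 for compute nodes
    deps: list[int] = field(default_factory=list)  # node_ids that must finish first

# One backward-pass fragment: a matmul gates when its gradient all-reduce may
# start. A collectives-only benchmark replays node 1 without the 850 us of
# compute in front of it, distorting the traffic pattern seen on the wire.
step = [
    ETNode(0, NodeType.COMPUTE, duration_us=850.0),
    ETNode(1, NodeType.COMM, duration_us=0.0, comm_bytes=64 << 20, deps=[0]),
]
```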

Goal: Network trace generator for distributed ML training that automatically models traffic patterns based on input workload.

Specific Target Problems:

  1. Limited Access to Clusters - not everyone proposing a new network stack has access to clusters to evaluate it on
    1. Goal 1: Make network simulations independent of cluster hardware
  2. Plethora of Variables in the Network Space
    1. Goal 2: Generalize the trace generation to accommodate various topologies / parallelism techniques / congestion control techniques (see the spec sketch after this list)
  3. Need for Accuracy
    1. Goal 3: Collect as much information as possible from the ML workload to accurately describe its operators, yielding a reliable, high-fidelity network modeling system
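
As a way to picture Goal 2, here is a hedged sketch of a workload specification whose fields parameterize trace generation over the variables above. Every field name is illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class WorkloadSpec:
    num_ranks: int          # target (scaled-up) cluster size
    topology: str           # e.g., "fat-tree", "torus", "rail-optimized"
    data_parallel: int      # degree of each parallelism dimension
    tensor_parallel: int
    pipeline_parallel: int
    collective_algo: str    # e.g., "ring", "tree", "halving-doubling"

spec = WorkloadSpec(
    num_ranks=1024, topology="fat-tree",
    data_parallel=64, tensor_parallel=8, pipeline_parallel=2,
    collective_algo="ring",
)
# The parallelism degrees must multiply out to the rank count.
assert spec.data_parallel * spec.tensor_parallel * spec.pipeline_parallel == spec.num_ranks
```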

Methodology:

[Figure: overview diagram of the trace-generation pipeline]

  1. Workload Profiler: extract and scale a distributed ML workload specification from a given distributed ML job → removes the need for physical clusters
  2. Pattern Recorder: convert workload specifications into network-simulator-compatible traffic patterns → encodes dependencies between messages to represent an arbitrary collective algorithm (see the sketch after this list)
  3. Pattern Sim. Translator: bridges the patterns to various network simulators
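
To illustrate what "encodes dependencies between messages" could mean in practice, here is a minimal sketch, assuming a hypothetical message format rather than the paper's actual output, that lowers a ring all-reduce into point-to-point messages whose dependency edges capture the algorithm's structure:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    msg_id: int
    src: int
    dst: int
    size_bytes: int
    deps: list[int] = field(default_factory=list)  # msg_ids that must complete first

def ring_all_reduce(num_ranks: int, payload_bytes: int) -> list[Message]:
    """Lower a ring all-reduce into 2*(N-1) rounds of neighbor-to-neighbor sends."""
    chunk = payload_bytes // num_ranks
    msgs: list[Message] = []
    prev_round: dict[int, int] = {}  # rank -> msg_id of the chunk it received last round
    for _ in range(2 * (num_ranks - 1)):  # reduce-scatter phase, then all-gather phase
        this_round: dict[int, int] = {}
        for r in range(num_ranks):
            m = Message(len(msgs), src=r, dst=(r + 1) % num_ranks, size_bytes=chunk)
            if r in prev_round:               # a rank forwards only after last
                m.deps.append(prev_round[r])  # round's chunk has arrived
            msgs.append(m)
            this_round[m.dst] = m.msg_id
        prev_round = this_round
    return msgs

trace = ring_all_reduce(num_ranks=4, payload_bytes=64 << 20)
print(len(trace))  # 24 point-to-point messages: 4 ranks x 6 rounds
```

Because the deps field is just a list of message IDs, the same representation can express tree, halving-doubling, or any other collective schedule without changing the simulator-facing format.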