Background

Problems of Modern FPGAs as SmartNICs:

Portability Challenges for Implementing High-Performance Collective Abstractions:

FPGA application relies on CPU network stack for inter-FPGA data movement

FPGA CCL enables direct networking between FPGAs

CCL implementation for an FPGA application with shared virtual memory

CCL’s role as an offload engine for a CPU application with shared virtual memory
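A CCL's core job, whether invoked from an FPGA kernel or used as an offload engine by a CPU application, is to execute collectives such as allreduce over the peers' buffers. As an illustration only (a software simulation, not the implementation of any system discussed here), a minimal sketch of the classic ring allreduce pattern that such libraries commonly implement:

```python
def ring_allreduce(buffers):
    """Simulate sum-allreduce over a logical ring of peers.

    buffers: one equal-length list of numbers per peer (e.g. per FPGA).
    Returns the reduced buffer that every peer holds afterwards.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer must split evenly into n chunks"
    c = size // n
    data = [list(b) for b in buffers]  # per-peer working buffers

    def chunk(p, k):
        k %= n
        return data[p][k * c:(k + 1) * c]

    def set_chunk(p, k, vals):
        k %= n
        data[p][k * c:(k + 1) * c] = vals

    # Phase 1: reduce-scatter. At each step, every peer forwards one
    # partial-sum chunk to its ring successor; after n-1 steps, peer p
    # holds the fully reduced chunk (p + 1) mod n.
    for s in range(n - 1):
        for p in range(n):
            recv = chunk((p - 1) % n, p - 1 - s)  # from ring predecessor
            own = chunk(p, p - 1 - s)
            set_chunk(p, p - 1 - s, [a + b for a, b in zip(own, recv)])

    # Phase 2: allgather. The fully reduced chunks circulate around the
    # ring until every peer holds the complete result.
    for s in range(n - 1):
        for p in range(n):
            set_chunk(p, p - s, chunk((p - 1) % n, p - s))

    return data[0]
```

Each peer only ever communicates with its two ring neighbors, which is why this pattern maps well onto direct FPGA-to-FPGA links without CPU mediation.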

Related Work

  1. Direct memory transfer between GPU memories through an RDMA NIC already exists
    1. not directly applicable to FPGA-based collectives, because FPGAs differ:
      1. they can connect directly to the network
      2. they do not need an external NIC
  2. Galapagos & EasyNet: provide in-FPGA communication stacks for data exchange within a cluster
  3. TMD-MPI: orchestrates in-FPGA collectives using embedded processors
    1. bottleneck: collective control executes sequentially on low-frequency embedded FPGA microprocessors
  4. Collective offload with NetFPGA
    1. problem: static collective offload engines limit flexibility
    2. rely on software-defined network switches for orchestration
  5. SMI: streaming message-passing model
    1. expose streaming collective interface to FPGA kernels
    2. SMI enables kernels to initiate collectives directly
    3. Problem: employs dedicated FPGA logic for collective control
      1. limits flexibility for post-synthesis reconfiguration
  6. ACCL: focuses on message-passing collectives for FPGA applications
    1. coordination of collectives requires CPU involvement
    2. lacks streaming support
  7. BluesMPI: offloads collective operations to a SmartNIC, demonstrating communication latency comparable to host-based collectives
    1. Problem: does not target accelerator applications