Background

Problems of Modern FPGAs as SmartNICs:

Portability Challenges for Implementing High-Performance Collective Abstractions:

FPGA application relies on CPU network stack for inter-FPGA data movement

FPGA CCL enables direct networking between FPGAs

CCL implementation for an FPGA application with shared virtual memory

CCL’s role as an offload engine for a CPU application with shared virtual memory
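A CCL's core job, whether invoked from an FPGA kernel or used as an offload engine by a CPU application, is to execute collectives such as allreduce over the peers' buffers. As an illustration only (a software simulation, not the implementation of any system discussed here), a minimal sketch of the classic ring allreduce pattern that such libraries commonly implement:

```python
def ring_allreduce(buffers):
    """Simulate sum-allreduce over a logical ring of peers.

    buffers: one equal-length list of numbers per peer (e.g. per FPGA).
    Returns the reduced buffer that every peer holds afterwards.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer must split evenly into n chunks"
    c = size // n
    data = [list(b) for b in buffers]  # per-peer working buffers

    def chunk(p, k):
        k %= n
        return data[p][k * c:(k + 1) * c]

    def set_chunk(p, k, vals):
        k %= n
        data[p][k * c:(k + 1) * c] = vals

    # Phase 1: reduce-scatter. At each step, every peer forwards one
    # partial-sum chunk to its ring successor; after n-1 steps, peer p
    # holds the fully reduced chunk (p + 1) mod n.
    for s in range(n - 1):
        for p in range(n):
            recv = chunk((p - 1) % n, p - 1 - s)  # from ring predecessor
            own = chunk(p, p - 1 - s)
            set_chunk(p, p - 1 - s, [a + b for a, b in zip(own, recv)])

    # Phase 2: allgather. The fully reduced chunks circulate around the
    # ring until every peer holds the complete result.
    for s in range(n - 1):
        for p in range(n):
            set_chunk(p, p - s, chunk((p - 1) % n, p - s))

    return data[0]
```

Each peer only ever communicates with its two ring neighbors, which is why this pattern maps well onto direct FPGA-to-FPGA links without CPU mediation.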

Related Work

  1. Direct memory transfer between GPU memories through an RDMA NIC already exists
    1. not directly applicable to FPGA-based collectives, because FPGAs differ:
      1. they can connect directly to the network
      2. they do not need an external NIC
  2. Galapagos & EasyNet: provide in-FPGA communication stacks for data exchange within a cluster
  3. TMD-MPI: orchestrates in-FPGA collectives using embedded processors
    1. bottleneck: collective control executes sequentially on low-frequency embedded FPGA microprocessors
  4. Collective offload with NetFPGA
    1. problem: static collective offload engines limit flexibility
    2. rely on software-defined network switches for orchestration
  5. SMI: streaming message-passing model
    1. expose streaming collective interface to FPGA kernels
    2. SMI enables kernels to initiate collectives directly
    3. Problem: employs dedicated FPGA logic for collective control
      1. limits flexibility for post-synthesis reconfiguration
  6. ACCL: focuses on message-passing collectives for FPGA applications
    1. coordination of collectives requires CPU involvement
    2. lacks streaming support
  7. BluesMPI: offloads collective operations to a SmartNIC, demonstrating communication latency comparable to host-based collectives
    1. Problem: does not target accelerator applications