Background
Problems of Modern FPGAs as SmartNICs:
- Difficult to use FPGAs in distributed applications:
- Network stack on the FPGA is not compatible with data center infrastructure.
- Data center infrastructure requires complex network transfers
- FPGAs need higher-level abstractions (collective communication) for complex communication patterns
- Previous attempts: offloading the network stack to the FPGA
- Problem: distributed applications often require high-level communication patterns such as broadcasts, reductions, and barriers. These patterns involve coordination and data-movement strategies that go beyond basic point-to-point communication, so offloading the network stack alone does not provide them.
- Problem: without high-level abstractions, developers must implement these communication patterns by hand, increasing complexity and the potential for errors.
- Summary: need a high-level abstraction that can synchronize and customize data movement (see the MPI sketch after this list)
- These problems make FPGAs rely on the CPU for collective networking
- significantly increases the latency of data transfers between FPGAs
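For reference, the collective abstractions referred to above (broadcast, reduction, barrier) are exactly what MPI already provides to CPU applications. A minimal sketch using the standard MPI C API, run with an ordinary MPI toolchain:

```cpp
// Minimal example of the high-level collectives FPGAs currently lack in-fabric:
// broadcast, reduction, and barrier via the standard MPI C API on CPUs.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    std::vector<float> data(n, rank == 0 ? 1.0f : 0.0f);
    std::vector<float> sum(n, 0.0f);

    // Broadcast: rank 0 distributes its buffer to every rank.
    MPI_Bcast(data.data(), n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Reduction: element-wise sum across ranks, result delivered to rank 0.
    MPI_Reduce(data.data(), sum.data(), n, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    // Barrier: all ranks synchronize before continuing.
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) std::printf("sum[0] = %f\n", sum[0]);
    MPI_Finalize();
    return 0;
}
```

Without an FPGA-side equivalent of these calls, each such pattern has to be routed through the CPU or rebuilt manually from point-to-point transfers.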
Portability Challenges for Implementing High-Performance Collective Abstractions:
- FPGA application relies on the CPU network stack for inter-FPGA data movement
- FPGA CCL enables direct networking between FPGAs
- CCL implementation for an FPGA application with shared virtual memory
- CCL's role as an offload engine for a CPU application with shared virtual memory (sketched below)
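To make the last scenario concrete, here is a hedged host-side sketch of a CCL acting as an offload engine for a CPU application with shared virtual memory. The `ccl` namespace and function names are hypothetical stand-ins, not a real library API; the point is only the division of labor.

```cpp
// Hypothetical host-side view of an FPGA CCL used as an offload engine.
// All names are illustrative assumptions, not an actual library interface.
#include <cstddef>
#include <vector>

namespace ccl {
struct Communicator { int rank = 0; int size = 1; };  // placeholder rank/size info

// In a real CCL this would write a command descriptor to the FPGA and wait for
// completion; the data would move FPGA-to-FPGA over the network without being
// staged through CPU memory. Here it is a no-op stub for illustration.
inline void all_reduce(Communicator& /*comm*/, float* /*buf*/, std::size_t /*count*/) {}
}  // namespace ccl

int main() {
    ccl::Communicator comm;
    std::vector<float> gradients(1 << 20, 1.0f);  // buffer visible to CPU and FPGA
    // The CPU only issues the collective; the FPGA engine performs the data movement.
    ccl::all_reduce(comm, gradients.data(), gradients.size());
    return 0;
}
```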
- Communication Models
- Message Passing (MPI): communication occurs between buffers in memory
- Streaming: communication occurs through continuous data streams
- FPGA kernels support direct streaming interfaces
- data is pushed directly into the streaming interface in a pipelined fashion during processing (see the HLS sketch after this list)
- Problem: existing streaming communication frameworks often lack transport protocols and collective abstractions
- Flexibility: Support for diverse communication protocols
- For application-specific needs
- Ensures interoperability in heterogeneous environments where FPGAs coexist with CPUs and other accelerators
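A minimal sketch of the streaming model from the kernel side, assuming a Vitis HLS-style toolchain (`hls::stream`); the kernel and stream names are illustrative. Each processed element is pushed straight into the outgoing communication stream rather than first materialized in a memory buffer, as it would be in the message-passing model.

```cpp
// Streaming model: results leave the kernel element by element, pipelined,
// directly into a stream that a network/communication stack can consume.
#include <hls_stream.h>

void scale_and_send(hls::stream<float>& data_in,
                    hls::stream<float>& comm_out,  // consumed by the network stack
                    int count, float factor) {
    for (int i = 0; i < count; ++i) {
#pragma HLS PIPELINE II=1
        float v = data_in.read();    // one element per loop iteration
        comm_out.write(v * factor);  // pushed directly into the communication stream
    }
}
```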
Related Work
- Direct memory transfer between GPU memories through an RDMA NIC already exists
- not directly applicable to FPGA-based collectives; FPGAs are different:
- can connect directly to the network
- do not need an external NIC on the FPGA
- Galapagos & EasyNet: provide in-FPGA communication stacks for data exchange within a cluster
- TMD-MPI: orchestrates in-FPGA collectives using embedded processors
- bottleneck: collective control runs sequentially on low-frequency FPGA-embedded microprocessors
- Collective offload with NetFPGA
- problem: static collective offload engines limit flexibility
- rely on software-defined network switches for orchestration
- SMI: streaming message-passing model
- exposes a streaming collective interface to FPGA kernels
- SMI enables kernels to initiate collectives directly (see the sketch at the end of this section)
- Problem: employs dedicated FPGA logic for collective control
- limits flexibility for post-synthesis reconfiguration
- ACCL: focuses on message-passing collectives for FPGA applications
- coordination of collectives requires CPU involvement
- lacks streaming support
- BluesMPI: offloads collective operations to a SmartNIC, demonstrating communication latency comparable to host-based collectives
- Problem: does not target accelerator applications
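To illustrate what a kernel-initiated streaming collective in the spirit of SMI looks like, here is a hedged sketch of a kernel feeding a broadcast directly from its datapath; the `smi_*` names are illustrative assumptions and not SMI's actual API.

```cpp
// Kernel-initiated streaming collective in the spirit of SMI: the FPGA kernel
// itself starts the collective and streams its payload into it, with no CPU
// in the control path. Names are hypothetical, for illustration only.
#include <hls_stream.h>

struct smi_bcast_channel {            // hypothetical handle to an in-flight broadcast
    hls::stream<float>& to_network;   // outgoing element stream toward other ranks
};

// Root side of a broadcast: push local elements into the collective.
// Non-root ranks would read from a matching incoming stream (omitted here).
void bcast_root(smi_bcast_channel chan, hls::stream<float>& local, int count) {
    for (int i = 0; i < count; ++i) {
#pragma HLS PIPELINE II=1
        chan.to_network.write(local.read());  // collective fed directly by the kernel
    }
}
```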