https://github.com/crazyguitar/libefaxx
# Libefaxx (AWS EFA Benchmark for GPU/CPU)
High-performance inter-node communication over AWS Elastic Fabric Adapter (EFA)
is a key enabler for scaling large-language-model (LLM) training efficiently.
Existing benchmarking tools primarily focus on collective communication libraries
such as [NCCL](https://github.com/NVIDIA/nccl) or [NVSHMEM](https://github.com/NVIDIA/nvshmem),
making it difficult to isolate and understand the raw performance characteristics
of EFA itself. At the same time, [GPU-Initiated Networking](https://arxiv.org/pdf/2511.15076) (GIN)
has gained significant attention following the release of [DeepEP](https://github.com/deepseek-ai/DeepEP),
which demonstrated substantial MoE performance gains by enabling GPU-driven
communication.
This repository provides a focused benchmarking framework for EFA, designed to
analyze low-level inter-node communication performance. It complements existing
tools such as [nccl-tests](https://github.com/NVIDIA/nccl-tests) by enabling
direct measurement of EFA latency, bandwidth, and GIN behavior, helping
engineers and researchers optimize distributed training pipelines on AWS. An
example evaluation on p5.48xlarge instances is available [here](experiments).
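For reference, point-to-point benchmarks like these conventionally derive their two headline numbers, average latency and achieved bandwidth, from nothing more than the message size, the iteration count, and the measured elapsed time. The sketch below shows that arithmetic; it is illustrative only and does not reproduce this repository's reporting code.

```cpp
#include <cstddef>
#include <cstdio>

// Illustrative only: turn a measured elapsed time into the two numbers a
// point-to-point benchmark typically reports. Real benchmarks usually add
// warm-up iterations and may report percentiles instead of the mean.
static void Report(std::size_t bytes, int iterations, double seconds) {
  double avg_latency_us = seconds / iterations * 1e6;
  double bandwidth_gbps =
      static_cast<double>(bytes) * iterations / seconds / 1e9;  // GB/s
  std::printf("latency: %.2f us, bandwidth: %.2f GB/s\n", avg_latency_us,
              bandwidth_gbps);
}

int main() {
  // e.g. 100 iterations of a 128 KiB message that took 1.5 ms in total
  Report(128 << 10, 100, 1.5e-3);
  return 0;
}
```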
## Development
The following snippets show how to build the project and run a simple test.
To save time on environment setup and dependency management, this repository provides
a [Dockerfile](Dockerfile) that can be used to build the project in a consistent
and reproducible environment.
```bash
# build a Docker image
docker build -f Dockerfile -t cuda:latest .
# build examples
make build
```
If [enroot](https://github.com/NVIDIA/enroot) is available in your environment,
you can launch the experiment using the following commands:
```bash
# build an enroot squashfs (.sqsh) image
make sqush
# launch an interactive enroot environment
enroot create --name cuda cuda+latest.sqsh
enroot start --mount /fsx:/fsx cuda /bin/bash
# run a test via enroot on a Slurm cluster
srun -N 1 \
  --container-image "${PWD}/cuda+latest.sqsh" \
  --container-mounts /fsx:/fsx \
  --container-name cuda \
  --mpi=pmix \
  --ntasks-per-node=1 \
  "${PWD}/build/experiments/affinity/affinity"
```
## Example
When implementing custom algorithms directly over EFA, developers often face the
complexity of asynchronous RDMA APIs and event-driven scheduling. To simplify
this workflow, this repository includes a coroutine-based scheduler built on
[C++20 coroutines](https://en.cppreference.com/w/cpp/language/coroutines.html),
enabling a more straightforward programming model without manual callback management.
The example below shows how to build a proof of concept using plain [libfabric](https://github.com/ofiwg/libfabric/) and [MPI](https://www.open-mpi.org/).
```cpp
// Includes elided; see the repository sources for the exact header list.

// mpirun -np 2 --npernode 1 example
int main(int argc, char *argv[]) {
  size_t bufsize = 128 << 10;             // 128 KiB per buffer
  FabricBench peer;                       // wraps MPI bootstrap and libfabric/EFA setup
  peer.Exchange();                        // exchange peer information across ranks
  peer.Connect();                         // bring up the EFA endpoints
  int rank = peer.mpi.GetWorldRank();
  auto send = peer.Alloc(bufsize, rank);  // send buffer, filled with this rank's id
  auto recv = peer.Alloc(bufsize, -1);    // recv buffer, filled with a sentinel value
  peer.Handshake(send, recv);             // share RDMA buffer details with the peer
  auto verify = [](auto&, auto&) {};      // no-op verification callback
  auto result = peer.Bench("test", send, recv, PairBench{1}, verify, 100);  // 100 iterations
  (void)result;
  return 0;
}
```
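To give a feel for the style this scheduler enables, the following is a minimal, self-contained sketch of a C++20 coroutine driven by a tiny run queue. It is not this library's API: the `Task`, `Scheduler`, and `Schedule` types and the `PingPong` coroutine are hypothetical stand-ins; in the real experiments the equivalent awaits would suspend until a socket or RDMA completion event fires.

```cpp
#include <coroutine>
#include <cstdio>
#include <deque>

// A tiny single-threaded scheduler: awaiting `Schedule` parks the coroutine
// in a run queue; Run() drains the queue. This stands in for the event loop
// that, in a real benchmark, would resume coroutines on completion events.
struct Scheduler {
  std::deque<std::coroutine_handle<>> ready;
  void Run() {
    while (!ready.empty()) {
      auto h = ready.front();
      ready.pop_front();
      h.resume();
    }
  }
};

// Awaiter: suspend the current coroutine and hand it to the scheduler.
struct Schedule {
  Scheduler& sched;
  bool await_ready() const noexcept { return false; }
  void await_suspend(std::coroutine_handle<> h) { sched.ready.push_back(h); }
  void await_resume() const noexcept {}
};

// Fire-and-forget coroutine type; the library's actual task type will differ.
struct Task {
  struct promise_type {
    Task get_return_object() { return {}; }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
};

// Hypothetical benchmark coroutine: each step suspends until the scheduler
// resumes it, the way a real send/recv would suspend until completion.
Task PingPong(Scheduler& sched, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    co_await Schedule{sched};   // stand-in for: co_await peer.Send(buf)
    co_await Schedule{sched};   // stand-in for: co_await peer.Recv(buf)
  }
  std::printf("done: %d iterations\n", iterations);
}

int main() {
  Scheduler sched;
  PingPong(sched, 3);  // starts eagerly, suspends at the first co_await
  sched.Run();         // drive all pending coroutines to completion
  return 0;
}
```

The [Echo](experiments/echo) experiment shows the repository's actual coroutine-based scheduler in action.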
To learn how to use the library provided in this repository, please refer to the
following example experiments, which illustrate common usage patterns and benchmarking scenarios:
* [Affinity](experiments/affinity): Demonstrates how to query and enumerate GPU device information.
* [EFA](experiments/efa): Shows how to discover and inspect available EFA devices.
* [Echo](experiments/echo): Implements a simple TCP echo server/client to illustrate usage of the coroutine-based scheduler.
* [Bootstrap](experiments/bootstrap): Illustrates exchanging RDMA details via MPI communication.
* [Send/Recv](experiments/sendrecv): Benchmarks libfabric SEND/RECV operations over EFA.
* [Write](experiments/write): Benchmarks libfabric WRITE operations over EFA.
* [Alltoall](experiments/all2all): Benchmarks a simple all-to-all communication pattern over EFA.
* [Queue](experiments/queue): Benchmarks a multi-producer, single-consumer (MPSC) queue between GPU and CPU.
* [Proxy](experiments/proxy): Benchmarks GPU-initiated RDMA writes via a CPU proxy coroutine (see the simplified sketch after this list).
* [IPC](experiments/ipc): Benchmarks intra-node GPU-to-GPU communication via CUDA IPC.
* [Shmem](experiments/shmem): NVSHMEM-like API example demonstrating `shmem_*` interface over EFA.
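The Queue and Proxy experiments revolve around the structure that underpins GPU-initiated networking over EFA: producers enqueue small work descriptors into a queue shared between GPU and CPU, and a single CPU-side proxy drains the queue and posts the actual RDMA operations. The following host-only sketch shows that producer/consumer split using plain C++ threads; the `WriteDesc` layout, the mutex-based queue, and the `PostRdmaWrite` stand-in are illustrative and do not reflect this repository's data structures.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical work descriptor: what a producer might enqueue to request one
// RDMA write (remote address, length, ...). Layout is illustrative only.
struct WriteDesc {
  int producer;
  std::size_t bytes;
};

// Stand-in for posting an RDMA write through libfabric; a real proxy would
// issue the write and later reap its completion.
static void PostRdmaWrite(const WriteDesc& d) {
  std::printf("proxy: write %zu bytes for producer %d\n", d.bytes, d.producer);
}

int main() {
  std::mutex m;
  std::queue<WriteDesc> q;  // MPSC queue; mutex-based here for simplicity. The
                            // experiment's queue sits in memory shared between
                            // GPU and CPU.
  std::atomic<int> live_producers{4};

  // Producers stand in for GPU thread blocks issuing write requests.
  std::vector<std::thread> producers;
  for (int p = 0; p < 4; ++p) {
    producers.emplace_back([&, p] {
      for (int i = 0; i < 8; ++i) {
        std::lock_guard<std::mutex> lk(m);
        q.push(WriteDesc{p, 128u << 10});
      }
      live_producers.fetch_sub(1);
    });
  }

  // Single consumer: the CPU proxy loop (a coroutine in the real library).
  while (true) {
    WriteDesc d;
    {
      std::lock_guard<std::mutex> lk(m);
      if (q.empty()) {
        if (live_producers.load() == 0) break;  // all producers finished
        continue;                               // spin; a real proxy would poll or yield
      }
      d = q.front();
      q.pop();
    }
    PostRdmaWrite(d);
  }

  for (auto& t : producers) t.join();
  return 0;
}
```

In the actual experiments the producers are GPU thread blocks, the queue lives in memory visible to both GPU and CPU, and the consumer is the CPU proxy coroutine that issues the libfabric writes.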
## Citation
See [CITATION.cff](CITATION.cff) for machine-readable citation information.
### BibTeX
```bibtex
@software{tsai2026aws_efa_gpu_benchmark,
  title    = {AWS EFA GPU Benchmark},
  author   = {Tsai, Chang-Ning},
  year     = {2026},
  month    = {1},
  url      = {https://github.com/crazyguitar/Libefaxx},
  version  = {0.3.1},
  abstract = {High-performance RDMA communication experiments using CUDA and Amazon Elastic Fabric Adapter (EFA)},
  keywords = {RDMA, CUDA, EFA, High-Performance Computing, GPU Communication, Amazon EFA, Fabric, MPI}
}
```
### APA Style
Tsai, C.-N. (2026). *AWS EFA GPU Benchmark* (Version 0.3.1) [Computer software]. https://github.com/crazyguitar/Libefaxx
## References
1. Q. Le, "Libfabric EFA Series," 2024. [\[link\]](https://le.qun.ch/en/blog/2024/12/25/libfabric-efa-0-intro/)
2. K. Punniyamurthy et al., "Optimizing Distributed ML Communication," arXiv:2305.06942, 2023. [\[link\]](https://arxiv.org/pdf/2305.06942)
3. S. Liu et al., "GPU-Initiated Networking," arXiv:2511.15076, 2025. [\[link\]](https://arxiv.org/abs/2511.15076)
4. Netcan, "asyncio: C++20 coroutine library," GitHub. [\[link\]](https://github.com/netcan/asyncio)
5. UCCL Project, "UCCL: User-space Collective Communication Library," GitHub. [\[link\]](https://github.com/uccl-project/uccl)
6. Microsoft, "MSCCL++: Multi-Scale Collective Communication Library," GitHub. [\[link\]](https://github.com/microsoft/mscclpp)
7. DeepSeek-AI, "DeepEP: Expert parallelism with GPU-initiated communication," GitHub. [\[link\]](https://github.com/deepseek-ai/DeepEP)