https://github.com/crazyguitar/libefaxx
# Libefaxx (AWS EFA Benchmark for GPU/CPU)
High-performance inter-node communication over AWS Elastic Fabric Adapter (EFA)
is a key enabler for scaling large-language-model (LLM) training efficiently.
Existing benchmarking tools primarily focus on collective communication libraries
such as [NCCL](https://github.com/NVIDIA/nccl) or [NVSHMEM](https://github.com/NVIDIA/nvshmem),
making it difficult to isolate and understand the raw performance characteristics
of EFA itself. At the same time, [GPU-Initiated Networking](https://arxiv.org/pdf/2511.15076) (GIN)
has gained significant attention following the release of [DeepEP](https://github.com/deepseek-ai/DeepEP),
which demonstrated substantial MoE performance gains by enabling GPU-driven
communication.
This repository provides a focused benchmarking framework for EFA, designed to
analyze low-level inter-node communication performance. It complements existing
tools such as [nccl-tests](https://github.com/NVIDIA/nccl-tests) by enabling
direct measurement of EFA latency, bandwidth, and GIN behavior, helping
engineers and researchers optimize distributed training pipelines on AWS. An
example evaluation on p5.48xlarge instances is available [here](experiments).
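For reference, point-to-point benchmarks like these conventionally derive their two headline numbers, average latency and achieved bandwidth, from nothing more than the message size, the iteration count, and the measured elapsed time. The sketch below shows that arithmetic; it is illustrative only and does not reproduce this repository's reporting code.

```cpp
#include <cstddef>
#include <cstdio>

// Illustrative only: turn a measured elapsed time into the two numbers a
// point-to-point benchmark typically reports. Real benchmarks usually add
// warm-up iterations and may report percentiles instead of the mean.
static void Report(std::size_t bytes, int iterations, double seconds) {
  double avg_latency_us = seconds / iterations * 1e6;
  double bandwidth_gbps =
      static_cast<double>(bytes) * iterations / seconds / 1e9;  // GB/s
  std::printf("latency: %.2f us, bandwidth: %.2f GB/s\n", avg_latency_us,
              bandwidth_gbps);
}

int main() {
  // e.g. 100 iterations of a 128 KiB message that took 1.5 ms in total
  Report(128 << 10, 100, 1.5e-3);
  return 0;
}
```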
## Development
The following snippets show how to build the project and run a simple test.
To save time on environment setup and dependency management, this repository provides
a [Dockerfile](Dockerfile) that can be used to build the project in a consistent
and reproducible environment.
```bash
# build a Docker image
docker build -f Dockerfile -t cuda:latest .
# build examples
make build
```
If [enroot](https://github.com/NVIDIA/enroot) is available in your environment,
you can launch the experiment using the following commands:
```bash
# build an enroot squashfs (.sqsh) image
make sqush
# launch an interactive enroot environment
enroot create --name cuda cuda+latest.sqsh
enroot start --mount /fsx:/fsx cuda /bin/bash
# run a test via enroot on a Slurm cluster
srun -N 1 \
  --container-image "${PWD}/cuda+latest.sqsh" \
  --container-mounts /fsx:/fsx \
  --container-name cuda \
  --mpi=pmix \
  --ntasks-per-node=1 \
  "${PWD}/build/experiments/affinity/affinity"
```
## Example
When implementing custom algorithms directly over EFA, developers often face the
complexity of asynchronous RDMA APIs and event-driven scheduling. To simplify
this workflow, this repository includes a coroutine-based scheduler built on
[C++20 coroutines](https://en.cppreference.com/w/cpp/language/coroutines.html),
enabling a more straightforward programming model without manual callback management.
The example below shows how to build a proof of concept using plain [libfabric](https://github.com/ofiwg/libfabric/) and [MPI](https://www.open-mpi.org/).
```cpp
// Includes elided; see the repository sources for the exact header list.

// mpirun -np 2 --npernode 1 example
int main(int argc, char *argv[]) {
  size_t bufsize = 128 << 10;             // 128 KiB per buffer
  FabricBench peer;                       // wraps MPI bootstrap and libfabric/EFA setup
  peer.Exchange();                        // exchange peer information across ranks
  peer.Connect();                         // bring up the EFA endpoints
  int rank = peer.mpi.GetWorldRank();
  auto send = peer.Alloc(bufsize, rank);  // send buffer, filled with this rank's id
  auto recv = peer.Alloc(bufsize, -1);    // recv buffer, filled with a sentinel value
  peer.Handshake(send, recv);             // share RDMA buffer details with the peer
  auto verify = [](auto&, auto&) {};      // no-op verification callback
  auto result = peer.Bench("test", send, recv, PairBench{1}, verify, 100);  // 100 iterations
  (void)result;
  return 0;
}
```
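To give a feel for the style this scheduler enables, the following is a minimal, self-contained sketch of a C++20 coroutine driven by a tiny run queue. It is not this library's API: the `Task`, `Scheduler`, and `Schedule` types and the `PingPong` coroutine are hypothetical stand-ins; in the real experiments the equivalent awaits would suspend until a socket or RDMA completion event fires.

```cpp
#include <coroutine>
#include <cstdio>
#include <deque>

// A tiny single-threaded scheduler: awaiting `Schedule` parks the coroutine
// in a run queue; Run() drains the queue. This stands in for the event loop
// that, in a real benchmark, would resume coroutines on completion events.
struct Scheduler {
  std::deque<std::coroutine_handle<>> ready;
  void Run() {
    while (!ready.empty()) {
      auto h = ready.front();
      ready.pop_front();
      h.resume();
    }
  }
};

// Awaiter: suspend the current coroutine and hand it to the scheduler.
struct Schedule {
  Scheduler& sched;
  bool await_ready() const noexcept { return false; }
  void await_suspend(std::coroutine_handle<> h) { sched.ready.push_back(h); }
  void await_resume() const noexcept {}
};

// Fire-and-forget coroutine type; the library's actual task type will differ.
struct Task {
  struct promise_type {
    Task get_return_object() { return {}; }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
};

// Hypothetical benchmark coroutine: each step suspends until the scheduler
// resumes it, the way a real send/recv would suspend until completion.
Task PingPong(Scheduler& sched, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    co_await Schedule{sched};   // stand-in for: co_await peer.Send(buf)
    co_await Schedule{sched};   // stand-in for: co_await peer.Recv(buf)
  }
  std::printf("done: %d iterations\n", iterations);
}

int main() {
  Scheduler sched;
  PingPong(sched, 3);  // starts eagerly, suspends at the first co_await
  sched.Run();         // drive all pending coroutines to completion
  return 0;
}
```

The [Echo](experiments/echo) experiment shows the repository's actual coroutine-based scheduler in action.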
To learn how to use the library provided in this repository, please refer to the
following example experiments, which illustrate common usage patterns and benchmarking scenarios:
* [Affinity](experiments/affinity): Demonstrates how to query and enumerate GPU device information.
* [EFA](experiments/efa): Shows how to discover and inspect available EFA devices.
* [Echo](experiments/echo): Implements a simple TCP echo server/client to illustrate usage of the coroutine-based scheduler.
* [Bootstrap](experiments/bootstrap): Illustrates exchanging RDMA details via MPI communication.
* [Send/Recv](experiments/sendrecv): Benchmarks libfabric SEND/RECV operations over EFA.
* [Write](experiments/write): Benchmarks libfabric WRITE operations over EFA.
* [Alltoall](experiments/all2all): Benchmarks a simple all-to-all communication pattern over EFA.
* [Queue](experiments/queue): Benchmarks a multi-producer, single-consumer (MPSC) queue between GPU and CPU.
* [Proxy](experiments/proxy): Benchmarks GPU-initiated RDMA writes via a CPU proxy coroutine (see the simplified sketch after this list).
* [IPC](experiments/ipc): Benchmarks intra-node GPU-to-GPU communication via CUDA IPC.
* [Shmem](experiments/shmem): NVSHMEM-like API example demonstrating `shmem_*` interface over EFA.
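The Queue and Proxy experiments revolve around the structure that underpins GPU-initiated networking over EFA: producers enqueue small work descriptors into a queue shared between GPU and CPU, and a single CPU-side proxy drains the queue and posts the actual RDMA operations. The following host-only sketch shows that producer/consumer split using plain C++ threads; the `WriteDesc` layout, the mutex-based queue, and the `PostRdmaWrite` stand-in are illustrative and do not reflect this repository's data structures.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical work descriptor: what a producer might enqueue to request one
// RDMA write (remote address, length, ...). Layout is illustrative only.
struct WriteDesc {
  int producer;
  std::size_t bytes;
};

// Stand-in for posting an RDMA write through libfabric; a real proxy would
// issue the write and later reap its completion.
static void PostRdmaWrite(const WriteDesc& d) {
  std::printf("proxy: write %zu bytes for producer %d\n", d.bytes, d.producer);
}

int main() {
  std::mutex m;
  std::queue<WriteDesc> q;  // MPSC queue; mutex-based here for simplicity. The
                            // experiment's queue sits in memory shared between
                            // GPU and CPU.
  std::atomic<int> live_producers{4};

  // Producers stand in for GPU thread blocks issuing write requests.
  std::vector<std::thread> producers;
  for (int p = 0; p < 4; ++p) {
    producers.emplace_back([&, p] {
      for (int i = 0; i < 8; ++i) {
        std::lock_guard<std::mutex> lk(m);
        q.push(WriteDesc{p, 128u << 10});
      }
      live_producers.fetch_sub(1);
    });
  }

  // Single consumer: the CPU proxy loop (a coroutine in the real library).
  while (true) {
    WriteDesc d;
    {
      std::lock_guard<std::mutex> lk(m);
      if (q.empty()) {
        if (live_producers.load() == 0) break;  // all producers finished
        continue;                               // spin; a real proxy would poll or yield
      }
      d = q.front();
      q.pop();
    }
    PostRdmaWrite(d);
  }

  for (auto& t : producers) t.join();
  return 0;
}
```

In the actual experiments the producers are GPU thread blocks, the queue lives in memory visible to both GPU and CPU, and the consumer is the CPU proxy coroutine that issues the libfabric writes.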
## Citation
See [CITATION.cff](CITATION.cff) for machine-readable citation information.
### BibTeX
```bibtex
@software{tsai2026aws_efa_gpu_benchmark,
  title    = {AWS EFA GPU Benchmark},
  author   = {Tsai, Chang-Ning},
  year     = {2026},
  month    = {1},
  url      = {https://github.com/crazyguitar/Libefaxx},
  version  = {0.3.1},
  abstract = {High-performance RDMA communication experiments using CUDA and Amazon Elastic Fabric Adapter (EFA)},
  keywords = {RDMA, CUDA, EFA, High-Performance Computing, GPU Communication, Amazon EFA, Fabric, MPI}
}
```
### APA Style
Tsai, C.-N. (2026). *AWS EFA GPU Benchmark* (Version 0.3.1) [Computer software]. https://github.com/crazyguitar/Libefaxx
## References
1. Q. Le, "Libfabric EFA Series," 2024. [\[link\]](https://le.qun.ch/en/blog/2024/12/25/libfabric-efa-0-intro/)
2. K. Punniyamurthy et al., "Optimizing Distributed ML Communication," arXiv:2305.06942, 2023. [\[link\]](https://arxiv.org/pdf/2305.06942)
3. S. Liu et al., "GPU-Initiated Networking," arXiv:2511.15076, 2025. [\[link\]](https://arxiv.org/abs/2511.15076)
4. Netcan, "asyncio: C++20 coroutine library," GitHub. [\[link\]](https://github.com/netcan/asyncio)
5. UCCL Project, "UCCL: User-space Collective Communication Library," GitHub. [\[link\]](https://github.com/uccl-project/uccl)
6. Microsoft, "MSCCL++: Multi-Scale Collective Communication Library," GitHub. [\[link\]](https://github.com/microsoft/mscclpp)
7. DeepSeek-AI, "DeepEP: Expert parallelism with GPU-initiated communication," GitHub. [\[link\]](https://github.com/deepseek-ai/DeepEP)