{"id":35124286,"url":"https://github.com/crazyguitar/libefaxx","last_synced_at":"2026-01-16T21:00:48.107Z","repository":{"id":330640594,"uuid":"1119836270","full_name":"crazyguitar/Libefaxx","owner":"crazyguitar","description":null,"archived":false,"fork":false,"pushed_at":"2026-01-13T09:36:10.000Z","size":306,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-13T22:55:14.287Z","etag":null,"topics":["aws","benchmark","cpp20-coroutine","cuda","efa","gpu","gpu-benchmarks","hpc","large-language-models","llm","rdma","rdma-benchmarks"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crazyguitar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-19T23:58:04.000Z","updated_at":"2026-01-13T09:36:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/crazyguitar/Libefaxx","commit_stats":null,"previous_names":["crazyguitar/libefaxx"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/crazyguitar/Libefaxx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyguitar%2FLibefaxx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyguitar%2FLibefaxx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyguitar%2FLibefaxx/releases","manifests_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyguitar%2FLibefaxx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crazyguitar","download_url":"https://codeload.github.com/crazyguitar/Libefaxx/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyguitar%2FLibefaxx/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28482475,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","benchmark","cpp20-coroutine","cuda","efa","gpu","gpu-benchmarks","hpc","large-language-models","llm","rdma","rdma-benchmarks"],"created_at":"2025-12-28T01:37:57.599Z","updated_at":"2026-01-16T21:00:48.089Z","avatar_url":"https://github.com/crazyguitar.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Libefaxx (AWS EFA Benchmark for GPU/CPU)\n\nHigh-performance inter-node communication over AWS Elastic Fabric Adapter (EFA)\nis a key enabler for scaling large-language-model (LLM) training efficiently.\nExisting benchmarking tools primarily focus on collective communication libraries\nsuch as [NCCL](https://github.com/NVIDIA/nccl) or [NVSHMEM](https://github.com/NVIDIA/nvshmem),\nmaking it 
difficult to isolate and understand the raw performance characteristics\nof EFA itself. At the same time, [GPU-Initiated Networking](https://arxiv.org/pdf/2511.15076) (GIN)\nhas gained significant attention following the release of [DeepEP](https://github.com/deepseek-ai/DeepEP),\nwhich demonstrated substantial MoE performance gains by enabling GPU-driven\ncommunication.\n\nThis repository provides a focused benchmarking framework for EFA, designed to\nanalyze low-level inter-node communication performance. It complements existing\ntools such as [nccl-tests](https://github.com/NVIDIA/nccl-tests) by enabling\ndirect measurement of EFA latency, bandwidth, and GIN behavior, helping\nengineers and researchers optimize distributed training pipelines on AWS. An\nexample evaluation on p5.48xlarge is available [here](experiments).\n\n## Development\n\nThe following snippets demonstrate how to build the source code for a simple test.\nTo save time on environment setup and dependency management, this repository provides\na [Dockerfile](Dockerfile) that can be used to build the project in a consistent\nand reproducible environment.\n\n```bash\n# build a Docker image\ndocker build -f Dockerfile -t cuda:latest .\n\n# build examples\nmake build\n```\n\nIf [enroot](https://github.com/NVIDIA/enroot) is available in your environment,\nyou can launch the experiment using the following commands:\n\n```bash\n# build an enroot squashfs (.sqsh) image\nmake sqush\n\n# launch an interactive enroot environment\nenroot create --name cuda cuda+latest.sqsh\nenroot start --mount /fsx:/fsx cuda /bin/bash\n\n# run a test via enroot on a Slurm cluster\nsrun -N 1 \\\n  --container-image \"${PWD}/cuda+latest.sqsh\" \\\n  --container-mounts /fsx:/fsx \\\n  --container-name cuda \\\n  --mpi=pmix \\\n  --ntasks-per-node=1 \\\n  \"${PWD}/build/experiments/affinity/affinity\"\n```\n\n## Example\n\nWhen implementing custom algorithms directly over EFA, developers often face the\ncomplexity of asynchronous RDMA 
APIs and event-driven scheduling. To simplify\nthis workflow, this repository includes a coroutine-based scheduler built on\n[C++20 coroutine](https://en.cppreference.com/w/cpp/language/coroutines.html),\nenabling a more straightforward programming model without manual callback management.\nThe example below shows how to build a PoC using pure [libfabric](https://github.com/ofiwg/libfabric/) and [MPI](https://www.open-mpi.org/).\n\n```cpp\n#include \u003cbench/arguments.h\u003e\n#include \u003crdma/fabric/memory.h\u003e\n#include \u003cbench/modules/sendrecv.cuh\u003e\n#include \u003cbench/mpi/fabric.cuh\u003e\n\n// mpirun -np 2 --npernode 1 example\n\nint main(int argc, char *argv[]) {\n  size_t bufsize = 128 \u003c\u003c 10; // 128k\n  FabricBench peer;\n  peer.Exchange();\n  peer.Connect();\n  int rank = peer.mpi.GetWorldRank();\n\n  auto send = peer.Alloc\u003cfi::SymmetricDMAMemory\u003e(bufsize, rank);\n  auto recv = peer.Alloc\u003cfi::SymmetricDMAMemory\u003e(bufsize, -1);\n  peer.Handshake(send, recv);\n\n  auto verify = [](auto\u0026, auto\u0026) {};\n  auto result = peer.Bench(\"test\", send, recv, PairBench\u003cFabricBench, fi::FabricSelector\u003e{1}, verify, 100);\n  return 0;\n}\n```\n\nTo learn how to use the library provided in this repository, please refer to the\nfollowing example experiments, which illustrate common usage patterns and benchmarking scenarios:\n\n* [Affinity](experiments/affinity): Demonstrates how to query and enumerate GPU device information.\n* [EFA](experiments/efa): Shows how to discover and inspect available EFA devices.\n* [Echo](experiments/echo): Implements a simple TCP echo server/client to illustrate usage of the coroutine-based scheduler.\n* [Bootstrap](experiments/bootstrap): Illustrates exchanging RDMA details via MPI communication.\n* [Send\\/Recv](experiments/sendrecv): Benchmarks libfabric SEND/RECV operations over EFA.\n* [Write](experiments/write): Benchmarks libfabric WRITE operations over EFA.\n* 
[Alltoall](experiments/all2all): Benchmarks a simple all-to-all communication pattern over EFA.\n* [Queue](experiments/queue): Benchmarks a multi-producer, single-consumer (MPSC) queue between GPU and CPU.\n* [Proxy](experiments/proxy): Benchmarks GPU-initiated RDMA writes via a CPU proxy coroutine.\n* [IPC](experiments/ipc): Benchmarks intra-node GPU-to-GPU communication via CUDA IPC.\n* [Shmem](experiments/shmem): NVSHMEM-like API example demonstrating `shmem_*` interface over EFA.\n\n## Citation\n\nSee [CITATION.cff](CITATION.cff) for machine-readable citation information.\n\n### BibTeX\n```bibtex\n@software{tsai2026aws_efa_gpu_benchmark,\n  title = {AWS EFA GPU Benchmark},\n  author = {Tsai, Chang-Ning},\n  year = {2026},\n  month = {1},\n  url = {https://github.com/crazyguitar/Libefaxx},\n  version = {0.3.1},\n  abstract = {High-performance RDMA communication experiments using CUDA and Amazon Elastic Fabric Adapter (EFA)},\n  keywords = {RDMA, CUDA, EFA, High-Performance Computing, GPU Communication, Amazon EFA, Fabric, MPI}\n}\n```\n\n### APA Style\nTsai, C.-N. (2026). *AWS EFA GPU Benchmark* (Version 0.3.1) [Computer software]. https://github.com/crazyguitar/Libefaxx\n\n## References\n\n1. Q. Le, \"Libfabric EFA Series,\" 2024. [\\[link\\]](https://le.qun.ch/en/blog/2024/12/25/libfabric-efa-0-intro/)\n2. K. Punniyamurthy et al., \"Optimizing Distributed ML Communication,\" arXiv:2305.06942, 2023. [\\[link\\]](https://arxiv.org/pdf/2305.06942)\n3. S. Liu et al., \"GPU-Initiated Networking,\" arXiv:2511.15076, 2025. [\\[link\\]](https://arxiv.org/abs/2511.15076)\n4. Netcan, \"asyncio: C++20 coroutine library,\" GitHub. [\\[link\\]](https://github.com/netcan/asyncio)\n5. UCCL Project, \"UCCL: User-space Collective Communication Library,\" GitHub. [\\[link\\]](https://github.com/uccl-project/uccl)\n6. Microsoft, \"MSCCL++: Multi-Scale Collective Communication Library,\" GitHub. [\\[link\\]](https://github.com/microsoft/mscclpp)\n7. 
DeepSeek-AI, \"DeepEP: Expert parallelism with GPU-initiated communication,\" GitHub. [\\[link\\]](https://github.com/deepseek-ai/DeepEP)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrazyguitar%2Flibefaxx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrazyguitar%2Flibefaxx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrazyguitar%2Flibefaxx/lists"}