{"id":34171509,"url":"https://github.com/foundation-model-stack/ocp-efa-operator","last_synced_at":"2026-03-11T07:01:44.524Z","repository":{"id":208564366,"uuid":"721196336","full_name":"foundation-model-stack/ocp-efa-operator","owner":"foundation-model-stack","description":"Operator that enables EFA and/or GDRCOPY in an OpenShift cluster","archived":false,"fork":false,"pushed_at":"2025-07-23T00:43:00.000Z","size":121,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-12-18T07:58:38.692Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/foundation-model-stack.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-20T14:53:01.000Z","updated_at":"2025-07-23T00:42:58.000Z","dependencies_parsed_at":"2023-11-22T06:27:27.156Z","dependency_job_id":"26560d14-142d-4535-8c86-c72470c460f6","html_url":"https://github.com/foundation-model-stack/ocp-efa-operator","commit_stats":null,"previous_names":["foundation-model-stack/ocp-efa-operator"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/foundation-model-stack/ocp-efa-operator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foundation-model-stack%2Focp-efa-operator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foundation-model-stack%2Focp-efa-operator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foundation-model-st
ack%2Focp-efa-operator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foundation-model-stack%2Focp-efa-operator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/foundation-model-stack","download_url":"https://codeload.github.com/foundation-model-stack/ocp-efa-operator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foundation-model-stack%2Focp-efa-operator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30373508,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-11T06:09:32.197Z","status":"ssl_error","status_checked_at":"2026-03-11T06:09:17.086Z","response_time":84,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-15T11:21:47.832Z","updated_at":"2026-03-11T07:01:44.519Z","avatar_url":"https://github.com/foundation-model-stack.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ocp-efa-operator\n\nocp-efa-operator enables GPUDirect RDMA in an OpenShift cluster over AWS Elastic Fabric Adapter (EFA) so that distributed machine learning jobs speed up with optimized high performance networking. 
It deploys the required kernel module ([efa](https://github.com/amzn/amzn-drivers/tree/master/kernel/linux/efa)) and a device plugin to let user pods access the special device functionality without elevated privileges via a custom resource `fms.io/efa`. It also automates deployment of the kernel module for [GDRCOPY](https://github.com/NVIDIA/gdrcopy) (`gdrdrv`) and exposes `fms.io/gdrdrv` for user pods to optimize accessing GPU memory from the host CPU by leveraging local RDMA.\n\n## Description\n\nWe expect that cluster admins manage the operator and two cluster-wide CRDs (`efadrivers` and `gdrdrvdrivers`) to enable EFA and/or GDRCOPY in an OpenShift cluster.\nThe operator automatically detects EFA devices and CUDA drivers in a cluster and deploys the required kernel modules.\n\nExample pod specs are available under `test/*.yaml`. User pods can mount `/dev/gdrdrv` or `/dev/infiniband/uverbs*` by adding `fms.io/gdrdrv:1` or `fms.io/efa:1` under `resources.limits`.\n\nUser pods need custom container images with special user libraries.\nExample Dockerfiles are also available under `test/Dockerfile.*`.\n\n## Getting Started\nYou’ll need a Kubernetes cluster to run against. You can use [KIND](https://sigs.k8s.io/kind) to get a local cluster for testing, or run against a remote cluster.\n**Note:** Your controller will automatically use the current context in your kubeconfig file (i.e. whatever cluster `kubectl cluster-info` shows).\n\n### Prerequisite\n\nThe operator depends on some operators. 
The following operators must be deployed in the target cluster.\n\n+ [GPU operator](https://github.com/NVIDIA/gpu-operator)\n+ [Kernel module management operator](https://github.com/kubernetes-sigs/kernel-module-management)\n+ [Node feature discovery](https://github.com/openshift/node-feature-discovery)\n\nocp-efa-operator looks up the node labels that the two operators generate for the versions of the CUDA driver and Linux kernel.\nIt also uses Node feature discovery to detect EFA devices on each node.\n\nNote: currently, ocp-efa-operator can block uninstalling/upgrading the GPU operator because it deploys gdrdrv, which depends on the NVIDIA kernel modules.\nAdmins need to clean up ocp-efa-operator first, before uninstalling/upgrading the GPU operator.\n\n\n### Install with a public bundle image\n\nRun bundle with a public image\n```bash\n$ oc new-project ocp-efa-operator\n$ operator-sdk run bundle ghcr.io/foundation-model-stack/ocp-efa-operator-bundle:v0.0.1 --namespace ocp-efa-operator\n```\n\nDeploy GdrdrvDriver\n```bash\n$ oc apply -f config/sample/efa_v1alpha1_gdrdrvdriver.yaml\n```\n\nDeploy EfaDriver\n```bash\n$ oc apply -f config/sample/efa_v1alpha1_efadriver.yaml\n```\n\nWait until the cluster gets ready\n```bash\n$ oc get efadrivers\nNAME   STATUS\nocp    Ready\n$ oc get gdrdrvdrivers\nNAME   STATUS\nocp    Ready\n```\n\n### Advanced: Build and install from source\n\nBuild and push the device-plugin image\n```bash\n$ make dp-push IMG=myrepo.io/ocp-efa-device-plugin:v0.0.1\n```\n\nBuild and push the operator image\n```bash\n$ make operator-push IMG=myrepo.io/ocp-efa-operator:v0.0.1\n```\n\nBuild and push the bundle image\n```bash\n$ make bundle-push IMG=myrepo.io/ocp-efa-operator-bundle:v0.0.1\n```\n\nRun bundle with the pushed image in your namespace with your pull secret\n```bash\n$ oc new-project ocp-efa-operator\n$ oc create secret generic mysecret -n ocp-efa-operator --from-file=.dockerconfigjson=my.docker.config --type=kubernetes.io/dockerconfigjson\n$ operator-sdk run 
bundle myrepo.io/ocp-efa-operator-bundle:v0.0.1 --pull-secret-name mysecret --namespace ocp-efa-operator\n```\n\nDeploy GdrdrvDriver\n```bash\n$ oc apply -f config/sample/efa_v1alpha1_gdrdrvdriver.yaml\n```\n\nDeploy EfaDriver\n```bash\n$ oc apply -f config/sample/efa_v1alpha1_efadriver.yaml\n```\n\nWait until the cluster gets ready\n```bash\n$ oc get efadrivers\nNAME   STATUS\nocp    Ready\n$ oc get gdrdrvdrivers\nNAME   STATUS\nocp    Ready\n```\n\n### Testing\n\n*GDRCOPY:*\n\nBuild a gdrcopy test image\n```bash\n$ docker build -f test/Dockerfile.gdrcopy -t myrepo.io/gdrcopy-test:ocp-efa-v0.0.1 ./test\n```\n\nRun a gdrcopy test job\n```bash\n$ vim test/gdrcopy-test.yaml # modify image and imagePullSecrets\n$ oc apply -f test/gdrcopy-test.yaml\n```\n\n*EFA Pingpong and NCCL Tests AllReduce:*\n\nBuild an NCCL test image\n```bash\n$ docker build -f test/Dockerfile.nccl-tests -t myrepo.io/nccl-tests:ocp-efa-v0.0.1 ./test\n```\n\nRun an EFA pingpong test\n```bash\n$ vim test/efa-test-pingpong.yaml # modify image and imagePullSecrets\n$ oc apply -f test/efa-test-pingpong.yaml\n```\n\nRun nccl-tests as an MPIJob (requires [training operator](https://github.com/kubeflow/training-operator))\n```bash\n$ vim test/efa-test-pingpong.yaml # modify image and imagePullSecrets\n$ oc apply -f test/gdrcopy-test.yaml\n```\n\n*PyTorch AllReduce:*\n\nBuild a PyTorch test image\n```bash\n$ docker build -f test/Dockerfile.pytorch -t myrepo.io/pytorch:ocp-efa-v0.0.1 ./test\n```\n\nRun a PyTorch test (requires [training operator](https://github.com/kubeflow/training-operator))\n```bash\n$ vim test/efa-test-pingpong.yaml # modify image and imagePullSecrets\n$ oc apply -f test/gdrcopy-test.yaml\n```\n\n### Cleanup\n\nDelete CRs\n```bash\n$ oc delete -f config/sample/efa_v1alpha1_efadriver.yaml\n$ oc delete -f config/sample/efa_v1alpha1_gdrdrvdriver.yaml\n```\n\nCleanup bundle\n```bash\n$ operator-sdk cleanup ocp-efa-operator 
--delete-all\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoundation-model-stack%2Focp-efa-operator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffoundation-model-stack%2Focp-efa-operator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoundation-model-stack%2Focp-efa-operator/lists"}