{"id":13475594,"url":"https://github.com/kubeflow/mpi-operator","last_synced_at":"2025-05-14T16:12:03.401Z","repository":{"id":33747905,"uuid":"134986009","full_name":"kubeflow/mpi-operator","owner":"kubeflow","description":"Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)","archived":false,"fork":false,"pushed_at":"2025-05-13T08:48:21.000Z","size":53496,"stargazers_count":475,"open_issues_count":101,"forks_count":229,"subscribers_count":35,"default_branch":"master","last_synced_at":"2025-05-13T09:40:46.980Z","etag":null,"topics":["apache-mxnet","distributed-computing","horovod","kubeflow","kubernetes","mpi","pytorch","tensorflow"],"latest_commit_sha":null,"homepage":"https://www.kubeflow.org/docs/components/training/mpi/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kubeflow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-26T18:34:47.000Z","updated_at":"2025-05-13T08:48:26.000Z","dependencies_parsed_at":"2023-02-19T06:31:22.628Z","dependency_job_id":"e07eb322-98ca-45e9-b862-649dceb0a76c","html_url":"https://github.com/kubeflow/mpi-operator","commit_stats":{"total_commits":350,"total_committers":73,"mean_commits":4.794520547945205,"dds":0.8228571428571428,"last_synced_commit":"c738a83b185b4bf3bf7e6eca9d4503653294c995"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kubeflow%2Fmpi-operator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/kubeflow%2Fmpi-operator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kubeflow%2Fmpi-operator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kubeflow%2Fmpi-operator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kubeflow","download_url":"https://codeload.github.com/kubeflow/mpi-operator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254179905,"owners_count":22027884,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-mxnet","distributed-computing","horovod","kubeflow","kubernetes","mpi","pytorch","tensorflow"],"created_at":"2024-07-31T16:01:21.756Z","updated_at":"2025-05-14T16:12:03.382Z","avatar_url":"https://github.com/kubeflow.png","language":"Go","funding_links":[],"categories":["Go","Software"],"sub_categories":["Trends"],"readme":"# MPI Operator\n\n[![Build Status](https://github.com/kubeflow/mpi-operator/workflows/build/badge.svg)](https://github.com/kubeflow/mpi-operator/actions?query=event%3Apush+branch%3Amaster)\n[![Docker Pulls](https://img.shields.io/docker/pulls/mpioperator/mpi-operator)](https://hub.docker.com/r/mpioperator/mpi-operator)\n\nThe MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes. 
Please check out [this blog post](https://medium.com/kubeflow/introduction-to-kubeflow-mpi-operator-and-industry-adoption-296d5f2e6edc) for an introduction to the MPI Operator and its industry adoption.\n\n## Installation\n\nYou can deploy the operator with default settings by running one of the following commands:\n\n- Latest Development Version\n\n```shell\nkubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml\n```\n\n- Release Version\n\n```shell\nkubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml\n```\n\nAlternatively, follow the [getting started guide](https://www.kubeflow.org/docs/started/getting-started/) to deploy Kubeflow.\n\nAn alpha version of MPI support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow newer than 0.2.0.\n\nYou can check whether the MPI Job custom resource is installed via:\n\n```\nkubectl get crd\n```\n\nThe output should include `mpijobs.kubeflow.org` like the following:\n\n```\nNAME                                       AGE\n...\nmpijobs.kubeflow.org                       4d\n...\n```\n\nIf it is not included, you can add it as follows using [kustomize](https://github.com/kubernetes-sigs/kustomize):\n\n```bash\ngit clone https://github.com/kubeflow/mpi-operator\ncd mpi-operator\nkustomize build manifests/overlays/kubeflow | kubectl apply -f -\n```\n\nNote that since Kubernetes v1.14, `kustomize` is available as a subcommand of `kubectl`, so you can also run the following command instead:\n\n```bash\nkubectl kustomize base | kubectl apply -f -\n```\n\nSince Kubernetes v1.21, you can use:\n\n```bash\nkubectl apply -k manifests/overlays/kubeflow\n```\n\n## Creating an MPI Job\n\nYou can create an MPI job by defining an `MPIJob` config file. 
See [TensorFlow benchmark example](examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml) config file for launching a multi-node TensorFlow benchmark training job. You may change the config file based on your requirements.\n\n```\ncat examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml\n```\n\nDeploy the `MPIJob` resource to start training:\n\n```\nkubectl apply -f examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml\n```\n\n## Monitoring an MPI Job\n\nOnce the `MPIJob` resource is created, you should now be able to see the created pods matching the specified number of GPUs. You can also monitor the job status from the status section. Here is sample output when the job is successfully completed.\n\n```\nkubectl get -o yaml mpijobs tensorflow-benchmarks\n```\n\n```\napiVersion: kubeflow.org/v2beta1\nkind: MPIJob\nmetadata:\n  creationTimestamp: \"2019-07-09T22:15:51Z\"\n  generation: 1\n  name: tensorflow-benchmarks\n  namespace: default\n  resourceVersion: \"5645868\"\n  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/mpijobs/tensorflow-benchmarks\n  uid: 1c5b470f-a297-11e9-964d-88d7f67c6e6d\nspec:\n  runPolicy:\n    cleanPodPolicy: Running\n  mpiReplicaSpecs:\n    Launcher:\n      replicas: 1\n      template:\n        spec:\n          containers:\n          - command:\n            - mpirun\n            - --allow-run-as-root\n            - -np\n            - \"2\"\n            - -bind-to\n            - none\n            - -map-by\n            - slot\n            - -x\n            - NCCL_DEBUG=INFO\n            - -x\n            - LD_LIBRARY_PATH\n            - -x\n            - PATH\n            - -mca\n            - pml\n            - ob1\n            - -mca\n            - btl\n            - ^openib\n            - python\n            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py\n            - --model=resnet101\n            - --batch_size=64\n            - --variable_update=horovod\n            image: 
mpioperator/tensorflow-benchmarks:latest\n            name: tensorflow-benchmarks\n    Worker:\n      replicas: 1\n      template:\n        spec:\n          containers:\n          - image: mpioperator/tensorflow-benchmarks:latest\n            name: tensorflow-benchmarks\n            resources:\n              limits:\n                nvidia.com/gpu: 2\n  slotsPerWorker: 2\nstatus:\n  completionTime: \"2019-07-09T22:17:06Z\"\n  conditions:\n  - lastTransitionTime: \"2019-07-09T22:15:51Z\"\n    lastUpdateTime: \"2019-07-09T22:15:51Z\"\n    message: MPIJob default/tensorflow-benchmarks is created.\n    reason: MPIJobCreated\n    status: \"True\"\n    type: Created\n  - lastTransitionTime: \"2019-07-09T22:15:54Z\"\n    lastUpdateTime: \"2019-07-09T22:15:54Z\"\n    message: MPIJob default/tensorflow-benchmarks is running.\n    reason: MPIJobRunning\n    status: \"False\"\n    type: Running\n  - lastTransitionTime: \"2019-07-09T22:17:06Z\"\n    lastUpdateTime: \"2019-07-09T22:17:06Z\"\n    message: MPIJob default/tensorflow-benchmarks successfully completed.\n    reason: MPIJobSucceeded\n    status: \"True\"\n    type: Succeeded\n  replicaStatuses:\n    Launcher:\n      succeeded: 1\n    Worker: {}\n  startTime: \"2019-07-09T22:15:51Z\"\n```\n\nTraining should run for 100 steps and take a few minutes on a GPU cluster. You can inspect the logs to see the training progress. 
When the job starts, access the logs from the `launcher` pod:\n\n```\nPODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks,training.kubeflow.org/job-role=launcher -o name)\nkubectl logs -f ${PODNAME}\n```\n\n```\nTensorFlow:  1.14\nModel:       resnet101\nDataset:     imagenet (synthetic)\nMode:        training\nSingleSess:  False\nBatch size:  128 global\n             64 per device\nNum batches: 100\nNum epochs:  0.01\nDevices:     ['horovod/gpu:0', 'horovod/gpu:1']\nNUMA bind:   False\nData format: NCHW\nOptimizer:   sgd\nVariables:   horovod\n\n...\n\n40\timages/sec: 154.4 +/- 0.7 (jitter = 4.0)\t8.280\n40\timages/sec: 154.4 +/- 0.7 (jitter = 4.1)\t8.482\n50\timages/sec: 154.8 +/- 0.6 (jitter = 4.0)\t8.397\n50\timages/sec: 154.8 +/- 0.6 (jitter = 4.2)\t8.450\n60\timages/sec: 154.5 +/- 0.5 (jitter = 4.1)\t8.321\n60\timages/sec: 154.5 +/- 0.5 (jitter = 4.4)\t8.349\n70\timages/sec: 154.5 +/- 0.5 (jitter = 4.0)\t8.433\n70\timages/sec: 154.5 +/- 0.5 (jitter = 4.4)\t8.430\n80\timages/sec: 154.8 +/- 0.4 (jitter = 3.6)\t8.199\n80\timages/sec: 154.8 +/- 0.4 (jitter = 3.8)\t8.404\n90\timages/sec: 154.6 +/- 0.4 (jitter = 3.7)\t8.418\n90\timages/sec: 154.6 +/- 0.4 (jitter = 3.6)\t8.459\n100\timages/sec: 154.2 +/- 0.4 (jitter = 4.0)\t8.372\n100\timages/sec: 154.2 +/- 0.4 (jitter = 4.0)\t8.542\n----------------------------------------------------------------\ntotal images/sec: 308.27\n```\n\nFor a sample that uses Intel MPI, see:\n\n```bash\ncat examples/pi/pi-intel.yaml\n```\n\nFor a sample that uses MPICH, see:\n\n```bash\ncat examples/pi/pi-mpich.yaml\n```\n\n## Exposed Metrics\n\n| Metric name | Metric type | Description | Labels |\n| ----------- | ----------- | ----------- | ------ |\n|mpi\\_operator\\_jobs\\_created\\_total | Counter  | Counts number of MPI jobs created | |\n|mpi\\_operator\\_jobs\\_successful\\_total | Counter  | Counts number of MPI jobs successful | |\n|mpi\\_operator\\_jobs\\_failed\\_total | Counter  | Counts number of 
MPI jobs failed| |\n|mpi\\_operator\\_job\\_info | Gauge | Information about MPIJob | `launcher`=\u0026lt;launcher-pod-name\u0026gt; \u003cbr\u003e `namespace`=\u0026lt;job-namespace\u0026gt; |\n\n### Join Metrics\n\nWith [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics), you can join metrics by labels.\nFor example: `kube_pod_info * on(pod,namespace) group_left label_replace(mpi_operator_job_info, \"pod\", \"$0\", \"launcher\", \".*\")`\n\n## Docker Images\n\nWe push Docker images to [mpioperator on Docker Hub](https://hub.docker.com/u/mpioperator) for every release.\nYou can use the following Dockerfile to build the image yourself:\n\n- [mpi-operator](https://github.com/kubeflow/mpi-operator/blob/master/Dockerfile)\n\nAlternatively, you can build the image using make:\n\n```bash\nmake RELEASE_VERSION=dev IMAGE_NAME=registry.example.com/mpi-operator images\n```\n\nThis will produce an image with the tag `registry.example.com/mpi-operator:dev`.\n\n## Contributing\n\nLearn more in [CONTRIBUTING](https://github.com/kubeflow/mpi-operator/blob/master/CONTRIBUTING.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkubeflow%2Fmpi-operator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkubeflow%2Fmpi-operator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkubeflow%2Fmpi-operator/lists"}