{"id":19767415,"url":"https://github.com/tusimple/mxnet-operator","last_synced_at":"2025-10-26T13:06:17.382Z","repository":{"id":68971716,"uuid":"144526603","full_name":"TuSimple/mxnet-operator","owner":"TuSimple","description":"Experimental repository for a MXNet operator ","archived":false,"fork":false,"pushed_at":"2018-08-27T07:49:51.000Z","size":7082,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-11T00:26:02.917Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TuSimple.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-13T03:46:33.000Z","updated_at":"2020-07-08T02:43:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"624fd627-67c8-40db-9f15-bd264ae047ec","html_url":"https://github.com/TuSimple/mxnet-operator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TuSimple%2Fmxnet-operator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TuSimple%2Fmxnet-operator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TuSimple%2Fmxnet-operator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TuSimple%2Fmxnet-operator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TuSimple","download_url":"https://codeload.github.co
m/TuSimple/mxnet-operator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241100131,"owners_count":19909644,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T04:29:38.939Z","updated_at":"2025-10-26T13:06:17.321Z","avatar_url":"https://github.com/TuSimple.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mxnet-operator: a Kubernetes operator for MXNet jobs\n\n## Overview\n\nMXJob provides a Kubernetes custom resource that makes it easy to\nrun distributed or non-distributed MXNet jobs on Kubernetes.\n\nUsing a Custom Resource Definition (CRD) lets users create and manage MXJobs just like built-in K8s resources. 
For example, to create a job:\n\n```\nkubectl create -f examples/mx_job_dist.yaml \n```\n\nTo list jobs:\n\n```bash\nkubectl get mxjobs\n\nNAME          CREATED AT\nexample-dist-job   3m\n```\n\n### Requirements\n\nkubelet : v1.11.1\n\nkubeadm : v1.11.1\n\nDocker :\n\n```\nClient:\n Version:      17.03.2-ce\n API version:  1.27\n Go version:   go1.6.2\n Git commit:   f5ec1e2\n Built:        Thu Jul  5 23:07:48 2018\n OS/Arch:      linux/amd64\n\nServer:\n Version:      17.03.2-ce\n API version:  1.27 (minimum version 1.12)\n Go version:   go1.6.2\n Git commit:   f5ec1e2\n Built:        Thu Jul  5 23:07:48 2018\n OS/Arch:      linux/amd64\n Experimental: false\n```\n\nkubectl :\n\n```\nClient Version: version.Info{Major:\"1\", Minor:\"11\", GitVersion:\"v1.11.1\", GitCommit:\"b1b29978270dc22fecc592ac55d903350454310a\", GitTreeState:\"clean\", BuildDate:\"2018-07-17T18:53:20Z\", GoVersion:\"go1.10.3\", Compiler:\"gc\", Platform:\"linux/amd64\"}\nServer Version: version.Info{Major:\"1\", Minor:\"10\", GitVersion:\"v1.10.5\", GitCommit:\"32ac1c9073b132b8ba18aa830f46b77dcceb0723\", GitTreeState:\"clean\", BuildDate:\"2018-06-21T11:34:22Z\", GoVersion:\"go1.9.3\", Compiler:\"gc\", Platform:\"linux/amd64\"}\n```\n\nkubernetes : branch release-1.11\n\nincubator-mxnet : v1.2.0\n\n## Installing the MXJob CRD and operator on your k8s cluster\n\n### Deploy Kubeflow\n\nmxnet-operator has been contributed to Kubeflow, so please refer to the [Kubeflow installation guide](https://www.kubeflow.org/docs/started/getting-started/) first.\n\n### Verify that MXNet support is included in your Kubeflow deployment\n\nCheck that the MXNet custom resource is installed:\n\n```\nkubectl get crd\n```\n\nThe output should include `mxjobs.kubeflow.org`:\n\n```\nNAME                                           AGE\n...\nmxjobs.kubeflow.org                       4d\n...\n```\n\nIf it is not included, you can add it as follows:\n\n```\ncd ${KSONNET_APP}\nks pkg install kubeflow/mxnet-job\nks generate mxnet-operator 
mxnet-operator\nks apply ${ENVIRONMENT} -c mxnet-operator\n```\n\n### Creating a job\n\nYou create a job by defining an MXJob and then creating it with:\n\n```\nkubectl create -f examples/mx_job_dist.yaml \n```\n\nEach replicaSpec defines a set of MXNet processes.\nThe mxReplicaType defines the semantics for the set of processes.\nThe semantics are as follows:\n\n**scheduler**\n  * A job must have 1 and only 1 scheduler\n  * The pod must contain a container named mxnet\n  * The overall status of the MXJob is determined by the exit code of the\n    mxnet container\n      * 0 = success\n      * 1 || 2 || 126 || 127 || 128 || 139 = permanent errors:\n          * 1: general errors\n          * 2: misuse of shell builtins\n          * 126: command invoked cannot execute\n          * 127: command not found\n          * 128: invalid argument to exit\n          * 139: container terminated by SIGSEGV (invalid memory reference)\n      * 130 || 137 || 143 = retryable errors caused by unexpected system signals:\n          * 130: container terminated by Control-C\n          * 137: container received a SIGKILL\n          * 143: container received a SIGTERM\n      * 138 = reserved in tf-operator for user-specified retryable errors\n      * others = undefined, with no guarantees\n\n**worker**\n  * A job can have 0 to N workers\n  * The pod must contain a container named mxnet\n  * Workers are automatically restarted if they exit\n\n**server**\n  * A job can have 0 to N servers\n  * Parameter servers are automatically restarted if they exit\n\n\nFor each replica, you define a **template**, which is a K8s\n[PodTemplateSpec](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core).\nThe template allows you to specify the containers, volumes, etc. 
that\nshould be created for each replica.\n\n### Using GPUs\n\nmxnet-operator now supports GPU training.\n\nPlease verify that your image supports distributed GPU training.\n\nFor example,\n\n```\ncommand: [\"python\"]\nargs: [\"/incubator-mxnet/example/image-classification/train_mnist.py\",\"--num-epochs\",\"1\",\"--num-layers\",\"2\",\"--kv-store\",\"dist_device_sync\",\"--gpus\",\"0\"]\nresources:\n  limits:\n    nvidia.com/gpu: 1\n```\n\nmxnet-operator will place the pods on nodes that satisfy the GPU limit.\n\n## Monitoring your job\n\nTo get the status of your job:\n\n```bash\nkubectl get -o yaml mxjobs $JOB\n```\n\nHere is sample output for an example job:\n\n```yaml\napiVersion: kubeflow.org/v1alpha1\nkind: MXJob\nmetadata:\n  clusterName: \"\"\n  creationTimestamp: 2018-08-10T07:13:39Z\n  generation: 1\n  name: example-dist-job\n  namespace: default\n  resourceVersion: \"491499\"\n  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/mxjobs/example-dist-job\n  uid: e800b1ed-9c6c-11e8-962f-704d7b2c0a63\nspec:\n  RuntimeId: aycw\n  jobMode: dist\n  mxImage: mxjob/mxnet:gpu\n  replicaSpecs:\n  - PsRootPort: 9000\n    mxReplicaType: SCHEDULER\n    replicas: 1\n    template:\n      metadata:\n        creationTimestamp: null\n      spec:\n        containers:\n        - args:\n          - train_mnist.py\n          command:\n          - python\n          image: mxjob/mxnet:gpu\n          name: mxnet\n          resources: {}\n          workingDir: /incubator-mxnet/example/image-classification\n        restartPolicy: OnFailure\n  - PsRootPort: 9091\n    mxReplicaType: SERVER\n    replicas: 1\n    template:\n      metadata:\n        creationTimestamp: null\n      spec:\n        containers:\n        - args:\n          - train_mnist.py\n          command:\n          - python\n          image: mxjob/mxnet:gpu\n          name: mxnet\n          resources: {}\n          workingDir: /incubator-mxnet/example/image-classification\n        
restartPolicy: OnFailure\n  - PsRootPort: 9091\n    mxReplicaType: WORKER\n    replicas: 1\n    template:\n      metadata:\n        creationTimestamp: null\n      spec:\n        containers:\n        - args:\n          - train_mnist.py\n          - --num-epochs=10\n          - --num-layers=2\n          - --kv-store=dist_device_sync\n          command:\n          - python\n          image: mxjob/mxnet:gpu\n          name: mxnet\n          resources: {}\n          workingDir: /incubator-mxnet/example/image-classification\n        restartPolicy: OnFailure\n  terminationPolicy:\n    chief:\n      replicaIndex: 0\n      replicaName: SCHEDULER\nstatus:\n  phase: Running\n  reason: \"\"\n  replicaStatuses:\n  - ReplicasStates:\n      Running: 1\n    mx_replica_type: SCHEDULER\n    state: Running\n  - ReplicasStates:\n      Running: 1\n    mx_replica_type: SERVER\n    state: Running\n  - ReplicasStates:\n      Running: 1\n    mx_replica_type: WORKER\n    state: Running\n  state: Running\n\n\n```\n\nThe first thing to note is the **RuntimeId**. 
This is a random unique\nstring which is used to give names to all the K8s resources\n(e.g. Job controllers \u0026 services) that are created by the MXJob.\n\nAs with other K8s resources, status provides information about the state\nof the resource.\n\n**phase** - Indicates the phase of a job and will be one of\n - Creating\n - Running\n - CleanUp\n - Failed\n - Done\n\n**state** - Provides the overall status of the job and will be one of\n  - Running\n  - Succeeded\n  - Failed\n\nFor each replica type in the job, there will be a ReplicaStatus that\nprovides the number of replicas of that type in each state.\n\nFor each replica type, the job creates a set of K8s\n[Job Controllers](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/)\nnamed\n\n```\n${REPLICA-TYPE}-${RUNTIME_ID}-${INDEX}\n```\n\nFor example, if you have 2 servers and runtime ID 76no, MXJob\nwill create the jobs\n\n```\nserver-76no-0\nserver-76no-1\n```\n\n## Contributing\n\nPlease refer to the [developer_guide](https://github.com/kubeflow/tf-operator/blob/master/developer_guide.md).\n\n## Community\n\nThis is a part of Kubeflow, so please see the [readme in kubeflow/kubeflow](https://github.com/kubeflow/kubeflow#get-involved) to get in touch with the community.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftusimple%2Fmxnet-operator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftusimple%2Fmxnet-operator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftusimple%2Fmxnet-operator/lists"}