{"id":28740984,"url":"https://github.com/aliyuncontainerservice/et-operator","last_synced_at":"2025-10-17T08:28:11.512Z","repository":{"id":50928965,"uuid":"315880301","full_name":"AliyunContainerService/et-operator","owner":"AliyunContainerService","description":"Kubernetes Operator for AI and Bigdata Elastic Training","archived":false,"fork":false,"pushed_at":"2025-01-10T09:17:16.000Z","size":564,"stargazers_count":86,"open_issues_count":12,"forks_count":24,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-06-16T07:09:47.888Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AliyunContainerService.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-11-25T08:50:24.000Z","updated_at":"2025-06-06T08:13:40.000Z","dependencies_parsed_at":"2025-06-16T07:09:50.414Z","dependency_job_id":"fd819866-99da-4791-99f2-5716ab1d3ffa","html_url":"https://github.com/AliyunContainerService/et-operator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AliyunContainerService/et-operator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AliyunContainerService%2Fet-operator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AliyunContainerService%2Fet-operator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AliyunContainerService%2Fet-operator/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/AliyunContainerService%2Fet-operator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AliyunContainerService","download_url":"https://codeload.github.com/AliyunContainerService/et-operator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AliyunContainerService%2Fet-operator/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267335484,"owners_count":24070771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-16T07:09:46.859Z","updated_at":"2025-10-17T08:28:06.473Z","avatar_url":"https://github.com/AliyunContainerService.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Elastic Training Operator\n\n## Overview\n\nSome distributed deep learning training frameworks, such as [Horovod](https://github.com/horovod/horovod), support elastic training, which lets a training job scale the number of workers up and down dynamically at runtime without interrupting the training process.\n\nEt-operator provides a set of Kubernetes Custom Resource Definitions (CRDs) that make it easy to run Horovod or AIACC elastic training in Kubernetes. 
After submitting a training job, you can scale workers in and out on demand during training, which makes your training job more elastic and efficient.\n\n\n## Design\nThe `et-operator` works with three new CRDs: `TrainingJob`, `ScaleIn` and `ScaleOut`.\n\n### TrainingJob \nA user submits a `TrainingJob` CR to specify the details of a training job, such as the launcher and worker images, the entrypoint command, and the number of worker replicas.\nThe `et-operator` receives the creation event, then creates the sub-resources (such as pods, configmap, service, and secret) of the `TrainingJob`.\n\n\n![TrainingJob](./docs/images/trainingjob.png)\n\nFor the `TrainingJob`, the operator creates the worker pods and services and generates the `Secret` and `ConfigMap` for the launcher pod.\nWhen all workers are ready, the operator creates the launcher pod and syncs the pods' status. \n\nAfter the launcher pod exits, `et-operator` updates the `TrainingJob` phase to `Success` or `Fail` according to the pod's exit code,\nthen does the cleanup.\n\n![TrainingJob Resource](./docs/images/trainingjob-resource.png)\n\n#### ScaleIn\nWe can submit `ScaleIn` and `ScaleOut` resources to specify scale-in and scale-out actions for a `TrainingJob`.\n\nAfter the `TrainingJob` starts running, `et-operator` continuously checks whether there are available `ScaleIn` and `ScaleOut` CRs, and executes them.  \n\nIn a `ScaleIn` CR, we can specify the TrainingJob's name and which workers to scale in (by `count` or by the workers' names). \nWhen `et-operator` finds an available `ScaleIn` CR, it starts to execute the scale-in operation.\nFirst, it updates the host config of the `TrainingJob`. \nHorovod elastic mode needs a script that returns the hosts' topology; a change to the hosts notifies the launcher, which then gracefully shuts down the worker processes that are no longer in the host list.  
\n  \nAfter the hostFile is updated, `et-operator` starts to check whether the launch process still exists. \nOnce `et-operator` confirms that the launch process of the scaled-in worker no longer exists, it deletes the worker's resources.  \n\n![ScaleIn](./docs/images/scalein.png)\n\n\n#### ScaleOut\nIn a `ScaleOut` CR, we can specify the TrainingJob's name and the number of workers that we want to scale out. \nWhen `et-operator` executes the scale-out operation,\nunlike scale-in, it first creates the new workers' resources.\nAfter the workers' resources are ready, it updates the hostFile. \n  \n\n![ScaleOut](./docs/images/scaleout.png)\n\n\n## Setup\n### Installation\n\n```\ngit clone http://github.com/aliyunContainerService/et-operator\ncd et-operator\nkubectl create -f config/deploy.yaml\n```\n\nAlternatively, customize the configuration and run:\n\n```\nmake deploy\n```\n\nYou can check whether the TrainingJob custom resources are installed via:\n\n```\nkubectl get crd\n```\n\n```\nNAME                                    CREATED AT\nscaleins.kai.alibabacloud.com           2020-11-11T11:16:13Z\nscaleouts.kai.alibabacloud.com          2020-11-11T11:16:13Z\ntrainingjobs.kai.alibabacloud.com       2020-11-11T11:16:13Z\n```\n\nCheck the operator status:\n\n```\nkubectl -n kube-ai get pod\n```\n\n```\nNAME                          READY   STATUS    RESTARTS   AGE\net-operator-ddd56ff8c-tdr2n   1/1     Running   0          59s\n```\n\n\n## User guide\n\n### Create an elastic training job\nThe training code needs to be written for elastic training mode, [see details](https://horovod.readthedocs.io/en/stable/elastic_include.html).\nYou can create a training job by submitting a TrainingJob YAML file. 
You can refer to the [Horovod TrainingJob Example](./example/training_job.yaml) and modify it as needed.\n\n\n```\nkubectl apply -f examples/training_job.yaml\n```\n\n#### Check TrainingJob status\n\n```\n# kubectl get trainingjob\nNAME                          PHASE     AGE\nelastic-training              Running   77s\n```\n\n```\n# kubectl get po\nNAME                                      READY   STATUS             RESTARTS   AGE\nelastic-training-launcher                 1/1     Running            0          7s\nelastic-training-worker-0                 1/1     Running            0          10s\nelastic-training-worker-1                 1/1     Running            0          9s\n```\n\n\n### ScaleIn training job\nWhen you need to scale in the TrainingJob's workers, you can submit a ScaleIn custom resource.\nIn the `ScaleIn` spec, you need to specify the name of the TrainingJob; et-operator finds the matching TrainingJob and scales it in. You can specify the exact workers to scale in ([ScaleIn by pod](./example/scale_in_pod.yaml)) or just specify the count ([ScaleIn by count](./example/scale_in_count.yaml)).\n\n```\nkubectl create -f examples/scale_in_count.yaml\n```\n\n#### Check ScaleIn status\n\n```\n# kubectl get scalein\nNAME                                     PHASE            AGE\nscalein-sample-t8jxd                     ScaleSucceeded   11s\n```\n\n\n```\n# kubectl get po\nNAME                                      READY   STATUS             RESTARTS   AGE\nelastic-training-launcher                 1/1     Running            0          47s\nelastic-training-worker-0                 1/1     Running            0          50s\n```\n\n### ScaleOut training job\nWhen you need to scale out the TrainingJob's workers, you can submit a ScaleOut custom resource, which only specifies the number of workers you want to add.\n\n```\nkubectl create -f examples/scale_out.yaml\n```\n\n#### Check ScaleOut status\n\n```\n# kubectl get scaleout\nNAME                   
                  PHASE            AGE\nelastic-training-scaleout-9dtmw          ScaleSucceeded   30s\n```\n\n```\n# kubectl get po\nNAME                                      READY   STATUS             RESTARTS   AGE\nelastic-training-launcher                 1/1     Running            0          2m5s\nelastic-training-worker-0                 1/1     Running            0          2m8s\nelastic-training-worker-1                 1/1     Running            0          40s\nelastic-training-worker-2                 1/1     Running            0          40s\n```\n\n\n## Roadmap\n\n* Use `kubectl exec` to replace ssh: the major blocking problem is that `kubectl exec` hangs when the target pod shuts down, whereas we want the process to exit. \n* Support spot instances on public cloud platforms: before a node is released, we should trigger a scaleIn for the training jobs whose workers are on the spot nodes.\n* Support fault tolerance\n\n## Developing\nPrerequisites:\n\n* Go \u003e= 1.8\n* kubebuilder \u003e= 0.4.1\n\n```\nmkdir -p $(go env GOPATH)/src/github.com/aliyunContainerService\ncd $(go env GOPATH)/src/github.com/aliyunContainerService\ngit clone https://github.com/aliyunContainerService/et-operator\ncd et-operator\nmake\n```\n\nBuild the operator:\n\n```\nexport IMG=\u003cimage repo\u003e\nmake docker-build\nmake docker-push\n```\n\n\nRun the operator locally:\n\n```\nmake run-local\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faliyuncontainerservice%2Fet-operator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faliyuncontainerservice%2Fet-operator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faliyuncontainerservice%2Fet-operator/lists"}