{"id":20859780,"url":"https://github.com/stefanofioravanzo/dl-operator","last_synced_at":"2026-05-18T19:04:30.121Z","repository":{"id":77066981,"uuid":"148660241","full_name":"StefanoFioravanzo/dl-operator","owner":"StefanoFioravanzo","description":"General purpose Kubernetes operator for DL frameworks written in Python","archived":false,"fork":false,"pushed_at":"2018-09-17T15:07:10.000Z","size":12,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-12T14:26:45.741Z","etag":null,"topics":["deep-learning","distributed-training","kubernetes","kubernetes-python-client","operator"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StefanoFioravanzo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-13T15:40:30.000Z","updated_at":"2018-09-17T15:07:12.000Z","dependencies_parsed_at":"2024-02-18T16:15:07.879Z","dependency_job_id":null,"html_url":"https://github.com/StefanoFioravanzo/dl-operator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/StefanoFioravanzo/dl-operator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdl-operator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdl-operator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdl-operator/releases","manifests_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdl-operator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StefanoFioravanzo","download_url":"https://codeload.github.com/StefanoFioravanzo/dl-operator/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanoFioravanzo%2Fdl-operator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275690368,"owners_count":25510497,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-17T02:00:09.119Z","response_time":84,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","distributed-training","kubernetes","kubernetes-python-client","operator"],"created_at":"2024-11-18T04:53:07.921Z","updated_at":"2025-09-18T00:36:53.625Z","avatar_url":"https://github.com/StefanoFioravanzo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Custom Operator for DL jobs\n\n## Overview\n\nThis project is a proof of concept of a Python Kubernetes operator for deep learning jobs. 
The project was inspired by Kubeflow's [tf-operator](https://github.com/kubeflow/tf-operator) and then by the work done to implement the [mx-operator](https://github.com/StefanoFioravanzo/mx-operator).\n\n**Supported frameworks**:\n\n- TensorFlow\n- MXNet\n\nThe operator was built to be general purpose, so adding a new deep learning framework should be straightforward. The operator works best with deep learning frameworks that can set up a cluster just by reading environment variables. For such frameworks, one only needs to extend the base class and provide the framework-specific environment variables and container parameters.\n\n## Structure\n\n```\n.\n├── controller.py\n├── crds\n│   ├── crd\n│   │   └── mxjob.yml\n│   └── mxjob_test.yml\n├── dl_job.py\n├── logging.ini\n├── main.py\n├── replica.py\n└── settings\n    └── settings.py\n```\n\n#### `DLOperator`\n\nDefined in `controller.py`, the `DLOperator` object continuously watches for new job requests by subscribing to the stream of custom resource events.\n\nWhen the operator receives a job it has not yet registered, it creates a new `DLJob`.\n\n#### `DLJob`\n\nThe `DLJob` object is responsible for keeping the actual cluster state in line with the provided job spec. This constant cycle of checking the current state and reconciling it with the desired state is the focal point of any custom Kubernetes controller.\n\n`DLJob` is an abstract class that must be extended with the specific deep learning framework details.\n\nA correct subclass must provide the following class variables:\n\n- `job_type`: CRD name (e.g. _MXJob_)\n- `replica_types`: list of the replica type names (e.g. _scheduler_, _master_, ...)\n- `container_properties`: dictionary of container properties (e.g. 
open ports, mount volume paths)\n\nThe following abstract method must also be implemented:\n\n- `get_environment_variables()`: returns a dictionary of environment variables to be injected into the container.\n\n#### `Replica`\n\nA `Replica` object encapsulates all the logic related to one running pod. Each replica manages the reconciliation process for its pod, from the creation of the pod and its service until the pod's death.\n\n#### Run\n\nTo test and run the operator, it is advisable to use a local single-node cluster via `minikube`. The operator creates the necessary custom resource definitions by itself and cleans up all the allocated resources when killed.\n\n```bash\n# start minikube cluster\nminikube start\n# check cluster status\nminikube status\n# check kubectl is using the proper context\nkubectl config get-contexts\n\n# run the operator\npython main.py\n\n# deploy dljob\nkubectl create -f crds/mxjob_test.yml\n```\n\n## TODO\n\n- [ ] Allow network communication via hostnames\n- [ ] Monitor process exit statuses and react accordingly\n- [ ] Improve job status and result reporting\n- [ ] Add automated tests","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefanofioravanzo%2Fdl-operator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstefanofioravanzo%2Fdl-operator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefanofioravanzo%2Fdl-operator/lists"}