{"id":13688171,"url":"https://github.com/sql-machine-learning/elasticdl","last_synced_at":"2025-05-16T17:02:58.952Z","repository":{"id":35156924,"uuid":"154232678","full_name":"sql-machine-learning/elasticdl","owner":"sql-machine-learning","description":"Kubernetes-native Deep Learning Framework","archived":false,"fork":false,"pushed_at":"2024-01-26T07:21:05.000Z","size":32143,"stargazers_count":740,"open_issues_count":89,"forks_count":115,"subscribers_count":46,"default_branch":"develop","last_synced_at":"2025-05-14T11:09:38.617Z","etag":null,"topics":["deep-learning","distributed-systems","kubernetes","tensorflow"],"latest_commit_sha":null,"homepage":"https://elasticdl.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sql-machine-learning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-22T23:53:10.000Z","updated_at":"2025-04-07T02:14:32.000Z","dependencies_parsed_at":"2024-11-12T04:40:35.289Z","dependency_job_id":null,"html_url":"https://github.com/sql-machine-learning/elasticdl","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sql-machine-learning%2Felasticdl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sql-machine-learning%2Felasticdl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sql-machine-learning%2Felasticdl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sql-machine-learning%2Felasticdl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sql-machine-learning","download_url":"https://codeload.github.com/sql-machine-learning/elasticdl/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254573589,"owners_count":22093731,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","distributed-systems","kubernetes","tensorflow"],"created_at":"2024-08-02T15:01:08.254Z","updated_at":"2025-05-16T17:02:58.912Z","avatar_url":"https://github.com/sql-machine-learning.png","language":"Python","readme":"# ElasticDL: A Kubernetes-native Deep Learning Framework\n\n[![Travis-CI Build Status](https://travis-ci.com/sql-machine-learning/elasticdl.svg?branch=develop)](https://travis-ci.com/sql-machine-learning/elasticdl)\n[![Code Coverage](https://codecov.io/gh/sql-machine-learning/elasticdl/branch/develop/graph/badge.svg)](https://codecov.io/gh/sql-machine-learning/elasticdl)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n[![PyPI Status 
Badge](https://badge.fury.io/py/elasticdl-client.svg)](https://pypi.org/project/elasticdl-client/)\n\nElasticDL is a Kubernetes-native deep learning framework\nthat supports fault-tolerance and elastic scheduling.\n\n## Main Features\n\n### Elastic Scheduling and Fault-Tolerance\n\nThrough Kubernetes-native design, ElasticDL enables fault-tolerance and works\nwith the priority-based preemption of Kubernetes to achieve elastic scheduling\nfor deep learning tasks.\n\n### Support TensorFlow and PyTorch\n\n- TensorFlow Estimator.\n- TensorFlow Keras.\n- PyTorch\n\n### Minimalism Interface\n\nGiven a [model](model_zoo/mnist_functional_api/mnist_functional_api.py) defined\nwith Keras API, train the model distributedly with a command line.\n\n```bash\nelasticdl train \\\n  --image_name=elasticdl:mnist \\\n  --model_zoo=model_zoo \\\n  --model_def=mnist.mnist_functional_api.custom_model \\\n  --training_data=/data/mnist/train \\\n  --job_name=test-mnist \\\n  --volume=\"host_path=/data,mount_path=/data\"\n```\n\n## Quick Start\n\nPlease check out our [step-by-step tutorial](docs/tutorials/get_started.md) for\nrunning ElasticDL on local laptop, on-prem cluster, or on public cloud such as\nGoogle Kubernetes Engine.\n\n[TensorFlow Estimator on MiniKube](docs/tutorials/elasticdl_estimator.md)\n\n[TensorFlow Keras on MiniKube](docs/tutorials/elasticdl_local.md)\n\n[PyTorch on MiniKube](docs/tutorials/elasticdl_torch.md )\n\n## Background\n\nTensorFlow/PyTorch has its native distributed computing feature that is\nfault-recoverable. In the case that some processes fail, the distributed\ncomputing job would fail; however, we can restart the job and recover its status\nfrom the most recent checkpoint files.\n\nElasticDL supports fault-tolerance during distributed training.\nIn the case that some processes fail, the job would\ngo on running. Therefore, ElasticDL doesn't need to save checkpoint nor recover\nfrom checkpoints.\n\nThe feature of fault-tolerance makes ElasticDL works with the priority-based\npreemption of Kubernetes to achieve elastic scheduling.  When Kubernetes kills\nsome processes of a job to free resource for new-coming jobs with higher\npriority, the current job doesn't fail but continues with less resource.\n\nElastic scheduling could significantly improve the overall utilization of a\ncluster. Suppose that a cluster has N GPUs, and a job is using one of\nthem. Without elastic scheduling, a new job claiming N GPUs would have to wait\nfor the first job to complete before starting. This pending time could be hours,\ndays, or even weeks. During this very long time, the utilization of the cluster\nis 1/N. With elastic scheduling, the new job could start running immediately\nwith N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the\nfirst job completes.  In this case, the overall utilization is 100%.\n\nThe feature of elastic scheduling of ElasticDL comes from its Kubernetes-native\ndesign -- it doesn't rely on Kubernetes extensions like Kubeflow to run\nTensorFlow/PyTorch programs; instead, the master process of an ElasticDL job calls\nKubernetes API to start workers and parameter servers; it also watches events\nlike process/pod killing and reacts to such events to realize fault-tolerance.\n\nIn short, ElasticDL enhances TensorFlow/PyTorch with fault-tolerance and elastic\nscheduling in the case that you have a Kubernetes cluster. We provide a tutorial\nshowing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL\njobs there.  
We respect TensorFlow's native distributed computing feature, which\ndoesn't require specific computing platforms like Kubernetes and allows\nTensorFlow running on any platform.\n\n## Development Guide\n\nPlease refer to [this document](elasticdl/README.md) for development guide.\n","funding_links":[],"categories":["分布式机器学习","Python","AI \u0026 Machine Learning Platforms"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsql-machine-learning%2Felasticdl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsql-machine-learning%2Felasticdl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsql-machine-learning%2Felasticdl/lists"}