{"id":13562753,"url":"https://github.com/intelligent-machine-learning/dlrover","last_synced_at":"2025-12-29T05:26:57.580Z","repository":{"id":68841099,"uuid":"506953078","full_name":"intelligent-machine-learning/dlrover","owner":"intelligent-machine-learning","description":"DLRover: An Automatic Distributed Deep Learning System","archived":false,"fork":false,"pushed_at":"2025-04-08T12:20:40.000Z","size":144673,"stargazers_count":1408,"open_issues_count":31,"forks_count":176,"subscribers_count":44,"default_branch":"master","last_synced_at":"2025-04-10T05:39:48.477Z","etag":null,"topics":["distributed-training","hacktoberfest","k8s","llm-training"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/intelligent-machine-learning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-24T09:31:07.000Z","updated_at":"2025-04-10T04:13:08.000Z","dependencies_parsed_at":"2023-09-24T07:04:00.730Z","dependency_job_id":"528c3bf2-60be-4635-a462-d7a3224c727e","html_url":"https://github.com/intelligent-machine-learning/dlrover","commit_stats":{"total_commits":2172,"total_committers":57,"mean_commits":38.10526315789474,"dds":"0.44429097605893186","last_synced_commit":"30eba9da77ef12e3bb0687c82b5124bb1517f370"},"previous_names":["intelligent-machine-learning/easydl"],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intelligent-machine-learning%2Fdlrover","tags_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intelligent-machine-learning%2Fdlrover/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intelligent-machine-learning%2Fdlrover/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intelligent-machine-learning%2Fdlrover/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/intelligent-machine-learning","download_url":"https://codeload.github.com/intelligent-machine-learning/dlrover/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248166882,"owners_count":21058479,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-training","hacktoberfest","k8s","llm-training"],"created_at":"2024-08-01T13:01:11.939Z","updated_at":"2025-12-29T05:26:57.574Z","avatar_url":"https://github.com/intelligent-machine-learning.png","language":"Python","funding_links":[],"categories":["Computation and Communication Optimisation","Python","Training"],"sub_categories":["Framework"],"readme":"\n\u003cdiv id=\"top\" align=\"center\"\u003e\n\u003cimg src=\"docs/figures/dlrover_logo.png\" alt=\"Editor\" width=\"350\"\u003e\n  \n\u003ch1\u003eDLRover: An Automatic Distributed Deep Learning System\u003c/h1\u003e\n\n[![Build](https://github.com/intelligent-machine-learning/easydl/actions/workflows/main.yml/badge.svg)](https://github.com/intelligent-machine-learning/easydl/actions/workflows/main.yml)\n[![OpenSSF Best 
Practices](https://www.bestpractices.dev/projects/9827/badge)](https://www.bestpractices.dev/projects/9827)\n[![Code Coverage](https://codecov.io/gh/intelligent-machine-learning/dlrover/branch/master/graph/badge.svg)](https://codecov.io/gh/intelligent-machine-learning/dlrover)\n[![PyPI Status Badge](https://badge.fury.io/py/dlrover.svg)](https://pypi.org/project/dlrover/)\n\u003c/div\u003e\n\nDLRover makes the distributed training of large AI models easy, stable, fast and green.\nIt automatically trains deep learning models on a distributed cluster.\nIt lets model developers focus on model architecture without handling\nengineering details such as hardware acceleration and distributed execution.\nIt currently provides automated operation and maintenance for deep learning\ntraining jobs on K8s/Ray. Its major features are:\n\n- **Fault-Tolerance**: Distributed training can continue running in the event of failures.\n- **Flash Checkpoint**: Distributed training can recover from failures in seconds using the in-memory checkpoint.\n- **Auto-Scaling**: Distributed training can scale resources up or down to improve stability, throughput\nand resource utilization.\n\nFurthermore, DLRover offers extension libraries for PyTorch and TensorFlow to expedite training. 
These are also open-source projects available in our [GitHub repositories](https://github.com/intelligent-machine-learning).\n- [ATorch](https://github.com/intelligent-machine-learning/atorch): an extension library for PyTorch that speeds up LLM training.\n- [TFPlus](https://github.com/intelligent-machine-learning/tfplus): an extension library for TensorFlow that speeds up training of search, recommendation and advertisement models.\n\n## Latest News\n\n- [2025/08] [Practice: Gang Scheduling with DLRover](docs/tutorial/gang_scheduling.md)\n- [2025/01] [EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models, ICLR'25.](https://arxiv.org/abs/2412.07210)\n- [2024/06] [DLRover-RM has been accepted by VLDB'24.](docs/blogs/dlrover_rm.md)\n- [2024/04] [Flash Checkpoint Supports HuggingFace transformers.Trainer to Asynchronously persist checkpoints.](docs/blogs/flash_checkpoint.md#huggingface-transformerstrainer)\n- [2024/02] [Flash Checkpoint Saves the Megatron-LM Checkpoint in Seconds.](docs/blogs/megatron_flash_checkpoint.md)\n- [2024/01] [Flash Checkpoint to Recover Large Model Training From Failure in Seconds.](docs/blogs/flash_checkpoint.md)\n- [2023/11] [ATorch supporting efficient and easy-to-use model training is released.](https://github.com/intelligent-machine-learning/atorch)\n- [2023/10] [AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference as Preconditioning Matrix, NeurIPS'23.](https://github.com/intelligent-machine-learning/atorch/blob/main/docs/README-AGD.md)\n- [2023/09] [Weighted Sharpness-Aware Minimization (WSAM) has been accepted by KDD'23.](https://github.com/intelligent-machine-learning/atorch/blob/main/docs/README-WSAM.md)\n- [2023/08] [DLRover improves the stability of pre-trained model training over thousands of GPUs.](docs/blogs/stabilize_llm_training_cn.md)\n- [2023/04] [DLRover auto-scales nodes of a DeepRec distributed training job.](docs/blogs/deeprec_autoscale_cn.md)\n\n## Why DLRover?\n\n### Fault 
Tolerance to Reduce the Downtime of a Large Scale Training Job\n\nDLRover can restore training after a process fails, without stopping the\ntraining job. To restore training, DLRover:\n\n1. Automatically diagnoses the cause of the failure.\n2. Restarts only the process, not the node, for software errors.\n3. Restarts the failed nodes for hardware errors.\n\nFor details, see the [blog of fault-tolerance and elasticity](docs/blogs/stabilize_llm_training_cn.md).\n**With fault tolerance, the goodput of GLM-65B training\non thousands of GPUs increased from 69% to 95%**. Goodput is the time spent computing\nuseful new steps divided by the elapsed time of the training job.\nThe downtime breakdown is shown below:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/figures/dlrover-goodput-performance.jpg\" alt=\"Editor\" width=\"600\"\u003e\n\u003c/div\u003e\n\n#### Fault Tolerance and Flash Checkpoint to Reduce Downtime of PyTorch Training\n\nIn addition to fault tolerance, DLRover provides [flash checkpoint](docs/blogs/flash_checkpoint.md) to\nsave and load checkpoints in seconds. With flash checkpoint, training can\nsave checkpoints frequently, which reduces the number of steps rolled back when training resumes\nfrom the latest checkpoint after a failure. The features of flash checkpoint are:\n\n1. Asynchronously persist the checkpoint to the storage.\n2. Persist the checkpoint to the storage once the training process fails.\n3. Load the checkpoint from the host memory after the training process restarts.\n4. 
APIs for DDP, FSDP, DeepSpeed and Megatron-LM ([cb995d5](https://github.com/NVIDIA/Megatron-LM/tree/cb995d571faea19d01a1bf55ed0fd89523b9ce64)).\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/figures/ft_llm_training/checkpoint_save_time.png\" alt=\"Editor\" width=\"396\"\u003e\n\u003cimg src=\"docs/figures/ft_llm_training/checkpoint_load_time.jpg\" alt=\"Editor\" width=\"400\"\u003e\n\n\u003ctext\u003e The Performance of DLRover Flash Checkpoint to Save/Load GPT2-1.5B.\u003c/text\u003e\n\u003c/div\u003e\n\nThe figure shows the I/O time that different DL frameworks take to read checkpoint files\nwhen resuming training processes. With DLRover Flash Checkpoint,\nrecovery can complete on the order of seconds by loading checkpoints directly from shared memory,\nwhich is much faster than loading checkpoints from SSD or NAS.\n\n#### Fault Tolerance Improves the Stability of TensorFlow PS Training\n\nDLRover can recover failed parameter servers and workers to resume training.\n\n1. DLRover can automatically launch a Pod with more memory to recover the OOM node.\n2. DLRover can reassign the training data of a failed worker to other workers.\n3. DLRover can automatically scale up the parameter servers to fit the model size.\n\nAt AntGroup, DLRover manages hundreds of DL training jobs every day on a customized Kubernetes cluster.\nExcluding jobs that failed because of code errors, **the rate of completed jobs increased from 89%\nwith tf-operator in KubeFlow to 95%**. 
Other causes of unrecoverable job failures include data errors,\nNaN loss of the model, network breakdown, and so on.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/figures/job-complete-rate.png\" alt=\"Editor\" width=\"600\"\u003e\n\u003c/div\u003e\n\n### Auto-Scaling to Improve Training Performance and Resource Utilization\n\nDLRover automatically scales resources (for parameter servers or workers) up or down while a training job is running.\nBy monitoring the workload of nodes and the throughput, DLRover can diagnose bottlenecks in the resource configuration.\nCommon bottlenecks include straggler nodes, an unbalanced workload across parameter servers, insufficient CPU cores on nodes,\nand an insufficient number of nodes. DLRover can improve training performance through dynamic resource adjustment.\n\nTo improve training throughput, users tend to\nover-provision resources for their jobs to\navoid any potential risk from insufficient resources.\nThis usually results in significant resource waste. DLRover Auto-Scaling\nallocates resources according to the demand of model training to reduce\nthat waste.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/figures/daily-job-resource-util.png\" alt=\"Editor\" width=\"1000\"\u003e\n\u003c/div\u003e\n\n### Dynamic Data Sharding For Elasticity and Fault-tolerance\n\nDynamic data sharding splits the dataset into many small shards, and each shard\ncontains only a few batches of training samples. A worker gets a new shard only after it has used up\nthe samples of the previous one. 
With dynamic sharding, DLRover can\n\n- recover the shard if the worker fails before using up samples of the shard.\n- mitigate the worker straggler by assigning more shards to the fast worker.\n\n### Integration to Offline and Online Deep Learning\n\nWith the data source transparency provided by dynamic data sharding, DLRover can be integrated with\noffline training, which consumes batch data, and also supports online learning with real-time streaming data\n(fed with a message queue like RocketMQ/Kafka/Pulsar/...,\nor executed as a training sink node inside Flink/Spark/Ray/...).\n\nIn practice, DLRover is an ideal component for building an end-to-end industrial online learning system.\n[estimator.md](docs/tutorial/estimator.md) provides a detailed example implemented with `tf.estimator.Estimator`.\n\n## How to Use DLRover to Train Your Models?\n\n### Train a PyTorch Model\n\nWe can use `dlrover-run` to run any training script that\n`torchrun` or `torch.distributed.run` can run.\n\n```bash\npip install 'dlrover[k8s,torch]'\ndlrover-run --nnodes=1 --nproc_per_node=$NUM_TRAINERS train_scripts.py\n```\n\nMore detailed tutorials are:\n\n- [Elastic scheduling tutorial](docs/tutorial/torch_elasticjob_on_k8s.md) to\nsupport elasticity and fault tolerance of Pod on k8s.\n- [Node detection tutorial](docs/tutorial/check_node_health.md) to check the fault or slow node in a distributed job.\n- [Flash Checkpoint](docs/blogs/flash_checkpoint.md) to speed up checkpoint during training.\n\n### Train a TensorFlow Model\n\nWe can use DLRover to train a TensorFlow model with the following steps:\n\n- Use TensorFlow estimator to develop the TensorFlow model.\n- Define the input of `tf.dataset` in a training configuration of DLRover.\n- Define your reader to read samples from the dataset file.\n\nWe can refer to [estimator.md](docs/tutorial/estimator.md) to train\na model with DLRover.\n\n## What's Next?\n\n- Multi-node in-memory redundant backup checkpoint for fast failure recovery.\n- 
Fine-grained automatic distributed training for GPU synchronous jobs\n  - hybrid-parallel mode\n  - adaptive hyperparameter adjustment with dynamic resources\n  - more strategies for fine-grained scenarios\n- Full stack solution for Online Deep Learning\n- High performance extension library for TensorFlow/PyTorch to speed up training\n- ...\n\n## Contributing\n\nPlease refer to [DEVELOPMENT](docs/developer_guide.md).\n\n## Quick Start\n\n[An Example of Flash Checkpoint.](examples/pytorch/fcp_demo.py)\n\n[Train a PyTorch Model on Kubernetes.](docs/tutorial/torch_elasticjob_on_k8s.md)\n\n[Train a GPT Model on Kubernetes.](docs/tutorial/torch_nanogpt.md)\n\n[Train a TensorFlow Estimator on Kubernetes.](docs/tutorial/tf_elasticjob_on_k8s.md)\n\n## Community\n\nScan the DingTalk QR code or search for \"AI Infra\" in WeChat (微信) to join the DLRover group.\nThe DingTalk QR code is:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/figures/wx-infra.jpg\" alt=\"Editor\" width=\"400\"\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/figures/dlrover_ding_group_20251125.png\" alt=\"Editor\" width=\"400\"\u003e\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintelligent-machine-learning%2Fdlrover","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fintelligent-machine-learning%2Fdlrover","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintelligent-machine-learning%2Fdlrover/lists"}