https://github.com/typhoonzero/kfpdist
Use Kubeflow Pipeline and Argo to run distributed training jobs.
- Host: GitHub
- URL: https://github.com/typhoonzero/kfpdist
- Owner: typhoonzero
- Created: 2022-04-02T08:04:47.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-11-25T04:03:04.000Z (over 3 years ago)
- Last Synced: 2025-05-21T11:54:11.402Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 57.6 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Kubeflow Pipeline distributed training support
kfp-dist-train provides utilities for use with
[Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/)
so that distributed training code can be written directly with the Kubeflow Pipelines SDK.
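As a rough illustration of what distributed training code needs at runtime, each worker usually has to know the addresses of all workers and its own index. The sketch below builds a TensorFlow-style `TF_CONFIG` value for one worker. It is not taken from kfp-dist-train: the function name `build_tf_config`, the `worker-<i>` host names, and the port are assumptions made for the example, as is the choice of TensorFlow as the framework.

```python
import json


def build_tf_config(num_workers: int, index: int, port: int = 2222) -> str:
    """Return a TF_CONFIG JSON string for the worker at position `index`.

    Hypothetical helper: the host naming scheme and default port are
    illustrative assumptions, not values used by kfp-dist-train.
    """
    if not 0 <= index < num_workers:
        raise ValueError("index must be in the range [0, num_workers)")
    # Every worker sees the same cluster description ...
    workers = [f"worker-{i}:{port}" for i in range(num_workers)]
    # ... and differs only in its own task entry.
    return json.dumps(
        {
            "cluster": {"worker": workers},
            "task": {"type": "worker", "index": index},
        }
    )
```

A worker process would typically export this string as the `TF_CONFIG` environment variable before creating its distribution strategy.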
## Get Started
1. Set up a Kubeflow environment (for example, with https://github.com/alauda/kubeflow-chart).
2. Upload the example [kfp-dist-train.ipynb](./kfkp-dist-train.ipynb) into a Notebook
instance, or set up local pipeline submission.
3. Execute the example to submit a workflow. You can configure the number of workers
in the Kubeflow web UI. The job should look like below:

## Roadmap
- support a `kfpdist.component(dist=True)` decorator as a wrapper around `dsl.component`
- support the parameter server strategy
- support PyTorch
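
The decorator item above could take roughly the following shape. This is only a sketch of the decorator pattern, not the planned implementation: `_base_component` is a stand-in for `dsl.component` so the example runs without the KFP SDK, and the `dist` and `is_component` attributes are hypothetical markers.

```python
from typing import Callable, Optional


def _base_component(func: Callable) -> Callable:
    """Stand-in for `dsl.component`; a real wrapper would call the KFP SDK here."""
    func.is_component = True
    return func


def component(func: Optional[Callable] = None, *, dist: bool = False):
    """Hypothetical `kfpdist.component`: wrap the base decorator and record `dist`."""

    def decorate(f: Callable) -> Callable:
        wrapped = _base_component(f)
        # Tag the component so later tooling could expand it into N workers.
        wrapped.dist = dist
        return wrapped

    if func is not None:
        # Used as a bare decorator: @component
        return decorate(func)
    # Used with arguments: @component(dist=True)
    return decorate


@component(dist=True)
def train(epochs: int) -> str:
    return f"trained for {epochs} epochs"
```

Accepting both `@component` and `@component(dist=True)` keeps the wrapper usable as a drop-in replacement for the decorator it wraps.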