Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kubeflow/pytorch-operator
PyTorch on Kubernetes
https://github.com/kubeflow/pytorch-operator
Last synced: about 2 months ago
JSON representation
PyTorch on Kubernetes
- Host: GitHub
- URL: https://github.com/kubeflow/pytorch-operator
- Owner: kubeflow
- License: apache-2.0
- Archived: true
- Created: 2018-03-09T18:51:51.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2021-12-01T17:44:48.000Z (about 3 years ago)
- Last Synced: 2024-11-12T21:49:19.126Z (2 months ago)
- Language: Jsonnet
- Homepage:
- Size: 71.2 MB
- Stars: 306
- Watchers: 31
- Forks: 143
- Open Issues: 63
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-AI-kubernetes - PyTorch - operator), [Chainer](https://github.com/kubeflow/chainer-operator), and more. We also integrate with [Ambassador](https://www.getambassador.io/) for ingress and [Pachyderm](http://pachyderm.io/) for managing your data science pipelines. (Uncategorized / Uncategorized)
README
# Kubernetes Custom Resource and Operator for PyTorch jobs
## :warning: **kubeflow/pytorch-operator is not maintained**This operator has been merged into [Kubeflow Training Operator](https://github.com/kubeflow/training-operator). This repository is not maintained and has been archived.
[![Build Status](https://travis-ci.org/kubeflow/pytorch-operator.svg?branch=master)](https://travis-ci.org/kubeflow/pytorch-operator)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/pytorch-operator)](https://goreportcard.com/report/github.com/kubeflow/pytorch-operator)## Overview
This repository contains the specification and implementation of `PyTorchJob` custom resource definition. Using this custom resource, users can create and manage PyTorch jobs like other built-in resources in Kubernetes. See [CRD definition](https://github.com/kubeflow/pytorch-operator/blob/master/manifests/base/crd.yaml)
## Prerequisites
- Kubernetes >= 1.8
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl)## Installing PyTorch Operator
Please refer to the installation instructions in the [Kubeflow user guide](https://www.kubeflow.org/docs/started/getting-started/). This installs `pytorchjob` CRD and `pytorch-operator` controller to manage the lifecycle of PyTorch jobs.
## Creating a PyTorch Job
You can create PyTorch Job by defining a PyTorchJob config file. See the manifests for the [distributed MNIST example](./examples/mnist/). You may change the config file based on your requirements.
```
cat examples/mnist/v1/pytorch_job_mnist_gloo.yaml
```
Deploy the PyTorchJob resource to start training:```
kubectl create -f examples/mnist/v1/pytorch_job_mnist_gloo.yaml
```
You should now be able to see the created pods matching the specified number of replicas.```
kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo
```
Training should run for about 10 epochs and takes 5-10 minutes on a cpu cluster. Logs can be inspected to see its training progress.```
PODNAME=$(kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-type=master -o name)
kubectl logs -f ${PODNAME}
```
## Monitoring a PyTorch Job```
kubectl get -o yaml pytorchjobs pytorch-dist-mnist-gloo
```
See status section to monitor the job status. Here is sample output when the job is successfully completed.```
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
creationTimestamp: 2019-01-11T00:51:48Z
generation: 1
name: pytorch-dist-mnist-gloo
namespace: default
resourceVersion: "2146573"
selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/pytorchjobs/pytorch-dist-mnist-gloo
uid: 13ad0e7f-153b-11e9-b5c1-42010a80001e
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- args:
- --backend
- gloo
image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
name: pytorch
resources:
limits:
nvidia.com/gpu: "1"
Worker:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- args:
- --backend
- gloo
image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
name: pytorch
resources:
limits:
nvidia.com/gpu: "1"
status:
completionTime: 2019-01-11T01:03:15Z
conditions:
- lastTransitionTime: 2019-01-11T00:51:48Z
lastUpdateTime: 2019-01-11T00:51:48Z
message: PyTorchJob pytorch-dist-mnist-gloo is created.
reason: PyTorchJobCreated
status: "True"
type: Created
- lastTransitionTime: 2019-01-11T00:57:22Z
lastUpdateTime: 2019-01-11T00:57:22Z
message: PyTorchJob pytorch-dist-mnist-gloo is running.
reason: PyTorchJobRunning
status: "False"
type: Running
- lastTransitionTime: 2019-01-11T01:03:15Z
lastUpdateTime: 2019-01-11T01:03:15Z
message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
reason: PyTorchJobSucceeded
status: "True"
type: Succeeded
replicaStatuses:
Master:
succeeded: 1
Worker:
succeeded: 1
startTime: 2019-01-11T00:57:22Z
```## Contributing
Please refer to the [developer_guide](developer_guide.md).