Accelerated Computing with Red Hat OpenShift Container Platform 4 and Nvidia Tesla GPU

Add a GPU Node
View the existing nodes, machines, and machine sets.

$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-195.us-east-2.compute.internal Ready master 2d v1.13.4+9b19d73a0
ip-10-0-130-109.us-east-2.compute.internal Ready worker 2d v1.13.4+9b19d73a0
ip-10-0-146-247.us-east-2.compute.internal Ready worker 2d v1.13.4+9b19d73a0
ip-10-0-155-131.us-east-2.compute.internal Ready master 2d v1.13.4+9b19d73a0
ip-10-0-172-69.us-east-2.compute.internal Ready master 2d1h v1.13.4+9b19d73a0
ip-10-0-173-178.us-east-2.compute.internal Ready worker 2d v1.13.4+9b19d73a0

$ oc get machines -n openshift-machine-api
NAME INSTANCE STATE TYPE REGION ZONE AGE
cluster-aus-9195-jg5lt-master-0 i-0ceb008caaf39a24c running m4.xlarge us-east-2 us-east-2a 2d1h
cluster-aus-9195-jg5lt-master-1 i-0ec0821f2147261fa running m4.xlarge us-east-2 us-east-2b 2d1h
cluster-aus-9195-jg5lt-master-2 i-0dcb0ed63c1b5cb96 running m4.xlarge us-east-2 us-east-2c 2d1h
cluster-aus-9195-jg5lt-worker-us-east-2a-n87sn i-05600905386b59947 running m5.2xlarge us-east-2 us-east-2a 2d1h
cluster-aus-9195-jg5lt-worker-us-east-2b-65q5f i-0218c84d5f14c2b14 running m5.2xlarge us-east-2 us-east-2b 2d1h
cluster-aus-9195-jg5lt-worker-us-east-2c-9xtkk i-00ed15f28a8b685c5 running m5.2xlarge us-east-2 us-east-2c 2d1h

$ oc get machinesets -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
cluster-aus-9195-jg5lt-worker-us-east-2a 1 1 1 1 2d1h
cluster-aus-9195-jg5lt-worker-us-east-2b 1 1 1 1 2d1h
cluster-aus-9195-jg5lt-worker-us-east-2c 1 1 1 1 2d1h

Export an existing worker machine set definition twice: once to keep for reference and once to edit into a GPU-enabled machine set definition.

$ oc get machineset cluster-aus-9195-jg5lt-worker-us-east-2a -n openshift-machine-api -o json > cluster-aus-9195-jg5lt-worker-us-east-2a.json

$ oc get machineset cluster-aus-9195-jg5lt-worker-us-east-2a -n openshift-machine-api -o json > cluster-aus-9195-jg5lt-gpu-us-east-2a.json

Make the following changes to the new machine set definition (a scripted jq version is sketched after this list):
Change the instance type to p3.2xlarge, an AWS P3 instance type that includes an Nvidia Tesla V100 GPU.
Read more about AWS P3 instance types: https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing
Change the name and self link to a unique name that identifies the new machine set.
Delete the status section from the definition.
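
The same changes can also be scripted rather than made by hand. The jq filter below is a minimal sketch that assumes the file names used above and the standard installer-generated machine set layout; it renames the machine set, rewrites the self link and machineset labels, swaps the instance type, and drops the status section.

$ GPU_MS=cluster-aus-9195-jg5lt-gpu-us-east-2a
$ jq --arg name "$GPU_MS" '
    .metadata.name = $name |
    .metadata.selfLink = "/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/" + $name |
    .spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"] = $name |
    .spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"] = $name |
    .spec.template.spec.providerSpec.value.instanceType = "p3.2xlarge" |
    del(.status)' \
    cluster-aus-9195-jg5lt-worker-us-east-2a.json > "$GPU_MS".json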

$ jq .spec.template.spec.providerSpec.value.instanceType cluster-aus-9195-jg5lt-gpu-us-east-2a.json
"p3.2xlarge"

$ diff cluster-aus-9195-jg5lt-worker-us-east-2a.json cluster-aus-9195-jg5lt-gpu-us-east-2a.json
10c10
< "name": "cluster-aus-9195-jg5lt-worker-us-east-2a",
---
> "name": "cluster-aus-9195-jg5lt-gpu-us-east-2a",
13c13
< "selfLink": "/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/cluster-aus-9195-jg5lt-worker-us-east-2a",
---
> "selfLink": "/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/cluster-aus-9195-jg5lt-gpu-us-east-2a",
23c23
< "machine.openshift.io/cluster-api-machineset": "cluster-aus-9195-jg5lt-worker-us-east-2a"
---
> "machine.openshift.io/cluster-api-machineset": "cluster-aus-9195-jg5lt-gpu-us-east-2a"
33c33
< "machine.openshift.io/cluster-api-machineset": "cluster-aus-9195-jg5lt-worker-us-east-2a"
---
> "machine.openshift.io/cluster-api-machineset": "cluster-aus-9195-jg5lt-gpu-us-east-2a"
62c62
< "instanceType": "m5.2xlarge",
---
> "instanceType": "p3.2xlarge",
131,137d130
< },
< "status": {
< "availableReplicas": 1,
< "fullyLabeledReplicas": 1,
< "observedGeneration": 1,
< "readyReplicas": 1,
< "replicas": 1

Create the new machine set.

$ oc create -f cluster-aus-9195-jg5lt-gpu-us-east-2a.json
machineset.machine.openshift.io/cluster-aus-9195-jg5lt-gpu-us-east-2a created

$ oc -n openshift-machine-api get machinesets | grep gpu
cluster-aus-9195-jg5lt-gpu-us-east-2a 1 1 45s

The machine set replica count is set to “1” so a new machine is created automatically. View the new machine.
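
If more GPU capacity is needed later, the same machine set can be scaled up instead of creating another one; a small sketch using the machine set name above:

$ oc -n openshift-machine-api scale machineset cluster-aus-9195-jg5lt-gpu-us-east-2a --replicas=2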

$ oc -n openshift-machine-api get machines | grep gpu
cluster-aus-9195-jg5lt-gpu-us-east-2a-9vw7p i-0e63046e983d721e0 running p3.2xlarge us-east-2 us-east-2a 23s

View the node and its labels.
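
The node backing the new machine can be looked up from the machine's nodeRef; a sketch, assuming the machine name shown above:

$ oc -n openshift-machine-api get machine cluster-aus-9195-jg5lt-gpu-us-east-2a-9vw7p \
    -o jsonpath='{.status.nodeRef.name}'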

$ oc get node ip-10-0-138-78.us-east-2.compute.internal -o json | jq .metadata.labels
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "p3.2xlarge",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "us-east-2",
  "failure-domain.beta.kubernetes.io/zone": "us-east-2a",
  "kubernetes.io/hostname": "ip-10-0-138-78",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhcos",
  "node.openshift.io/os_version": "4.1"
}
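
Optionally, the GPU node can be tainted so that only pods that explicitly tolerate the GPU are scheduled on it; the pod definitions later in this walkthrough already carry a matching toleration. A sketch, assuming the node name above:

$ oc adm taint nodes ip-10-0-138-78.us-east-2.compute.internal nvidia.com/gpu=true:NoSchedule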

Configure the GPU

Deploy the Node Feature Discovery Operator

The Node Feature Discovery operator identifies hardware device features in nodes. Clone the git repo.

$ git init
$ git config --global user.email "[email protected]"
$ git config --global user.name "Jacob Liberman"
$ git config --list
$ git clone https://github.com/openshift/cluster-nfd-operator

View the cluster-nfd-operator container image tags.

$ skopeo inspect docker://quay.io/zvonkok/cluster-nfd-operator | jq ".Tag , .RepoTags"
"latest"
[
"v0.0.1",
"v4.1",
"p3",
"e2e",
"operand",
"latest",
"configmap",
"nvidia-label"
]

Update the 0700_cr.yaml manifest to use the 4.1 tagged version or “latest.”

$ cat cluster-nfd-operator/manifests/0700_cr.yaml
apiVersion: nfd.openshift.io/v1alpha1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-master-server
  namespace: REPLACE_NAMESPACE
spec:
  namespace: openshift-nfd
  image: quay.io/zvonkok/node-feature-discovery:v4.1
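
If editing the manifest by hand is inconvenient, the image tag can also be switched with a one-liner; a sketch, assuming the repository path above:

$ sed -i 's|node-feature-discovery:.*|node-feature-discovery:v4.1|' cluster-nfd-operator/manifests/0700_cr.yaml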

Build the NFD operator with make.

$ cd cluster-nfd-operator
$ make deploy

View the running NFD pods.

$ oc -n openshift-nfd get pods
NAME READY STATUS RESTARTS AGE
nfd-master-gq9wk 1/1 Running 0 30s
nfd-master-hkk2q 1/1 Running 0 30s
nfd-master-nvnlq 1/1 Running 0 30s
nfd-worker-4z767 1/1 Running 2 29s
nfd-worker-flz8d 1/1 Running 2 29s
nfd-worker-ghmd2 1/1 Running 2 29s
nfd-worker-qz2nl 1/1 Running 2 29s

View the Nvidia GPU feature discovered by the NFD operator; Nvidia devices are identified by the PCI vendor ID 10de.

$ oc describe node ip-10-0-138-78.us-east-2.compute.internal | egrep 'Roles|pci'
Roles: worker
feature.node.kubernetes.io/pci-1013.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true
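
These NFD labels can be used as node selectors. For example, a sketch that lists every node advertising an Nvidia PCI device (vendor ID 10de):

$ oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true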

Deploy the Special Resource Operator

Clone the special resource operator repository.

$ cd ~
$ git clone [email protected]:zvonkok/special-resource-operator.git

Build the special resource operator with “make deploy.”

$ cd special-resource-operator/
$ make deploy

View the special resource detected by the operator.

$ oc get specialresources --all-namespaces
NAMESPACE NAME AGE
openshift-sro-operator gpu 3m47s

View the pods created by the operator. It deploys daemon sets that load the Nvidia driver and device plugin and set the SELinux security context for the device, along with validation pods.

$ oc get pods -n openshift-sro
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-6t2tf 1/1 Running 0 2m10s
nvidia-device-plugin-validation 0/1 Completed 0 104s
nvidia-driver-daemonset-jgvbf 1/1 Running 0 3m47s
nvidia-driver-validation 0/1 Completed 0 3m11s
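
Once the driver and device plugin pods are running, the GPU should be advertised as an allocatable nvidia.com/gpu resource on the node; a quick check, assuming the node name from earlier:

$ oc describe node ip-10-0-138-78.us-east-2.compute.internal | grep nvidia.com/gpu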

View the logs of the validation pod created by the Special Resource Operator.

$ oc logs nvidia-device-plugin-validation -n openshift-sro
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

It runs a simple vector addition to verify a pod can use the GPU.
Connect to the device plugin daemonset pod and run nvidia-smi to verify communication with the GPU.

$ oc project openshift-sro
Now using project "openshift-sro" on server "https://api.cluster-aus-9195.sandbox311.opentlc.com:6443".

$ oc rsh nvidia-device-plugin-daemonset-6t2tf nvidia-smi
Sun Sep 1 04:01:54 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26 Driver Version: 430.26 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 34C P0 22W / 300W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Run an Nvidia Container

Run nvidia-smi in a container to test GPU access from a pod.

Create a pod definition for a nvidia-smi container.

$ cat nvidia-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  namespace: openshift-sro
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  serviceAccount: nvidia-device-plugin
  serviceAccountName: nvidia-device-plugin
  readOnlyRootFilesystem: true
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    - name: NVIDIA_REQUIRE_CUDA
      value: "cuda>=5.0"
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    command: [ nvidia-smi ]
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU

Create the pod.

$ oc create -f nvidia-smi.yaml
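
Before checking the logs, the scheduling decision can be confirmed directly; the NODE column of the wide output should show the p3.2xlarge worker (a sketch):

$ oc get pod nvidia-smi -o wide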

View the pod logs to verify the kube-scheduler placed the pod on the GPU-enabled node.

$ oc logs nvidia-smi
Mon Sep 2 04:03:31 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26 Driver Version: 430.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 36C P0 25W / 300W | 0MiB / 16160MiB | 1% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Run the resnet50 tensorflow benchmark
Download and run a tensorflow pod from NGC.
NGC is Nvidia’s curated registry for GPU-optimized container images.
Downloading containers from NGC requires a login.
Register for NGC here: https://ngc.nvidia.com/catalog/landing

$ sudo podman login nvcr.io
Username: $oauthtoken
Password:
a3I4dXNyYWZtcmQwODlpaHFuMnU1aHRrdWo6MzY4YjFiMGYtMDY3OS00N2VjLTg3MzUtNzA4NWM0ZDI2Njk1

Tensorflow is an open source machine learning library. View the repo tags and labels on the tensorflow images on NGC.

$ skopeo inspect docker://nvcr.io/nvidia/tensorflow:18.02-py3 | jq '.Labels , .RepoTags'

Pull the latest tensorflow image.

$ sudo podman pull nvcr.io/nvidia/tensorflow:19.08-py3
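
Note that the cluster also needs credentials to pull from nvcr.io when the pod below is scheduled. One way to provide them, sketched here with placeholder names and a placeholder API key, is a docker-registry pull secret linked to the service account used by the pod:

$ oc -n openshift-sro create secret docker-registry ngc-pull-secret \
    --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=<NGC API key>
$ oc -n openshift-sro secrets link nvidia-device-plugin ngc-pull-secret --for=pull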

Create a pod definition file for the resnet50 model.
Resnet50 is a convolutional neural network model for image recognition. It is commonly used to benchmark GPU performance.

$ cat << EOF > tflow-resnet50.yaml
apiVersion: v1
kind: Pod
metadata:
  name: resnet50
  namespace: openshift-sro
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  serviceAccount: nvidia-device-plugin
  serviceAccountName: nvidia-device-plugin
  readOnlyRootFilesystem: true
  restartPolicy: OnFailure
  containers:
  - name: tensorflow-resnet50
    image: nvcr.io/nvidia/tensorflow:19.08-py3
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    - name: NVIDIA_REQUIRE_CUDA
      value: "cuda>=5.0"
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    command: [ "/bin/sh" ]
    args: [ "-c", "python nvidia-examples/cnn/resnet.py --layers=50 --precision=fp16"]
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
EOF

Create the resnet50 pod.

$ oc create -f tflow-resnet50.yaml

$ oc get pods | grep resnet
resnet50 1/1 Running 0 67s
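
The full 90-epoch run takes a while; while it trains, progress can be streamed with the follow flag (a sketch):

$ oc logs -f resnet50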

View the pod logs to see benchmark results.

$ oc logs resnet50 | tail -n 22
TF 1.14.0
Script arguments:
--layers 50
--batch_size 256
--num_iter 90
--iter_unit epoch
--display_every 10
--precision fp16
--use_xla False
--predict False
Training
Step Epoch Img/sec Loss LR
1 1.0 27.0 7.700 8.672 2.00000
10 10.0 344.4 3.991 4.963 1.62000
20 20.0 864.8 0.029 1.006 1.24469
30 30.0 865.3 0.000 0.973 0.91877
40 40.0 866.3 0.000 0.963 0.64222
50 50.0 865.9 0.000 0.954 0.41506
60 60.0 868.9 0.000 0.948 0.23728
70 70.0 868.0 0.000 0.944 0.10889
80 80.0 865.8 0.000 0.943 0.02988
90 90.0 700.8 0.000 0.943 0.00025

Resources
AWS Adds Nvidia GPUs: https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/
Accelerated Computing with Nvidia GPU on OpenShift: https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html
OpenShift 4 Architecture: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/architecture/index