https://github.com/dominodatalab/rayclusterscaler
Basic Ray Cluster Scaler (Beta)
- Host: GitHub
- URL: https://github.com/dominodatalab/rayclusterscaler
- Owner: dominodatalab
- Created: 2023-11-20T23:47:51.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-21T14:36:55.000Z (about 1 year ago)
- Last Synced: 2023-11-22T05:40:01.404Z (about 1 year ago)
- Language: Python
- Size: 14.6 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Ray Cluster Custom Scaler (Beta)
This is a basic Ray Cluster Scaler.

## Installation

If the `domino-field` namespace does not exist, create it with the commands below:
```shell
kubectl create namespace domino-field
kubectl label namespace domino-field domino-compute=true
kubectl label namespace domino-field domino-platform=true
```

Then install the Helm chart:

```shell
export field_namespace=domino-field
helm install -f ./values.yaml rayclusterscaler helm/rayclusterscaler -n ${field_namespace}
```
## Upgrade

```shell
export field_namespace=domino-field
helm upgrade -f ./values.yaml rayclusterscaler helm/rayclusterscaler -n ${field_namespace}
```

## Delete
```shell
export field_namespace=domino-field
helm delete rayclusterscaler -n ${field_namespace}
```

## Endpoints

The service at `http://rayclusterscaler-svc.domino-field/` provides the following endpoints:

1. GET `/raycluster/list` - Lists the Ray clusters owned by the caller, or all clusters if an admin invokes the endpoint.
2. GET `/raycluster/` - Gets the Ray cluster owned by the caller (or any cluster for an admin). Returns `403-Unauthorized` if the caller tries to retrieve a Ray cluster they are not permitted to fetch.
3. POST `/raycluster/scale` - Scales the Ray cluster. It takes the payload:
```json
{
  "cluster_name": "ray-....",
  "replicas": 5
}
```

Each of these endpoints can be authenticated from inside the workspace by passing a header as follows:

```python
import requests

# Fetch an access token from the workspace-local token endpoint
access_token_endpoint = 'http://localhost:8899/access-token'
resp = requests.get(access_token_endpoint)
token = resp.text

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + token,
}

# Example
endpoint = 'http://rayclusterscaler-svc.domino-field/rayclusterscaler/list'
resp = requests.get(endpoint, headers=headers)
```

Alternatively, you can pass the Domino API key (read from the `DOMINO_USER_API_KEY` environment variable) in the `X-Domino-Api-Key` header:

```python
import os
import requests

# Read the Domino API key from the workspace environment
domino_api_key = os.environ['DOMINO_USER_API_KEY']

headers = {
    "Content-Type": "application/json",
    "X-Domino-Api-Key": domino_api_key,
}

# Example
endpoint = 'http://rayclusterscaler-svc.domino-field/rayclusterscaler/list'
resp = requests.get(endpoint, headers=headers)
```

## Note on the scale endpoint
After calling the scale endpoint, follow up with the GET `/raycluster/` endpoint to verify that the cluster has
scaled. Check the `status` section of the returned JSON, which should look
something like this:

```json
"status": {
  "clusterStatus": "Running",
  "nodes": [
    "ray-655cb2de368ad4624b1e7d7b-ray-head-0",
    "ray-655cb2de368ad4624b1e7d7b-ray-worker-0",
    "ray-655cb2de368ad4624b1e7d7b-ray-worker-1",
    "ray-655cb2de368ad4624b1e7d7b-ray-worker-2",
    "ray-655cb2de368ad4624b1e7d7b-ray-worker-3",
    "ray-655cb2de368ad4624b1e7d7b-ray-worker-4"
  ],
  "startTime": "2023-11-21T13:38:58Z",
  "workerReplicas": 5,
  "workerSelector": "app.kubernetes.io/component=worker,app.kubernetes.io/instance=ray-655cb2de368ad4624b1e7d7b,app.kubernetes.io/name=ray"
}
```

Additionally, verify that the `spec.autoscaling` section reflects the scale correctly. For a scale request of `5` replicas, the
values of `minReplicas` and `maxReplicas` should be as below:

```json
"autoscaling": {
  "maxReplicas": 6,
  "minReplicas": 5
}
```

Lastly, the `spec.worker.replicas` attribute should be equal to the value of `minReplicas` (`5` in the above example).
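The checks described above (`status.workerReplicas`, `spec.autoscaling.minReplicas`, and `spec.worker.replicas`) can be sketched as a small helper. This is only an illustration: the function name `is_scaled` is hypothetical, and it assumes the GET `/raycluster/` response parses to a dictionary with top-level `spec` and `status` keys, which this README shows only as fragments.

```python
def is_scaled(cluster: dict, replicas: int) -> bool:
    """Return True if the cluster JSON reflects the requested worker count.

    Assumption: `cluster` is the parsed response body with top-level
    `spec` and `status` sections.
    """
    spec = cluster.get("spec", {})
    status = cluster.get("status", {})
    autoscaling = spec.get("autoscaling", {})
    return (
        status.get("workerReplicas") == replicas
        and autoscaling.get("minReplicas") == replicas
        and spec.get("worker", {}).get("replicas") == replicas
    )


# Example built from the values shown in this README
example = {
    "spec": {
        "autoscaling": {"minReplicas": 5, "maxReplicas": 6},
        "worker": {"replicas": 5},
    },
    "status": {"clusterStatus": "Running", "workerReplicas": 5},
}
print(is_scaled(example, 5))  # True
```

The helper deliberately ignores `maxReplicas`, since the README gives only one example value for it rather than a general rule.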
> Also review the Ray Cluster UI to verify that every worker has joined the cluster before launching the
> job. The `RayCluster` CRD only indicates running pods. Only the Ray Cluster UI (or API) confirms that the workers
> have joined the cluster.

***WARNING/ALERT***

Remember to scale down the Ray cluster when you have finished. The value of
`minReplicas` prevents the cluster from scaling down on its own, so call the POST endpoint `/raycluster/scale` with the following payload:
```json
{
  "cluster_name": "ray-....",
  "replicas": 1
}
```
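
Because forgetting the scale-down call leaves workers running, one option is to wrap the job in a context manager that always scales back down, even if the job fails. The sketch below is an assumption-laden illustration, not part of this service: `post_scale` and `scaled_cluster` are hypothetical names, it uses the standard library `urllib` rather than `requests`, and it treats the response body as opaque text because its format is not documented here. The `headers` argument is one of the header dictionaries from the authentication examples.

```python
import json
import urllib.request
from contextlib import contextmanager

# Service URL taken from the Endpoints section of this README
BASE_URL = "http://rayclusterscaler-svc.domino-field"


def post_scale(cluster_name, replicas, headers, base_url=BASE_URL):
    """POST the scale payload to /raycluster/scale and return the raw body."""
    body = json.dumps({"cluster_name": cluster_name, "replicas": replicas}).encode()
    req = urllib.request.Request(
        base_url + "/raycluster/scale", data=body, headers=headers, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()


@contextmanager
def scaled_cluster(cluster_name, replicas, headers, scale=post_scale, idle_replicas=1):
    """Scale up on entry and always scale back to idle_replicas on exit."""
    scale(cluster_name, replicas, headers)
    try:
        yield
    finally:
        # Runs even if the job raises, so the cluster is not left scaled up
        scale(cluster_name, idle_replicas, headers)
```

A job would then run inside `with scaled_cluster("ray-....", 5, headers):`, after confirming in the Ray Cluster UI that the workers have joined. The `scale` parameter exists so the HTTP call can be swapped for a stub when testing.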