https://github.com/cscfi/spark-openshift
Run Apache Spark on Openshift
- Host: GitHub
- URL: https://github.com/cscfi/spark-openshift
- Owner: CSCfi
- License: mit
- Created: 2018-08-16T12:10:21.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-06-13T08:03:01.000Z (about 3 years ago)
- Last Synced: 2025-01-07T12:25:30.714Z (over 1 year ago)
- Language: Dockerfile
- Size: 10.2 MB
- Stars: 9
- Watchers: 4
- Forks: 9
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# spark-openshift
Run Apache Spark on Openshift. Based on https://github.com/Uninett/helm-charts
## Quickstart:
The template provisions a Spark cluster on OpenShift based on the configuration you supply.
**NOTE: Make sure you create a new OpenShift project to run this. It is also recommended that you have only one cluster per OpenShift project.**
*If you are looking for instructions on how to install custom Spark libraries on top of your Spark cluster, [click here](https://github.com/CSCfi/spark-openshift/blob/master/installing_libraries.md)*
The template creates three deployments:
1. **Spark Master**: Serves as the master of the cluster and runs Spark UI
2. **Spark Worker**: Runs worker instances in the cluster
3. **Jupyter Notebook**: Serves as the Spark Driver, where one writes the code and submits it to the master
### Variables:
Listed below are some of the variables that should be changed.
**NOTE: To avoid errors, change the CPU and memory values only after checking the quota allocated to your OpenShift project.** If needed, increase the quota or ask the admins to increase it for you.
#### Mandatory Values:
- **Cluster Name**: Unique identifier for your cluster
- **Username**: Username for authenticating and logging into your Spark cluster and Jupyter (Recommended: create a new username, don't use any existing one)
- **Password**: Password for authenticating and logging into your Spark cluster and Jupyter (Recommended: create a new password, don't use any existing one)
- **Worker Replicas**: Number of workers to have (Default: 4)
- **Storage Size**: Persistent storage volume size (Default: 10Gi)
#### Optional Values:
- **Enable Jupyter Lab**: Specify whether you want to use Jupyter Lab instead of the default Jupyter Notebook (Default: false)
- **Master CPU**: Number of cores for the master node of the cluster
- **Master Memory**: Memory for the master node of the cluster
- **Worker CPU**: Number of cores for each worker of the cluster (Default: 2)
- **Worker Memory**: Memory of each worker of the cluster (Default: 4G)
- **Executor Default Cores**: Default value for Spark Executor Cores (see the official Spark documentation for more) (Default: 2)
- **Executor Default Memory**: Default value for Spark Executor Memory (**Should always be less than the Worker memory!**) (Default: 3G)
- **Driver CPU**: Number of cores for the driver (Jupyter Notebook)
- **Driver Memory**: Memory of the driver (Jupyter Notebook)
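The constraint that Executor Default Memory must stay below Worker Memory can be checked before processing the template. The following is a minimal sketch, not part of the template: the function names `to_mib` and `executor_fits` are hypothetical, and it assumes sizes are written with an `M` or `G` suffix, as in the defaults above.

```sh
# Convert a size such as "4G" or "512M" to MiB.
# Assumption: only M and G suffixes are handled; anything else is rejected.
to_mib() {
  value=${1%[GgMm]}     # numeric part, e.g. "4"
  unit=${1#"$value"}    # suffix, e.g. "G"
  case "$value" in
    ''|*[!0-9]*) echo "invalid size '$1'" >&2; return 1 ;;
  esac
  case "$unit" in
    G|g) echo $((value * 1024)) ;;
    M|m) echo "$value" ;;
    *)   echo "unsupported unit in '$1'" >&2; return 1 ;;
  esac
}

# Succeed only if executor memory is strictly smaller than worker memory.
# Usage: executor_fits EXECUTOR_MEMORY WORKER_MEMORY
executor_fits() {
  executor=$(to_mib "$1") || return 1
  worker=$(to_mib "$2") || return 1
  [ "$executor" -lt "$worker" ]
}
```

For example, `executor_fits 3G 4G` succeeds for the default values, while `executor_fits 4G 4G` fails because the executor would need all of the worker's memory.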
#### Do not change the following variables, unless you know what you're doing
- **Master Image**: Docker Image for the Master
- **Worker Image**: Docker Image for the Worker
- **Driver Image**: Docker Image for the Driver
- **Application Hostname Suffix**: The exposed hostname suffix that will be used to create routes for Spark UI and Jupyter Notebook
*NOTE: The template assumes that the requests and the limits are the same for all the containers. If you wish to have different limits, it is recommended to edit the template.*
## If running through the Command line:
* Download the `oc` client for OpenShift
* `oc login`
* `oc new-project `
* `oc process -f spark-template.yml -p CLUSTER_NAME="cluster_name" -p USERNAME="username" -p PASSWORD="password" | oc apply -f -`
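The steps above can be wrapped in a small shell function so that the project and cluster are created in one call. This is a sketch under assumptions: `deploy_spark` is a hypothetical helper name, it expects `oc login` to have been run already, and it expects `spark-template.yml` to be in the current directory.

```sh
# deploy_spark PROJECT CLUSTER_NAME USERNAME PASSWORD
# Creates a new project, then processes the template and applies the result.
# Assumes you are already logged in with `oc login`.
deploy_spark() {
  if [ "$#" -ne 4 ]; then
    echo "usage: deploy_spark PROJECT CLUSTER_NAME USERNAME PASSWORD" >&2
    return 2
  fi
  # Stop early if the project cannot be created (for example, name already taken).
  oc new-project "$1" || return 1
  # Only the three parameters shown in the list above are passed here.
  oc process -f spark-template.yml \
    -p CLUSTER_NAME="$2" -p USERNAME="$3" -p PASSWORD="$4" |
    oc apply -f -
}
```

To pass any of the other variables, `oc process --parameters -f spark-template.yml` prints the parameter names the template accepts.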
### Adding more workers
By default, the template deploys 4 workers. If you already know at deployment time that you will need a different number, set `WORKER_REPLICAS` when processing the template:
```sh
oc process -f spark-template.yml -p CLUSTER_NAME="cluster_name" -p USERNAME="username" -p PASSWORD="password" -p WORKER_REPLICAS="x" | oc apply -f -
```
If you need more or fewer workers after the deployment, use this command to increase or decrease the number of worker pods:
```sh
oc scale dc/ --replicas=x
```
You can list your DeploymentConfigs with this command:
```sh
oc get dc
```
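A small guard around the scale command can catch a mistyped replica count before it reaches the cluster. The sketch below is illustrative only: `scale_workers` is a hypothetical helper, and the DeploymentConfig name is whatever `oc get dc` reports for your worker deployment.

```sh
# scale_workers DC_NAME REPLICAS
# Validates the arguments, then scales the given DeploymentConfig.
scale_workers() {
  dc_name=$1
  replicas=$2
  if [ -z "$dc_name" ]; then
    echo "usage: scale_workers DC_NAME REPLICAS" >&2
    return 2
  fi
  # Accept only non-negative integers for the replica count.
  case "$replicas" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer, got '$replicas'" >&2
      return 2 ;;
  esac
  oc scale "dc/$dc_name" --replicas="$replicas"
}
```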
### Deleting
* `oc delete all -l app=spark`
* `oc delete configmap -l app=spark`
* `oc delete secret -l app=spark`
* You might also want to delete the persistent volume claim created by the setup by typing `oc delete pvc -l app=spark`
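The delete commands above can be combined into one helper. This is a sketch, assuming the `app=spark` label shown above; the function name `cleanup_spark` and the `keep-pvc` switch are hypothetical conveniences, not part of the repository.

```sh
# cleanup_spark [LABEL] [keep-pvc]
# Deletes the resources carrying LABEL (default: app=spark).
# Pass "keep-pvc" as the second argument to leave the persistent volume claim
# (and the data stored on it) in place.
cleanup_spark() {
  label=${1:-app=spark}
  kinds="all configmap secret"
  [ "${2:-}" = "keep-pvc" ] || kinds="$kinds pvc"
  for kind in $kinds; do
    oc delete "$kind" -l "$label" || return 1
  done
}
```

Calling `cleanup_spark` with no arguments runs all four deletions; `cleanup_spark app=spark keep-pvc` skips the last one.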
### Adding more storage from OpenShift UI
From the OpenShift console:
* Open Storage -> Create Storage, fill in the required fields, and press the Create button.
* Go to Application -> Deployments. For each of the items, open the Configuration tab -> Add Storage, fill in the desired Mount Path (e.g. **/mnt/data**), type a **Volume Name** or leave it empty to have one generated automatically, and press Add.
A redeployment starts automatically after each change. Once the steps above have been repeated for all items, the new PVC is mounted in the application.
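
The README only describes the console route. As a possible command-line equivalent of the Add Storage step, `oc set volume` can attach an existing claim to each DeploymentConfig. The sketch below assumes the claim has already been created; `mount_claim` is a hypothetical helper, and the claim and DeploymentConfig names are placeholders you supply.

```sh
# mount_claim CLAIM MOUNT_PATH DC_NAME...
# Mounts an existing persistent volume claim into each listed DeploymentConfig.
# Assumption: CLAIM already exists in the current project.
mount_claim() {
  if [ "$#" -lt 3 ]; then
    echo "usage: mount_claim CLAIM MOUNT_PATH DC_NAME..." >&2
    return 2
  fi
  claim=$1
  mount_path=$2
  shift 2
  # No --name is given, so the volume name is generated automatically.
  for dc_name in "$@"; do
    oc set volume "dc/$dc_name" --add \
      --type=persistentVolumeClaim \
      --claim-name="$claim" \
      --mount-path="$mount_path" || return 1
  done
}
```

As with the console steps, each change is expected to trigger a redeployment of the affected DeploymentConfig.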