{"id":19402780,"url":"https://github.com/cscfi/spark-openshift","last_synced_at":"2025-10-13T23:03:53.095Z","repository":{"id":56159928,"uuid":"144982768","full_name":"CSCfi/spark-openshift","owner":"CSCfi","description":"Run Apache Spark on Openshift","archived":false,"fork":false,"pushed_at":"2023-06-13T08:03:01.000Z","size":10727,"stargazers_count":9,"open_issues_count":5,"forks_count":9,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-07T12:25:30.714Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CSCfi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-16T12:10:21.000Z","updated_at":"2024-06-28T15:22:50.000Z","dependencies_parsed_at":"2024-11-10T11:26:26.670Z","dependency_job_id":"88a78c6f-05d6-4382-9e8c-5739da00d401","html_url":"https://github.com/CSCfi/spark-openshift","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CSCfi%2Fspark-openshift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CSCfi%2Fspark-openshift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CSCfi%2Fspark-openshift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CSCfi%2Fspark-openshift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CSCfi","download_url":"https://codeload.github.com/CSCfi/spark
-openshift/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240577839,"owners_count":19823527,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T11:25:48.610Z","updated_at":"2025-10-13T23:03:48.061Z","avatar_url":"https://github.com/CSCfi.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark-openshift\nRun Apache Spark on OpenShift. Based on https://github.com/Uninett/helm-charts\n\n## Quickstart:\n\nThe template provisions a Spark cluster on OpenShift, based on the configuration you provide. \n**NOTE: Make sure you create a new OpenShift project to run this. It is also recommended that you have only one cluster per OpenShift project.**\n\n*If you are looking for instructions on how to install custom Spark libraries on top of your Spark cluster, [click here](https://github.com/CSCfi/spark-openshift/blob/master/installing_libraries.md)*\n\nThe template creates three deployments:\n1. **Spark Master**: Serves as the master of the cluster and runs Spark UI\n2. **Spark Worker**: Runs worker instances in the cluster\n3. 
**Jupyter Notebook**: Serves as the Spark Driver, where one writes the code and submits it to the master\n\n### Variables:\n\nListed below are some of the variables that should be changed.\n\n**Please note: To avoid errors, change the CPU and memory values only after checking the quota allocated to your OpenShift project.** If needed, increase the quota or ask the admins to increase it for you.\n\n#### Mandatory Values:\n- **Cluster Name**: Unique identifier for your cluster\n- **Username**: Username for authenticating and logging into your Spark cluster and Jupyter (Recommended: create a new username, don't use any existing one)\n- **Password**: Password for authenticating and logging into your Spark cluster and Jupyter (Recommended: create a new password, don't use any existing one)\n- **Worker Replicas**: Number of workers to have (Default: 4)\n\n- **Storage Size**: Persistent storage volume size (Default: 10Gi)\n\n#### Optional Values:\n- **Enable Jupyter Lab**: Specify whether you want to use Jupyter Lab instead of the default Jupyter Notebook (Default: false) \n- **Master CPU**: Number of cores for the master node of the cluster\n- **Master Memory**: Memory for the master node of the cluster\n- **Worker CPU**: Number of cores for each worker of the cluster (Default: 2)\n- **Worker Memory**: Memory of each worker of the cluster (Default: 4G)\n\n- **Executor Default Cores**: Default value for Spark Executor Cores (See the official Spark documentation for more) (Default: 2)\n- **Executor Default Memory**: Default value for Spark Executor Memory (**Should always be less than the Worker memory!**) (Default: 3G)\n\n- **Driver CPU**: Number of cores for the driver (Jupyter Notebook)\n- **Driver Memory**: Memory of the driver (Jupyter Notebook)\n\n#### Do not change the following variables unless you know what you're doing\n- **Master Image**: Docker Image for the Master\n- **Worker Image**: Docker Image for the 
Worker \n- **Driver Image**: Docker Image for the Driver \n- **Application Hostname Suffix**: The exposed hostname suffix that will be used to create routes for Spark UI and Jupyter Notebook\n\n*NOTE: The template assumes that the requests and the limits are the same for all the containers. If you want different limits, it's recommended to edit the template.*\n\n\n## If running through the Command line:\n\n* Download the `oc` client for OpenShift\n* `oc login`\n* `oc new-project \u003cproject-name\u003e`\n* `oc process -f spark-template.yml -p CLUSTER_NAME=\"cluster_name\" -p USERNAME=\"username\" -p PASSWORD=\"password\" | oc apply -f -`\n\n### Adding more workers\nBy default, the template will deploy 4 workers. If you know that you will need more than 4 at the beginning, you can use this command:  \n```sh\noc process -f spark-template.yml -p CLUSTER_NAME=\"cluster_name\" -p USERNAME=\"username\" -p PASSWORD=\"password\" -p WORKER_REPLICAS=\"x\" | oc apply -f -\n```\n\nIf you need more or fewer workers after the deployment, you can use this command to increase or decrease the number of worker pods:  \n```sh\noc scale dc/\u003cyour_deployment_name\u003e --replicas=x\n```\n\nYou can list your DeploymentConfigs with this command:\n```sh\noc get dc\n```\n\n### Deleting\n\n* `oc delete all -l app=spark`\n* `oc delete configmap -l app=spark`\n* `oc delete secret -l app=spark`\n* You might also want to delete the persistent volume claim created by the setup by typing `oc delete pvc -l app=spark`\n\n### Adding more storage from OpenShift UI\nFrom the OpenShift console:\n* Open Storage -\u003e Create Storage -\u003e fill in the required fields and press the Create button.\n* Application -\u003e Deployments -\u003e for each item, go to the Configuration tab -\u003e Add Storage -\u003e fill in the desired Mount Path, e.g. 
**/mnt/data** -\u003e type a **Volume Name** or leave it empty to have one generated automatically -\u003e press Add.\n\nAn automatic redeployment starts. After you repeat the steps above for every item, the new PVC is mounted in the application.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcscfi%2Fspark-openshift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcscfi%2Fspark-openshift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcscfi%2Fspark-openshift/lists"}