{"id":13742656,"url":"https://github.com/radanalyticsio/spark-operator","last_synced_at":"2025-05-07T15:06:36.460Z","repository":{"id":43902123,"uuid":"133836049","full_name":"radanalyticsio/spark-operator","owner":"radanalyticsio","description":"Operator for managing the Spark clusters on Kubernetes and OpenShift.","archived":false,"fork":false,"pushed_at":"2021-11-18T14:01:48.000Z","size":3558,"stargazers_count":157,"open_issues_count":63,"forks_count":60,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-05-07T15:06:27.060Z","etag":null,"topics":["apache-spark","kubernetes","kubernetes-operator","openshift","spark"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/radanalyticsio.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null}},"created_at":"2018-05-17T15:48:59.000Z","updated_at":"2025-05-06T03:04:36.000Z","dependencies_parsed_at":"2022-09-14T13:22:34.137Z","dependency_job_id":null,"html_url":"https://github.com/radanalyticsio/spark-operator","commit_stats":null,"previous_names":["jvm-operators/spark-operator"],"tags_count":28,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radanalyticsio%2Fspark-operator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radanalyticsio%2Fspark-operator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radanalyticsio%2Fspark-operator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/radanalyticsio%2Fsp
ark-operator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/radanalyticsio","download_url":"https://codeload.github.com/radanalyticsio/spark-operator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252902614,"owners_count":21822261,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","kubernetes","kubernetes-operator","openshift","spark"],"created_at":"2024-08-03T05:00:34.655Z","updated_at":"2025-05-07T15:06:36.437Z","avatar_url":"https://github.com/radanalyticsio.png","language":"Java","funding_links":[],"categories":["Repository is obsolete"],"sub_categories":["Awesome Operators in the Wild"],"readme":"# spark-operator\n\n[![Build status](https://travis-ci.org/radanalyticsio/spark-operator.svg?branch=master)](https://travis-ci.org/radanalyticsio/spark-operator)\n[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0)\n\n`{CRD|ConfigMap}`-based approach for managing the Spark clusters in Kubernetes and OpenShift.\n\n\u003c!--\nasciinema rec -i 3\ndocker run -\\-rm -v $PWD:/data asciinema/asciicast2gif -s 1.18 -w 104 -h 27 -t monokai 189204.cast demo.gif\n--\u003e\n[![Watch the full asciicast](https://github.com/radanalyticsio/spark-operator/raw/master/docs/ascii.gif)](https://asciinema.org/a/230927?\u0026cols=123\u0026rows=27\u0026theme=monokai)\n\n# How does it work\n![UML diagram](https://github.com/radanalyticsio/spark-operator/raw/master/docs/standardized-UML-diagram.png \"UML Diagram\")\n\n# Quick 
Start\n\nRun the `spark-operator` deployment: _Remember to change the `namespace` variable for the `ClusterRoleBinding` before doing this step_\n```bash\nkubectl apply -f manifest/operator.yaml\n```\n\nCreate a new cluster from the prepared example:\n\n```bash\nkubectl apply -f examples/cluster.yaml\n```\n\nAfter issuing the commands above, you should be able to see a new Spark cluster running in the current namespace.\n\n```bash\nkubectl get pods\nNAME                               READY     STATUS    RESTARTS   AGE\nmy-spark-cluster-m-5kjtj           1/1       Running   0          10s\nmy-spark-cluster-w-m8knz           1/1       Running   0          10s\nmy-spark-cluster-w-vg9k2           1/1       Running   0          10s\nspark-operator-510388731-852b2     1/1       Running   0          27s\n```\n\nOnce you no longer need the cluster, you can delete it by deleting the custom resource:\n```bash\nkubectl delete sparkcluster my-spark-cluster\n```\n\n# Very Quick Start\n\n```bash\n# create operator\nkubectl apply -f http://bit.ly/sparkop\n\n# create cluster\ncat \u003c\u003cEOF | kubectl apply -f -\napiVersion: v1\nkind: SparkCluster\nmetadata:\n  name: my-cluster\nspec:\n  worker:\n    instances: \"2\"\nEOF\n```\n\n# Limits and requests for cpu and memory in SparkCluster pods\n\nThe operator supports multiple fields for setting limit and request values for master and worker pods.\nYou can see these being used in the *examples/test* directory.\n\n* *cpu* and *memory* specify both limit _and_ request values for cpu and memory (that is, limits and requests will be equal).\n  This was the first mechanism provided for setting limits and requests and has been retained for backward compatibility.\n  However, requests and limits sometimes need to be set individually.\n\n* *cpuRequest* and *memoryRequest* set request values and take precedence over values from *cpu* and *memory* respectively\n\n* *cpuLimit* and *memoryLimit* set limit values and take 
precedence over values taken from *cpu* and *memory* respectively\n\n# Node Tolerations for SparkCluster pods\n\nThe operator supports specifying [Kubernetes node tolerations](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration)\nwhich will be applied to all master and worker pods in a Spark cluster.\nYou can see examples of this in use in the *examples/test* directory.\n\n* *nodeTolerations* specifies a list of node toleration definitions that should\n  be applied to all master and worker pods.\n\n## Spark Applications\n\nApart from managing clusters with Apache Spark, this operator can also manage Spark applications, similarly to the `GoogleCloudPlatform/spark-on-k8s-operator`. These applications spawn their own Spark cluster for their needs and use Kubernetes as the native scheduling mechanism for Spark. For more details, consult the [Spark docs](https://spark.apache.org/docs/latest/running-on-kubernetes.html).\n\n```bash\n# create spark application\ncat \u003c\u003cEOF | kubectl apply -f -\napiVersion: v1\nkind: SparkApplication\nmetadata:\n  name: my-cluster\nspec:\n  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar\n  mainClass: org.apache.spark.examples.SparkPi\nEOF\n```\n\n### OpenShift\n\nFor deployment on OpenShift, use the same commands as above (with `oc` instead of `kubectl` if `kubectl` is not installed) and make sure the logged-in user can create CRDs: `oc login -u system:admin \u0026\u0026 oc project default`\n\n### Config Map approach\n\nThis operator can also work with Config Maps instead of CRDs. This can be useful when the user is not allowed to create CRDs or `ClusterRoleBinding` resources. The schema for config maps is almost identical to that of custom resources; see the [examples](./examples/test/cm).\n\n```bash\nkubectl apply -f manifest/operator-cm.yaml\n```\n\nThe manifest above is almost the same as the [operator.yaml](./manifest/operator.yaml). 
If the environment variable `CRD` is set to `false`, the operator will watch config maps with certain labels.\n\nYou can then create the Spark clusters as usual by creating the config map (CM).\n\n```bash\nkubectl apply -f examples/cluster-cm.yaml\nkubectl get cm -l radanalytics.io/kind=SparkCluster\n```\n\nor Spark applications that are natively scheduled on Spark clusters by:\n\n```bash\nkubectl apply -f examples/test/cm/app.yaml\nkubectl get cm -l radanalytics.io/kind=SparkApplication\n```\n\n### Images\n\nImage name         | Description | Layers | quay.io | docker.io\n------------------ | ----------- | ------ | ------- | ----------\n`:latest-released` | represents the latest released version | [![Layers info](https://images.microbadger.com/badges/image/radanalyticsio/spark-operator:latest-released.svg)](https://microbadger.com/images/radanalyticsio/spark-operator:latest-released) | [![quay.io repo](https://quay.io/repository/radanalyticsio/spark-operator/status \"quay.io repo\")](https://quay.io/repository/radanalyticsio/spark-operator?tab=tags) | [![docker.io repo](https://img.shields.io/docker/pulls/radanalyticsio/spark-operator.svg \"docker.io repo\")](https://hub.docker.com/r/radanalyticsio/spark-operator/tags/)\n`:latest`          | represents the master branch | [![Layers info](https://images.microbadger.com/badges/image/radanalyticsio/spark-operator:latest.svg)](https://microbadger.com/images/radanalyticsio/spark-operator:latest) |  | \n`:x.y.z`           | one particular released version | [![Layers info](https://images.microbadger.com/badges/image/radanalyticsio/spark-operator:0.1.5.svg)](https://microbadger.com/images/radanalyticsio/spark-operator:0.1.5) |  | \n\nFor each variant there is also an image with the `-alpine` suffix, based on Alpine, for instance [![Layers 
info](https://images.microbadger.com/badges/image/radanalyticsio/spark-operator:latest-released-alpine.svg)](https://microbadger.com/images/radanalyticsio/spark-operator:latest-released-alpine)\n\n### Configuring the operator\n\nThe spark-operator contains several defaults that are implicit to the creation\nof Spark clusters and applications. Here is a list of environment variables\nthat can be set to adjust the default behaviors of the operator.\n\n* `CRD` set to `true` if the operator should respond to Custom\n  Resources, and set to `false` if it should respond to ConfigMaps.\n* `DEFAULT_SPARK_CLUSTER_IMAGE` a container image reference that will be used\n  as a default for all pods in a `SparkCluster` deployment when the image is\n  not specified in the cluster manifest.\n* `DEFAULT_SPARK_APP_IMAGE` a container image reference that will be used as a\n  default for all executor pods in a `SparkApplication` deployment when the\n  image is not specified in the application manifest.\n\n_Please note that these environment variables must be set in the operator's\ncontainer, see [operator.yaml](manifest/operator.yaml) and\n[operator-cm.yaml](manifest/operator-cm.yaml) for operator deployment information._\n\n### Related projects\n\nIf you are looking for tooling to make interacting with the spark-operator\nmore convenient, please see the following.\n\n* [Ansible role](https://github.com/jvm-operators/ansible-openshift-spark-operator) is a simple way to\ndeploy the Spark operator using the Ansible ecosystem. The role is also [available](https://galaxy.ansible.com/jiri_kremser/spark_operator) in Ansible Galaxy.\n\n* [oshinko-temaki](https://pypi.org/project/oshinko-temaki/) is a shell\n  application for generating `SparkCluster` manifest definitions. 
It can\n  produce full schema manifests from a few simple command line flags.\n\nFor checking and verifying that your own container image will work smoothly with the operator\nuse the following tool.\n\n* [soit](https://pypi.org/project/soit/) is a CLI tool that runs a set of tests against the \ngiven image to verify if it contains the right files on the file system, \nif worker can register with master, etc. Check the code in the \n[repository](https://github.com/Jiri-Kremser/spark-operator-image-tool).\n\nThe radanalyticsio/spark-operator is not the only Kubernetes operator service\nthat targets Apache Spark.\n\n* [GoogleCloudPlatform/spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)\n  is an operator which shares a similar schema for the Spark cluster and application\n  resources. One major difference between it and the `radanalyticsio/spark-operator`\n  is that the latter has been designed to work well in environments where a\n  user has a limited role-based access to Kubernetes, such as on OpenShift and also that\n  `radanalyticsio/spark-operator` can deploy standalone Spark clusters.\n\n### Operator Marketplace\n\nIf you would like to install the operator into OpenShift (since 4.1) using the [Operator Marketplace](https://github.com/operator-framework/operator-marketplace), simply run:\n\n```bash\ncat \u003c\u003cEOF | kubectl apply -f -\napiVersion: operators.coreos.com/v1\nkind: OperatorSource\nmetadata:\n  name: radanalyticsio-operators\n  namespace: openshift-marketplace\nspec:\n  type: appregistry\n  endpoint: https://quay.io/cnr\n  registryNamespace: radanalyticsio\n  displayName: \"Operators from radanalytics.io\"\n  publisher: \"Jirka Kremser\"\nEOF\n```\n\nYou will find the operator in the OpenShift web console under `Catalog \u003e OperatorHub` (make sure the namespace is set to `openshift-marketplace`).\n\n### Troubleshooting\n\nShow the log:\n\n```bash\n# last 25 log entries\nkubectl logs --tail 25 -l 
app.kubernetes.io/name=spark-operator\n```\n\n```bash\n# follow logs\nkubectl logs -f `kubectl get pod -l app.kubernetes.io/name=spark-operator -o='jsonpath=\"{.items[0].metadata.name}\"' | sed 's/\"//g'`\n```\n\nRun the operator from your host (also possible with the debugger/profiler):\n\n```bash\njava -jar target/spark-operator-*.jar\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fradanalyticsio%2Fspark-operator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fradanalyticsio%2Fspark-operator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fradanalyticsio%2Fspark-operator/lists"}