{"id":16199408,"url":"https://github.com/bbenzikry/spark-eks","last_synced_at":"2025-03-19T05:30:49.023Z","repository":{"id":95865116,"uuid":"290907814","full_name":"bbenzikry/spark-eks","owner":"bbenzikry","description":"Examples and custom spark images for working with the spark-on-k8s operator on AWS","archived":false,"fork":false,"pushed_at":"2021-02-14T01:56:38.000Z","size":391,"stargazers_count":27,"open_issues_count":2,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-17T04:05:46.997Z","etag":null,"topics":["aws","docker","dockerfile","eks","eks-cluster","glue-catalog","kubernetes","kubernetes-operator","metastore","spark"],"latest_commit_sha":null,"homepage":"","language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bbenzikry.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-27T23:59:11.000Z","updated_at":"2023-06-29T10:18:03.000Z","dependencies_parsed_at":"2023-08-31T20:45:21.231Z","dependency_job_id":null,"html_url":"https://github.com/bbenzikry/spark-eks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bbenzikry%2Fspark-eks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bbenzikry%2Fspark-eks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bbenzikry%2Fspark-eks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bbenzikry%2Fspark-eks/manifes
ts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bbenzikry","download_url":"https://codeload.github.com/bbenzikry/spark-eks/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244364668,"owners_count":20441458,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","docker","dockerfile","eks","eks-cluster","glue-catalog","kubernetes","kubernetes-operator","metastore","spark"],"created_at":"2024-10-10T09:25:18.490Z","updated_at":"2025-03-19T05:30:49.014Z","avatar_url":"https://github.com/bbenzikry.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark-on-eks\n\n\u003c!-- markdownlint-disable MD033 --\u003e\n\u003ccenter\u003e\n\u003ca href=\"#\"\u003e\n\u003cimg src=\"https://user-images.githubusercontent.com/1993348/91601148-d0b01b80-e971-11ea-9903-6299b2396499.png\" width=\"100%\" height=\"50%\"\u003e\n\u003c/a\u003e\n\nExamples and custom spark images for working with the spark-on-k8s operator on AWS.\n\nAllows using Spark 2 with IRSA and Spark 3 with IRSA and AWS Glue as a metastore.\n\nNote: Spark 3 images also include relevant jars for working with the [S3A committers](https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html)\n\nIf you're looking for the Spark 3 custom distributions, you can find them [here](https://github.com/bbenzikry/spark-glue/releases)\n\n**Note**: Spark 2 images will not be updated, please see the 
[FAQ](#faq)\n\n---\n\n[![operator](https://img.shields.io/docker/cloud/build/bbenzikry/spark-eks-operator?style=plastic\u0026label=operator)](https://hub.docker.com/r/bbenzikry/spark-eks-operator)\n[![spark-eks](https://img.shields.io/docker/cloud/build/bbenzikry/spark-eks?style=plastic\u0026label=spark-eks)](https://hub.docker.com/r/bbenzikry/spark-eks)\n\n\n\u003c/center\u003e\n\n## Prerequisites\n\n- Deploy [spark-on-k8s operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) using the [helm chart](https://github.com/helm/charts/tree/master/incubator/sparkoperator) and the [patched operator](https://github.com/bbenzikry/spark-on-k8s-operator/tree/hive-subpath) image `bbenzikry/spark-eks-operator:latest`\n\nSuggested values for the helm chart can be found in the [flux](./flux/operator.yaml) example.\n\n\u003e Note: Do not create the spark service account automatically as part of chart use.\n\n## Using IAM roles for service accounts on EKS\n\n### Creating roles and service account\n\n- Create an AWS role for the driver\n- Create an AWS role for executors\n\n\u003e [AWS docs on creating policies and roles](https://docs.aws.amazon.com/eks/latest/userguide/create-service-account-iam-policy-and-role.html)\n\n- Add default service account EKS role for executors in your spark job namespace (optional)\n\n```yaml\n# NOTE: Only required when not building spark from source or using a version of spark \u003c 3.1. In 3.1, executor roles will rely on the driver definition. 
At the moment they execute with the default service account.\napiVersion: v1\nkind: ServiceAccount\nmetadata:\n  name: default\n  namespace: SPARK_JOB_NAMESPACE\n  annotations:\n    # can also be the driver role\n    eks.amazonaws.com/role-arn: \"arn:aws:iam::ACCOUNT_ID:role/executor-role\"\n```\n\n- Make sure spark service account ( used by driver pods ) is configured to an EKS role as well\n\n```yaml\napiVersion: v1\nkind: ServiceAccount\nmetadata:\n  name: spark\n  namespace: SPARK_JOB_NAMESPACE\n  annotations:\n    eks.amazonaws.com/role-arn: \"arn:aws:iam::ACCOUNT_ID:role/driver-role\"\n```\n\n### Building a compatible image\n\n- For spark \u003c 3.0.0, see [spark2.Dockerfile](./docker/spark2.Dockerfile)\n\n- For spark 3.0.0+, see [spark3.Dockerfile](./docker/spark3.Dockerfile)\n\n- For pyspark, see [pyspark.Dockerfile](./docker/pyspark.Dockerfile)\n\n### Submit your spark application with IRSA support\n\n#### Select the right implementation for you\n\n\u003e Below are examples for latest versions.\n\u003e\n\u003e If you want to use pinned versions, all images are tagged by the commit SHA.\n\u003e\n\u003e You can find a full list of tags [here](https://hub.docker.com/repository/docker/bbenzikry/spark-eks/tags)\n\n```dockerfile\n# spark2\nFROM bbenzikry/spark-eks:spark2-latest\n# spark3\nFROM bbenzikry/spark-eks:spark3-latest\n# pyspark2\nFROM bbenzikry/spark-eks:pyspark2-latest\n# pyspark3\nFROM bbenzikry/spark-eks:pyspark3-latest\n```\n\n#### Submit your SparkApplication spec\n\n```yaml\nhadoopConf:\n  # IRSA configuration\n  \"fs.s3a.aws.credentials.provider\": \"com.amazonaws.auth.WebIdentityTokenCredentialsProvider\"\ndriver:\n  .....\n  labels:\n    .....\n  serviceAccount: SERVICE_ACCOUNT_NAME\n\n  # See: https://github.com/kubernetes/kubernetes/issues/82573\n  # Note: securityContext has changed in recent versions of the operator to podSecurityContext\n  podSecurityContext:\n    fsGroup: 65534\n```\n\n### Working with AWS Glue as metastore\n\n#### Glue 
Prerequisites\n\n- Make sure your driver and executor roles have the relevant glue permissions\n\n```json5\n{\n  /* \n  Example below depicts the IAM policy for accessing db1/table1.\n  Modify this as you deem worthy for spark application access.\n  */\n\n  Effect: \"Allow\",\n  Action: [\"glue:*Database*\", \"glue:*Table*\", \"glue:*Partition*\"],\n  Resource: [\n    \"arn:aws:glue:us-west-2:123456789012:catalog\",\n    \"arn:aws:glue:us-west-2:123456789012:database/db1\",\n    \"arn:aws:glue:us-west-2:123456789012:table/db1/table1\",\n\n    \"arn:aws:glue:eu-west-1:123456789012:database/default\",\n    \"arn:aws:glue:eu-west-1:123456789012:database/global_temp\",\n    \"arn:aws:glue:eu-west-1:123456789012:database/parquet\",\n  ],\n}\n```\n\n- Make sure you are using the patched operator image\n- Add a config map to your spark job namespace as defined [here](conf/configmap.yaml)\n\n```yaml\napiVersion: v1\ndata:\n  hive-site.xml: |-\n    \u003cconfiguration\u003e\n        \u003cproperty\u003e\n            \u003cname\u003ehive.imetastoreclient.factory.class\u003c/name\u003e\n            \u003cvalue\u003ecom.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory\u003c/value\u003e\n        \u003c/property\u003e\n    \u003c/configuration\u003e\nkind: ConfigMap\nmetadata:\n  namespace: SPARK_JOB_NAMESPACE\n  name: spark-custom-config-map\n```\n\n### Submitting your application\n\nIn order to submit an application with glue support, you need to add a reference to the configmap in your `SparkApplication` spec.\n\n```yaml\nkind: SparkApplication\nmetadata:\n  name: \"my-spark-app\"\n  namespace: SPARK_JOB_NAMESPACE\nspec:\n  sparkConfigMap: spark-custom-config-map\n```\n\n## Working with the spark history server on S3\n\n- Use the appropriate spark version and deploy the [helm](https://github.com/helm/charts/blob/master/stable/spark-history-server/) chart\n\n- Flux / Helm values reference [here](./flux/history.yaml)\n\n## FAQ\n\n- Where can I find a Spark 2 
build with Glue support?\n\n  As Spark 2 becomes less and less relevant, I decided not to add Glue support.\n  You can take a look [here](https://github.com/bbenzikry/spark-glue/blob/main/build.sh) for a reference build script that you can use to build a Spark 2 distribution for the Spark 2 [dockerfile](./docker/spark2.Dockerfile)\n\n- Why a patched operator image?\n\n  The patched image is a simple implementation that lets the spark operator work properly with custom configuration files.\n  It may be submitted as a PR in the future, or another implementation may take its place. For more information, see the related issue https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/216\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbbenzikry%2Fspark-eks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbbenzikry%2Fspark-eks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbbenzikry%2Fspark-eks/lists"}