# spark-on-eks
Examples and custom spark images for working with the spark-on-k8s operator on AWS.
Supports Spark 2 with IRSA, and Spark 3 with both IRSA and AWS Glue as a metastore.
Note: Spark 3 images also include the relevant jars for working with the [S3A committers](https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html).
If you're looking for the Spark 3 custom distributions, you can find them [here](https://github.com/bbenzikry/spark-glue/releases).

**Note**: Spark 2 images will no longer be updated; see the [FAQ](#faq).
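To actually use the bundled committer jars, the committer still has to be enabled in your application's Spark configuration. A minimal sketch, using the settings documented in the Hadoop S3A committers guide linked above (the keys come from that guide, not from this repo; verify them against your Spark/Hadoop versions):

```yaml
# Hypothetical sparkConf fragment for a SparkApplication spec, enabling the
# S3A "directory" staging committer described in the Hadoop committers guide.
sparkConf:
  "spark.hadoop.fs.s3a.committer.name": "directory"
  "spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
  "spark.sql.parquet.output.committer.class": "org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter"
```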
---
[spark-eks-operator on Docker Hub](https://hub.docker.com/r/bbenzikry/spark-eks-operator)
[spark-eks images on Docker Hub](https://hub.docker.com/r/bbenzikry/spark-eks)

## Prerequisites
- Deploy [spark-on-k8s operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) using the [helm chart](https://github.com/helm/charts/tree/master/incubator/sparkoperator) and the [patched operator](https://github.com/bbenzikry/spark-on-k8s-operator/tree/hive-subpath) image `bbenzikry/spark-eks-operator:latest`
Suggested values for the helm chart can be found in the [flux](./flux/operator.yaml) example.
> Note: Do not create the spark service account automatically as part of the chart installation; it is configured separately below.
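A minimal sketch of matching chart values (the key names are assumptions based on the incubator chart's conventions at the time and may differ across chart versions; the [flux](./flux/operator.yaml) example is the maintained reference):

```yaml
# Hypothetical values.yaml for the incubator/sparkoperator helm chart.
# Key names are assumptions; verify against the chart version you install.
operatorImageName: bbenzikry/spark-eks-operator # the patched operator image
operatorVersion: latest
serviceAccounts:
  spark:
    create: false # per the note above, do not auto-create the spark service account
```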
## Using IAM roles for service accounts on EKS
### Creating roles and service account
- Create an AWS role for the driver
- Create an AWS role for the executors (both need an IRSA trust policy; see the sketch at the end of this section)

> [AWS docs on creating policies and roles](https://docs.aws.amazon.com/eks/latest/userguide/create-service-account-iam-policy-and-role.html)
- Add a default service account EKS role for executors in your spark job namespace (optional):
```yaml
# NOTE: Only required if you are not building Spark from source and are running Spark < 3.1.
# From 3.1 on, executor roles rely on the driver definition; before that,
# executors run with the namespace's default service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: SPARK_JOB_NAMESPACE
  annotations:
    # can also be the driver role
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/executor-role"
```

- Make sure the spark service account (used by driver pods) is configured with an EKS role as well:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: SPARK_JOB_NAMESPACE
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/driver-role"
```
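As noted above, each role must trust the cluster's OIDC provider so the matching service account can assume it. A minimal sketch of the standard IRSA trust-policy shape from the AWS docs (OIDC_PROVIDER, ACCOUNT_ID and the role/namespace names are placeholders, not files from this repo):

```json5
{
  // Hypothetical trust policy for the driver role.
  // OIDC_PROVIDER looks like: oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Principal: {
        Federated: "arn:aws:iam::ACCOUNT_ID:oidc-provider/OIDC_PROVIDER",
      },
      Action: "sts:AssumeRoleWithWebIdentity",
      Condition: {
        StringEquals: {
          "OIDC_PROVIDER:sub": "system:serviceaccount:SPARK_JOB_NAMESPACE:spark",
        },
      },
    },
  ],
}
```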
### Building a compatible image

- For spark < 3.0.0, see [spark2.Dockerfile](./docker/spark2.Dockerfile)
- For spark 3.0.0+, see [spark3.Dockerfile](./docker/spark3.Dockerfile)
- For pyspark, see [pyspark.Dockerfile](./docker/pyspark.Dockerfile)
### Submit your spark application with IRSA support
#### Select the right implementation for you
> Below are examples for latest versions.
>
> If you want to use pinned versions, all images are tagged by the commit SHA.
>
> You can find a full list of tags [here](https://hub.docker.com/repository/docker/bbenzikry/spark-eks/tags)

```dockerfile
# spark2
FROM bbenzikry/spark-eks:spark2-latest
# spark3
FROM bbenzikry/spark-eks:spark3-latest
# pyspark2
FROM bbenzikry/spark-eks:pyspark2-latest
# pyspark3
FROM bbenzikry/spark-eks:pyspark3-latest
```

#### Submit your SparkApplication spec
```yaml
hadoopConf:
  # IRSA configuration
  "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
driver:
  .....
  labels:
    .....
  serviceAccount: SERVICE_ACCOUNT_NAME
  # See: https://github.com/kubernetes/kubernetes/issues/82573
  # Note: securityContext has changed to podSecurityContext in recent versions of the operator
  podSecurityContext:
    fsGroup: 65534
```

### Working with AWS Glue as metastore
#### Glue Prerequisites
- Make sure your driver and executor roles have the relevant glue permissions
```json5
{
  /*
  The example below depicts the IAM policy for accessing db1/table1.
  Modify this as you deem worthy for spark application access.
  */
  Effect: "Allow",
  Action: ["glue:*Database*", "glue:*Table*", "glue:*Partition*"],
  Resource: [
    "arn:aws:glue:us-west-2:123456789012:catalog",
    "arn:aws:glue:us-west-2:123456789012:database/db1",
    "arn:aws:glue:us-west-2:123456789012:table/db1/table1",
    "arn:aws:glue:eu-west-1:123456789012:database/default",
    "arn:aws:glue:eu-west-1:123456789012:database/global_temp",
    "arn:aws:glue:eu-west-1:123456789012:database/parquet",
  ],
}
```

- Make sure you are using the patched operator image
- Add a config map to your spark job namespace as defined [here](conf/configmap.yaml):

```yaml
apiVersion: v1
data:
  hive-site.xml: |-
    <configuration>
      <property>
        <name>hive.imetastoreclient.factory.class</name>
        <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
      </property>
    </configuration>
kind: ConfigMap
metadata:
  namespace: SPARK_JOB_NAMESPACE
  name: spark-custom-config-map
```

### Submitting your application
In order to submit an application with glue support, you need to add a reference to the configmap in your `SparkApplication` spec.
```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2" # match your operator's CRD version
kind: SparkApplication
metadata:
  name: "my-spark-app"
  namespace: SPARK_JOB_NAMESPACE
spec:
  sparkConfigMap: spark-custom-config-map
```

## Working with the spark history server on S3
- Use the appropriate spark version and deploy the [helm](https://github.com/helm/charts/blob/master/stable/spark-history-server/) chart
- Flux / Helm values reference [here](./flux/history.yaml)
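A minimal sketch of the values involved (the `s3` key names are assumptions based on the stable chart's conventions, and MY_BUCKET is a placeholder; the [flux](./flux/history.yaml) reference above is the maintained configuration):

```yaml
# Hypothetical values for the stable/spark-history-server chart: read Spark
# event logs from S3 using the pod's IAM role (IRSA) instead of access keys.
s3:
  enableS3: true
  enableIAM: true
  logDirectory: s3a://MY_BUCKET/spark-events
```

For the history server to have anything to show, applications must write their event logs to the same location, e.g. `"spark.eventLog.enabled": "true"` and `"spark.eventLog.dir": "s3a://MY_BUCKET/spark-events"` in the `sparkConf` of your `SparkApplication` spec.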
## FAQ
- Where can I find a Spark 2 build with Glue support?

  As Spark 2 becomes less and less relevant, I opted not to add Glue support for it.
  You can take a look [here](https://github.com/bbenzikry/spark-glue/blob/main/build.sh) for a reference build script you can use to create a Spark 2 distribution for the Spark 2 [dockerfile](./docker/spark2.Dockerfile).

- Why a patched operator image?
  The patched image is a simple implementation that makes custom configuration files work properly with the spark operator.
  It may be submitted as a PR in the future, or another implementation may take its place. For more information, see the related issue: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/216