Run Apache Spark multicloud and serverless.
https://github.com/ferranbt/sparkanywhere
- Host: GitHub
- URL: https://github.com/ferranbt/sparkanywhere
- Owner: ferranbt
- Created: 2024-02-21T17:36:50.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-03-20T15:59:14.000Z (8 months ago)
- Last Synced: 2024-06-21T08:05:21.623Z (5 months ago)
- Topics: kubernetes, serverless, spark
- Language: Go
- Homepage:
- Size: 87.9 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Sparkanywhere
`Sparkanywhere` is a proof of concept for running Apache Spark multicloud and serverless on top of a container scheduler.
Unlike traditional Spark setups (`Yarn` or `Kubernetes`), which require resources to be planned and provisioned ahead of time, `sparkanywhere` provisions all computing resources on demand and only for as long as the job runs.
It does not rely on any hosted Spark solution and can work on top of any service that provides container deployment and inter-service DNS discovery (e.g. `docker`, `AWS ECS`).
`Sparkanywhere` deploys the Spark job as a Kubernetes task and shims the Kubernetes API to deploy the Pods (i.e. Spark tasks) on a different container scheduler. If the scheduler is a container-as-a-service offering such as ECS, the computation is serverless.
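To make the shim idea concrete, here is a minimal Go sketch of a scheduler-agnostic provider interface that a Kubernetes API shim could dispatch Pod creations to. All names here are hypothetical, not the actual `sparkanywhere` API:

```go
// Hypothetical sketch: the shim exposes a Kubernetes-like API to the Spark
// driver and translates each Pod creation into a call on this interface.
package shim

import "context"

// TaskSpec is the subset of a Pod spec a scheduler needs to run a Spark task.
type TaskSpec struct {
	Name   string   // task name, e.g. the Spark executor pod name
	Image  string   // Spark container image
	Args   []string // command-line arguments for the container
	CPU    int64    // requested CPU in millicores
	Memory int64    // requested memory in MiB
}

// Provider abstracts the container scheduler (local Docker, ECS Fargate, ...).
type Provider interface {
	// CreateTask starts a container for the given spec and returns its ID.
	CreateTask(ctx context.Context, spec TaskSpec) (string, error)
	// StopTask terminates a running task.
	StopTask(ctx context.Context, id string) error
	// Address returns a DNS name or IP where other tasks can reach this
	// one, standing in for Kubernetes service discovery.
	Address(ctx context.Context, id string) (string, error)
}
```

Under this reading, adding a new cloud provider means implementing one such interface rather than touching Spark itself.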
Supported providers:
- [`docker`](#run-with-docker): Local Docker provider.
- [`ecs`](#run-with-ecs): Amazon Elastic Container Service with Fargate.

## Architecture
This is the sequence diagram of the system:
![sequence diagram](./sequence_diagram.png)
## Usage
Clone the repository:
```bash
git clone git@github.com:ferranbt/sparkanywhere.git
cd sparkanywhere
```

The example runs the built-in Pi example from Spark with one distributed worker.
### Run with Docker
Run the example using Docker as the scheduler:
```bash
go run main.go --docker [--instances 1]
```

### Run with ECS
First, you have to create an ECS cluster and a VPC with a public subnet; the tasks must run in a public subnet so they can pull the public Spark Docker images. The bundled Terraform configuration sets this up:
```bash
$ cd terraform
$ terraform apply
```

Once it completes, it should output the name of the cluster, the ID of the security group, and the ID of the public subnet:
```bash
$ terraform output
ecs_cluster_name = "..."
security_group = "..."
subnet = "..."
```

In order for the driver task to find the K8s API of `sparkanywhere`, the binary must be executed on a machine with a reachable IP address (see the architecture diagram):
```bash
go run main.go --ecs --ecs-cluster <cluster-name> --ecs-security-group <security-group-id> --ecs-subnet-id <subnet-id> --control-plane-address <reachable-address>
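# For example, wiring in the terraform outputs shown above (illustrative
# sketch; the flag names come from this README, the output names from the
# `terraform output` block, and PUBLIC_IP is a placeholder you must set to
# this machine's reachable address):
go run main.go --ecs \
  --ecs-cluster "$(terraform -chdir=terraform output -raw ecs_cluster_name)" \
  --ecs-security-group "$(terraform -chdir=terraform output -raw security_group)" \
  --ecs-subnet-id "$(terraform -chdir=terraform output -raw subnet)" \
  --control-plane-address "$PUBLIC_IP"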
```

## Future work
- Add support for other cloud providers like `GCP` or `Azure`.
- Parametrize the Spark job to run.
- Load tasks from S3 buckets.