{"id":19017367,"url":"https://github.com/ferranbt/sparkanywhere","last_synced_at":"2026-04-18T00:03:12.062Z","repository":{"id":223730711,"uuid":"761344158","full_name":"ferranbt/sparkanywhere","owner":"ferranbt","description":"Run Apache spark multicloud and serverless","archived":false,"fork":false,"pushed_at":"2024-03-20T15:59:14.000Z","size":90,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-01T23:30:01.806Z","etag":null,"topics":["kubernetes","serverless","spark"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ferranbt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-21T17:36:50.000Z","updated_at":"2024-02-21T17:44:23.000Z","dependencies_parsed_at":"2024-06-21T06:55:50.635Z","dependency_job_id":"212db986-cabb-4d33-8f7c-306ad44f7bb7","html_url":"https://github.com/ferranbt/sparkanywhere","commit_stats":null,"previous_names":["ferranbt/sparkanywhere"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ferranbt%2Fsparkanywhere","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ferranbt%2Fsparkanywhere/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ferranbt%2Fsparkanywhere/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ferranbt%2Fsparkanywhere/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/ferranbt","download_url":"https://codeload.github.com/ferranbt/sparkanywhere/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240061613,"owners_count":19742061,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kubernetes","serverless","spark"],"created_at":"2024-11-08T19:46:45.545Z","updated_at":"2026-04-18T00:03:07.006Z","avatar_url":"https://github.com/ferranbt.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sparkanywhere\n\n`Sparkanywhere` is a proof of concept to run Apache Spark multicloud and serverless on top of a container scheduler.\n\nUnlike traditional Spark setups (`Yarn` or `Kubernetes`) that require pre-provisioning and planning of resources, in `sparkanywhere` all the computing resources are provisioned on demand and only for the time that the job is running.\n\nIt does not rely on any hosted Spark solution and it can work on top of any service that provides container deployment and inter-service DNS discovery (e.g. `docker`, `aws ecs`).\n\n`Sparkanywhere` deploys the Spark job as a Kubernetes task and [shims](\u003chttps://www.google.com/url?sa=t\u0026rct=j\u0026q=\u0026esrc=s\u0026source=web\u0026cd=\u0026ved=2ahUKEwjb34G9_LyEAxVVUaQEHW-VDqUQFnoECBYQAQ\u0026url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FShim_(computing)\u0026usg=AOvVaw2UIsBXHBUIOXEQZiJP3vXL\u0026opi=89978449\u003e) the Kubernetes API to deploy the Pods (i.e. Spark tasks) on a different container scheduler. 
If the scheduler is a container-as-a-service platform such as ECS, the computation is serverless.\n\nSupported providers:\n\n- [`docker`](#run-with-docker): Local Docker provider.\n- [`ecs`](#run-with-ecs): Amazon Elastic Container Service with Fargate.\n\n## Architecture\n\nThis is a sequence diagram of the system:\n\n![sequence diagram](./sequence_diagram.png)\n\n## Usage\n\nClone the repository:\n\n```bash\ngit clone git@github.com:ferranbt/sparkanywhere.git\n```\n\nThe example runs the built-in Pi example from Spark with one distributed worker.\n\n### Run with Docker\n\nRun the example using Docker as a scheduler:\n\n```bash\ngo run main.go --docker [--instances 1]\n```\n\n### Run with ECS\n\nFirst, you have to create an ECS cluster and a VPC with a public subnet. The tasks must run in a public subnet to pull the public Spark Docker images.\n\n```bash\n$ cd terraform\n$ terraform apply\n```\n\nOnce it completes, it should output the name of the cluster, the ID of the security group and the ID of the public subnet.\n\n```bash\n$ terraform output\necs_cluster_name = \"...\"\nsecurity_group = \"...\"\nsubnet = \"...\"\n```\n\nIn order for the driver task to find the K8s API of `sparkanywhere`, the binary must be executed on a machine with a reachable IP address (see architecture diagram).\n\n```bash\ngo run main.go --ecs --ecs-cluster \u003ccluster name\u003e --ecs-security-group \u003csecurity group id\u003e --ecs-subnet-id \u003csubnet id\u003e --control-plane-address \u003cpublic ip of sparkanywhere\u003e\n```\n\n## Future work\n\n- Add support for other cloud providers like `GCP` or `Azure`.\n- Parametrize the Spark job to run.\n- Load tasks from S3 buckets.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fferranbt%2Fsparkanywhere","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fferranbt%2Fsparkanywhere","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fferranbt%2Fsparkanywhere/lists"}