{"id":14982342,"url":"https://github.com/gr-menon/spark-bazaar","last_synced_at":"2026-01-26T12:35:37.466Z","repository":{"id":248912414,"uuid":"830135625","full_name":"GR-Menon/Spark-Bazaar","owner":"GR-Menon","description":"A collection of Apache Spark cluster setups using Docker","archived":false,"fork":false,"pushed_at":"2024-10-17T13:02:14.000Z","size":22920,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-02T01:31:43.477Z","etag":null,"topics":["apache-spark","docker","docker-compose"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GR-Menon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-17T16:58:19.000Z","updated_at":"2024-12-13T12:14:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"11f304d1-9f00-4098-b2eb-3bb57c64761a","html_url":"https://github.com/GR-Menon/Spark-Bazaar","commit_stats":{"total_commits":5,"total_committers":2,"mean_commits":2.5,"dds":0.4,"last_synced_commit":"3dc479e2b43972d928d737aa9b8f83ab19f93c65"},"previous_names":["gr-menon/spark-bazaar"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GR-Menon%2FSpark-Bazaar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GR-Menon%2FSpark-Bazaar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GR-Menon%2FSpark-Bazaar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/GR-Menon%2FSpark-Bazaar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GR-Menon","download_url":"https://codeload.github.com/GR-Menon/Spark-Bazaar/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238825723,"owners_count":19537117,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","docker","docker-compose"],"created_at":"2024-09-24T14:05:14.096Z","updated_at":"2025-10-29T12:31:33.055Z","avatar_url":"https://github.com/GR-Menon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark Bazaar\nA collection of Apache Spark cluster setups using Docker.\n\n![Spark Bazaar](assets/spark-bazaar.png)    \n\u003c/br\u003e   \n\n# Context\n\nThis is a precursor to the BigBanyanTree project, an initiative to empower engineering colleges to set up their data engineering clusters and drive interest in data processing and analysis using tools such as Apache Spark.\n\nThis work was done in collaboration with [Suchit G](https://www.linkedin.com/in/suchitg04/) under the guidance of [Harsh Singhal](https://www.linkedin.com/in/harshsinghal/).\n\nThe data extracted using the `Spark Cluster - Hetzner` setup has been open-sourced on [HuggingFace](https://huggingface.co/big-banyan-tree).\n\n\u003c/br\u003e  \n\n# Basic Cluster\n\nThis is a basic [Apache Spark](https://spark.apache.org/) cluster recreated from this blog:     \n\n\u003c/br\u003e   \n\n\u003e [Spark Standalone Cluster on 
Docker](https://medium.com/@MarinAgli1/setting-up-a-spark-standalone-cluster-on-docker-in-layman-terms-8cbdc9fdd14b)\n\n\u003c/br\u003e    \n\nThe cluster comprises a single `Docker` image running Apache Spark, with its different services orchestrated using \n`Docker Compose`. It uses an entrypoint shell script to start up different services based on the Spark workload, such as `spark-master`, `spark-worker` and `spark-history-server`.\n\nWe also make use of a `Makefile` for ease of spinning up and tearing down the Spark cluster services.\n\nTo run the basic cluster, navigate to the `Spark Cluster - Basic` directory and run:\n```bash\nmake run-scaled\n```\nThis will spin up a standalone Spark cluster with 2 worker nodes.\n\n\u003c/br\u003e\n\n# Jupyterlab Cluster\n\nThis is an Apache Spark cluster in standalone mode, accompanied by a user-friendly Jupyterlab interface to run Spark jobs. This cluster setup is based on this blog:\n\n\u003c/br\u003e\n\n\u003e [Apache Spark Cluster with Jupyterlab Interface](https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445)\n\n\u003c/br\u003e\n\nThis setup takes a slightly different approach from the one before. Here, we make use of separate Docker images for each of the cluster services such as `spark-master`, `spark-worker`, `jupyterlab` and so on. As before, all these separate Docker images are orchestrated using Docker Compose.\n\nThe individual service images also make use of a common `cluster-base` Docker image to build the service on.\n\nTo run the jupyterlab cluster, navigate to the `Spark Cluster - Jupyterlab` directory and run:\n```bash\nchmod +x build.sh\nchmod +x run.sh\n./build.sh\n./run.sh\n```\n\nThis will spin up a standalone Spark cluster with 2 worker nodes and a Jupyterlab interface.\n\n\u003c/br\u003e\n\n# Hetzner Cluster\n\nThis is the Apache Spark cluster setup used in the BigBanyanTree project. 
It takes a hybrid approach, drawing on lessons from the previous two cluster setups.\n\nWe use the following Docker images:\n- `hetzner-base`: Base image for all services\n- `spark-cluster`: Apache Spark image with functionality for `spark-master`, `spark-worker` \u0026 `spark-history-server`\n- `jupyterlab`: Image for Jupyterlab interface\n- `llama8b`: Image for `Meta-Llama-3.1-8B-Instruct` service using `llamafile`\n\nCheck out the `llama8b` service setup here: https://datascience.fm/llamafile-an-executable-llm/\n\nCheck out a detailed explanation of this entire setup here: https://datascience.fm/zero-to-spark-apache-spark-cluster-setup/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgr-menon%2Fspark-bazaar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgr-menon%2Fspark-bazaar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgr-menon%2Fspark-bazaar/lists"}