{"id":14982285,"url":"https://github.com/rishav273/spark-cluster-multi-node-setup","last_synced_at":"2026-01-04T15:58:24.124Z","repository":{"id":255820995,"uuid":"849704376","full_name":"Rishav273/spark-cluster-multi-node-setup","owner":"Rishav273","description":"Quickly setup and simulate a multi node spark cluster using docker and docker-compose.","archived":false,"fork":false,"pushed_at":"2024-09-13T07:47:50.000Z","size":171,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-09-29T07:01:37.291Z","etag":null,"topics":["docker","docker-compose","pyspark","python3","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rishav273.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-30T04:58:35.000Z","updated_at":"2024-09-13T07:47:53.000Z","dependencies_parsed_at":"2024-09-13T17:52:36.748Z","dependency_job_id":null,"html_url":"https://github.com/Rishav273/spark-cluster-multi-node-setup","commit_stats":{"total_commits":27,"total_committers":2,"mean_commits":13.5,"dds":0.4814814814814815,"last_synced_commit":"d6621719e26ff7756c2a43e47fafe9c1306b09b0"},"previous_names":["rishav273/spark-cluster-multi-node-setup"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rishav273%2Fspark-cluster-multi-node-setup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rishav273%2Fspark-cluster-multi-nod
e-setup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rishav273%2Fspark-cluster-multi-node-setup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rishav273%2Fspark-cluster-multi-node-setup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Rishav273","download_url":"https://codeload.github.com/Rishav273/spark-cluster-multi-node-setup/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219857798,"owners_count":16556054,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","docker-compose","pyspark","python3","spark"],"created_at":"2024-09-24T14:05:04.615Z","updated_at":"2025-10-29T11:31:27.177Z","avatar_url":"https://github.com/Rishav273.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://github.com/user-attachments/assets/f2d90a13-94e3-4c9b-91e3-8b74d8b0e85f\" alt=\"Image description\" width=\"900\" /\u003e\n\n\n## Setting Up a Multi-Node Spark Cluster Locally Using Docker and Docker Compose\n\nThis guide will walk you through setting up a multi-node Apache Spark cluster locally using Docker and Docker Compose. Follow the steps below to get started.\n\n### Prerequisites\nBefore starting, make sure you have the following software installed on your system:\n\nDocker Desktop: Docker allows you to containerize applications. 
If you don’t have Docker Desktop installed, you can download and install it from the official Docker documentation:\nhttps://docs.docker.com/engine/install/\n\nDocker Compose: Docker Compose is a tool for defining and running multi-container Docker applications. You can install Docker Compose following the instructions here:\nhttps://docs.docker.com/compose/install/\n\nNote that Docker Desktop includes Docker Compose by default, so you might not need to install it separately if you have Docker Desktop.\n\nPython 3: Python is required for running certain scripts. Download and install Python 3 from the official Python website:\nhttps://www.python.org/downloads/\n\nGit: Git is needed for cloning repositories. Install Git from the official Git website:\nhttps://git-scm.com/downloads/\n\n### Setup instructions:\n\n* First, clone the repository that contains the Docker configuration for the Spark cluster:\n  ```\n  git clone https://github.com/Rishav273/spark-cluster-multi-node-setup.git\n  ```\n  \n* Change your working directory to the folder where the repository was cloned:\n  ```\n  cd spark-cluster-multi-node-setup\n  ```\n  \n* Create a local virtual environment for installing all dependencies and activate it:\n  ```\n  python -m venv venv # windows\n  venv\\Scripts\\activate # windows\n  \n  python3 -m venv venv # macOS/Linux\n  source venv/bin/activate # macOS/Linux\n  ```\n\n* Additional configurations:\n\n  - All secret keys, credentials, and other sensitive information should be stored in a dedicated secrets folder.\n  ```\n  mkdir secrets\n  ```\n  - This folder should be mounted to each container using Docker volumes, as specified in the docker-compose.yml file.\n  - The config.py file in the config sub-directory (inside the scripts directory) defines the bucket_name, the file paths, and the service account file path. 
These can be changed as required.\n\n\n* Install the necessary Python dependencies listed in the requirements.txt file:\n  ```\n  pip install -r requirements.txt\n  ```\n  \n  \n* Use Docker Compose to bring up the Spark cluster in detached mode. This will start all the containers defined in the docker-compose.yml file:\n  ```\n  docker-compose up --build -d  # Run this command the first time to build and start the cluster.\n  docker-compose up -d          # Use this command to start the cluster after the initial build (not needed immediately after the first build since the cluster will already be running).\n  docker-compose stop           # Stop the running cluster.\n  docker-compose down           # Shut down and remove all containers in the cluster.\n  ```\n\n  The -d flag runs the containers in detached mode, meaning they will run in the background.\n  \n\n* After starting the containers, verify that the Spark cluster is up and running by opening the Spark Web UI in your browser:\n  ```\n  http://localhost:8080/\n  ```\n  You should see the Spark master web interface, indicating that your multi-node Spark cluster is running correctly.\n\n\n#### Note\nIn the ```scripts``` directory, there are Python scripts with PySpark code, including:\n- ```simple_spark_job.py``` -\u003e A basic PySpark application that creates a dummy DataFrame and performs aggregations on it.\n- ```read_from_gcp.py``` -\u003e An application that reads files from a Google Cloud Storage (GCS) bucket and performs aggregations on the data.\n\nAdditionally, custom scripts can be created and added to this directory as needed.\n\n### Run commands:\nAll run commands are available in the ```commands.sh``` file. \n\n### Cluster Configurations:\nChanges to the cluster can be made by modifying the ```docker-compose.yml``` file. 
For instance, you can increase the number of workers by adding more instances of the worker configuration in the file.\n  \n### Conclusion\nYou have successfully set up a multi-node Spark cluster locally using Docker and Docker Compose. You can now use this environment for simulating distributed data processing activities.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frishav273%2Fspark-cluster-multi-node-setup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frishav273%2Fspark-cluster-multi-node-setup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frishav273%2Fspark-cluster-multi-node-setup/lists"}