Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rishav273/spark-cluster-multi-node-setup
Quickly set up and simulate a multi-node Spark cluster using Docker and Docker Compose.
Topics: docker, docker-compose, pyspark, python3, spark
- Host: GitHub
- URL: https://github.com/rishav273/spark-cluster-multi-node-setup
- Owner: Rishav273
- Created: 2024-08-30T04:58:35.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-13T07:47:50.000Z (5 months ago)
- Last Synced: 2024-09-29T07:01:37.291Z (4 months ago)
- Topics: docker, docker-compose, pyspark, python3, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 167 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Setting Up a Multi-Node Spark Cluster Locally Using Docker and Docker Compose
This guide will walk you through setting up a multi-node Apache Spark cluster locally using Docker and Docker Compose. Follow the steps below to get started.
### Prerequisites
Before starting, make sure you have the following software installed on your system:

* Docker Desktop: Docker allows you to containerize applications. If you don’t have Docker Desktop installed, you can download and install it from the official Docker documentation: https://docs.docker.com/engine/install/
* Docker Compose: Docker Compose is a tool for defining and running multi-container Docker applications. You can install Docker Compose by following the instructions here: https://docs.docker.com/compose/install/ Note that Docker Desktop includes Docker Compose by default, so you might not need to install it separately.
* Python 3: Python is required for running certain scripts. Download and install Python 3 from the official Python website: https://www.python.org/downloads/
* Git: Git is needed for cloning repositories. Install Git from the official Git website: https://git-scm.com/downloads/

### Setup instructions:
* First, clone the repository that contains the Docker configuration for the Spark cluster:
```
git clone https://github.com/Rishav273/spark-cluster-multi-node-setup.git
```
* Change your working directory to the folder where the repository was cloned:
```
cd spark-cluster-multi-node-setup
```
* Create a local virtual environment for installing all dependencies and activate it:
```
python -m venv venv # windows
venv\Scripts\activate # windows
python3 -m venv venv # macOS
source venv/bin/activate # macOS
```
* Additional configurations:
- All secret keys, credentials, and other sensitive information should be stored in a dedicated secrets folder.
```
mkdir secrets
```
- This folder should be mounted to each container using Docker volumes, as specified in the docker-compose.yml file.
- In the config.py file in the config sub-directory (inside the scripts directory), the bucket_name, input file paths, and service account file path are defined. These can be changed as required (a hypothetical sketch of this module is shown after the dependency-install step below).
* Install the necessary Python dependencies listed in the requirements.txt file:
```
pip install -r requirements.txt
```
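For reference, the configuration module mentioned in the additional configurations above might look roughly like the sketch below. This is a hypothetical illustration only: the actual variable names and values in the repository's ```config.py``` may differ, and the secrets mount path shown here is an assumption.
```
# Hypothetical sketch of scripts/config/config.py; names, paths, and values
# are assumptions and should be adapted to the actual repository.
import os

# Folder mounted into each container via Docker volumes (see docker-compose.yml).
# The mount point below is assumed, not taken from the repository.
SECRETS_DIR = os.getenv("SECRETS_DIR", "/opt/spark/secrets")

# GCS settings (placeholder values).
bucket_name = "your-gcs-bucket-name"
files = ["data/sample.csv"]  # object paths inside the bucket

# Service account key kept in the secrets folder.
service_account_file_path = os.path.join(SECRETS_DIR, "service_account.json")
```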
* Use Docker Compose to bring up the Spark cluster in detached mode. This will start all the containers defined in the docker-compose.yml file:
```
docker-compose up --build -d # Run this command the first time to build and start the cluster.
docker-compose up -d # Use this command to start the cluster after the initial build (not needed immediately after the first build since the cluster will already be running).
docker-compose stop # Stop the running cluster.
docker-compose down # Shut down and remove all containers in the cluster.
```
The -d flag runs the containers in detached mode, meaning they will run in the background.
* After starting the containers, verify that the Spark cluster is up and running by opening the Spark Web UI in your browser:
```
http://localhost:8080/
```
You should see the Spark master web interface, indicating that your multi-node Spark cluster is running correctly.

#### Note
In the ```scripts``` directory, there are Python scripts with PySpark code, including:
- ```simple_spark_job.py``` -> A basic PySpark application that creates a dummy DataFrame and performs aggregations on it.
- ```read_from_gcp.py``` -> An application that reads files from a Google Cloud Storage (GCS) bucket and performs aggregations on the data.

Additionally, custom scripts can be created and added to this directory as needed.
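For orientation, a minimal job in the spirit of ```simple_spark_job.py``` might look like the sketch below. This is not the repository's actual code; the application name, master URL (```spark://spark-master:7077```), and column names are assumptions and should be checked against ```docker-compose.yml```.
```
# Hedged sketch of a simple PySpark job (not the repository's actual simple_spark_job.py).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The master URL is an assumption; inside the cluster the master is often
# reachable as spark://spark-master:7077, but check docker-compose.yml.
spark = (
    SparkSession.builder
    .appName("simple_spark_job_sketch")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# Build a small dummy DataFrame and aggregate it.
data = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5)]
df = spark.createDataFrame(data, ["key", "value"])

df.groupBy("key").agg(
    F.count("*").alias("rows"),
    F.sum("value").alias("total"),
    F.avg("value").alias("average"),
).show()

spark.stop()
```
Similarly, a read in the spirit of ```read_from_gcp.py``` would typically point the GCS Hadoop connector at the service account key stored in the secrets folder. The sketch below assumes the GCS connector is available to the cluster and uses placeholder bucket, file, and key paths.
```
# Hedged sketch of reading from GCS; bucket, file, and key paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read_from_gcp_sketch")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/opt/spark/secrets/service_account.json")  # assumed secrets mount path
    .getOrCreate()
)

# In practice the bucket and file paths would come from config.py.
df = spark.read.csv("gs://your-gcs-bucket-name/data/sample.csv", header=True, inferSchema=True)
df.groupBy(df.columns[0]).count().show()

spark.stop()
```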
### Run commands:
All run commands are available in the ```commands.sh``` file.

### Cluster Configurations:
Changes to the cluster can be made by modifying the ```docker-compose.yml``` file. For instance, you can increase the number of workers by adding more instances of the worker configuration in the file.
### Conclusion
You have successfully set up a multi-node Spark cluster locally using Docker and Docker Compose. You can now use this environment to simulate distributed data processing workloads.