https://github.com/guptaakashdeep/spark-minio-project
Builds a Spark Standalone Cluster on Docker in local with MinIO integration
apache-spark minio open-table-format
- Host: GitHub
- URL: https://github.com/guptaakashdeep/spark-minio-project
- Owner: guptaakashdeep
- License: mit
- Created: 2024-12-21T18:01:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-22T19:52:03.000Z (over 1 year ago)
- Last Synced: 2025-06-14T04:41:21.859Z (12 months ago)
- Topics: apache-spark, minio, open-table-format
- Language: Jupyter Notebook
- Homepage:
- Size: 13.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# spark-minio-project
Builds a Spark Standalone Cluster on Docker in local with MinIO integration.
This can serve as a base project for building and experimenting with different integrations that use Spark as the compute engine and MinIO as object storage.
A few integration examples include:
- Open Format Tables (Apache Iceberg, Apache Hudi and Delta Lake)
- OLAP engines such as StarRocks, for analytics on tables populated by Spark and on data stored in MinIO
- Airflow integration to run Spark Jobs from Airflow
> This project follows the structure of this [Repo](https://github.com/mrn-aglic/spark-standalone-cluster/blob/main/Readme.md) for a Spark Standalone cluster, with modifications made on top of it.
## Getting Started
To make it easy to get started, the project includes a [Makefile](Makefile).
On Windows, `make` commands might not work; in that case, run the corresponding docker commands from the Makefile directly.
### Makefile Commands
#### Downloading the Jars Used by the Spark Standalone Cluster
---
Download the required jars from the Maven repository for use in the Spark Standalone Cluster:
```bash
make jars-download
```
> The jars to be downloaded are listed in the [Jar Downloader Bash Script](jar-downloader.sh)
This creates a `jars` folder in the repo and downloads all the listed jars into it.
The list currently includes the jars needed for MinIO integration and for reading from and writing to Iceberg tables.
**Note:** *To have more jars downloaded, add them to `JAR_MAPPING` in `jar-downloader.sh`, using the format `jar_name|maven_path`.*
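As an illustration of the `jar_name|maven_path` format, here is a minimal Python sketch of how one such entry could be split and turned into a download URL. It is not the repo's Bash script. The base URL (Maven Central), the assumption that `maven_path` is the directory holding the jar, and the helper names `parse_jar_mapping` and `download_url` are all assumptions made for this example.

```python
from typing import NamedTuple

# Assumption: jars come from Maven Central; the repo's script may use another base URL.
MAVEN_BASE_URL = "https://repo1.maven.org/maven2"


class JarEntry(NamedTuple):
    jar_name: str
    maven_path: str


def parse_jar_mapping(entry: str) -> JarEntry:
    """Split one `jar_name|maven_path` entry into its two fields."""
    jar_name, sep, maven_path = entry.partition("|")
    if not sep or not jar_name.strip() or not maven_path.strip():
        raise ValueError(f"expected 'jar_name|maven_path', got: {entry!r}")
    # Drop surrounding whitespace and slashes so the URL joins cleanly.
    return JarEntry(jar_name.strip(), maven_path.strip().strip("/"))


def download_url(entry: JarEntry, base_url: str = MAVEN_BASE_URL) -> str:
    """Build a URL for the entry, assuming maven_path is the jar's directory."""
    return f"{base_url.rstrip('/')}/{entry.maven_path}/{entry.jar_name}"
```

For example, a hypothetical entry `hadoop-aws-3.3.4.jar|org/apache/hadoop/hadoop-aws/3.3.4` would resolve to a URL under `org/apache/hadoop/hadoop-aws/3.3.4/`. Check `jar-downloader.sh` for the exact path convention before adding entries.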
#### Building the Docker Images
---
```bash
make build-nc
```
Builds the images defined in the [Dockerfile](Dockerfile), which include:
- Spark 3.5 (for master, worker and SparkHistory Server)
- MinIO
- JupyterLab
#### Running the containers
---
```bash
make run
```
Starts all the services once the images are built.
This creates a Spark Standalone Cluster with *1 worker node*.
**To create a multi worker cluster**, run
```bash
make run-scaled
```
This creates a Spark Standalone Cluster with *3 worker nodes*. To change the number of worker nodes, update the number directly in the `run-scaled` command in the `Makefile`.
Once all the services are up and running, they can be accessed locally at the following URLs:
- Spark Master: `localhost:9090`
- Spark History Server: `localhost:18080`
- MinIO UI: `localhost:9001`
- Jupyter Notebook Server: `localhost:8888`
Spark Master starts at `spark://spark-master:7077`
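A quick way to confirm that the services listed above are reachable is to try a TCP connection to each port. The sketch below uses only the Python standard library. The `SERVICES` labels and the helper names are chosen for this example, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket
from typing import Dict, Optional

# Host ports as listed above; the keys are just labels for this sketch.
SERVICES = {
    "Spark Master": 9090,
    "Spark History Server": 18080,
    "MinIO UI": 9001,
    "Jupyter Notebook Server": 8888,
}


def is_port_open(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: Optional[Dict[str, int]] = None) -> Dict[str, bool]:
    """Map each service label to whether its port accepts connections."""
    services = SERVICES if services is None else services
    return {name: is_port_open(port) for name, port in services.items()}
```

Calling `check_services()` after `make run` returns a dictionary such as `{"Spark Master": True, ...}`; any `False` entry points at a container that has not started or a port that is mapped differently.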
#### Integration Testing and Submitting PySpark Scripts
---
The `spark_apps` folder contains scripts to verify that the integration works once everything is up and running.
- To test the integration, run:
```bash
make submit app=spark_minio_test.py
```
Once this runs successfully, open the MinIO UI, log in with the `ROOT_USER` and `ROOT_PASSWORD` values from the [.env file](.env), and you will see the data in the `warehouse` bucket.
- To submit a Spark job:
```bash
make submit app=pyfilename.py # file present in spark_apps folder
```
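For reference, a custom script placed in `spark_apps` might look roughly like the sketch below. It is not the repo's `spark_minio_test.py`. The MinIO endpoint `http://minio:9000`, the availability of `ROOT_USER` and `ROOT_PASSWORD` as environment variables inside the Spark containers, and the output path `example_table` are assumptions; the `fs.s3a.*` keys are standard Hadoop S3A settings.

```python
import os

# Assumptions for this sketch; adjust to match the repo's compose file and .env.
MINIO_ENDPOINT = "http://minio:9000"
MASTER_URL = "spark://spark-master:7077"


def s3a_conf(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Spark settings that point the Hadoop S3A connector at a MinIO endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO serves buckets as URL paths rather than subdomains.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def build_session(app_name: str = "minio-example"):
    """Create a SparkSession configured for MinIO (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helpers work without pyspark

    builder = SparkSession.builder.appName(app_name).master(MASTER_URL)
    conf = s3a_conf(
        MINIO_ENDPOINT,
        os.environ.get("ROOT_USER", ""),
        os.environ.get("ROOT_PASSWORD", ""),
    )
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def main() -> None:
    """Write a tiny DataFrame to the warehouse bucket as Parquet."""
    spark = build_session()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.mode("overwrite").parquet("s3a://warehouse/example_table")
    spark.stop()
```

When saving this as a file in `spark_apps`, add a call to `main()` at the bottom so the job runs on submission. If the cluster already sets the S3A options in its Spark defaults, the `s3a_conf` step is redundant and can be dropped.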
#### To shut down everything
---
```bash
make down
```
### Data Mapping
---
To persist the data written via Spark and to make input files available to jobs:
- MinIO bucket data: the `minio_data` folder in the repo is mapped to the MinIO container volume.
- Input files for jobs running on the Spark cluster can either be uploaded directly to MinIO buckets from the MinIO UI at `localhost:9001` or placed in the `data` folder in the repo.
- The `data` folder is mapped to `/opt/spark/data` in the Spark cluster.
- To read data from there in Spark jobs:
`spark.read.parquet('/opt/spark/data/file.parquet')`
> It's recommended to upload files to MinIO directly, so that the same data is easy to read from both Spark jobs and the Jupyter notebooks.
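The two input locations described above can be wrapped in a small helper so that a job switches between them with one argument. This is a sketch: `input_path` is a hypothetical helper, and the `s3a://` form assumes the S3A connector is configured for MinIO in the Spark session.

```python
from pathlib import PurePosixPath
from typing import Optional

# Mount point of the repo's `data` folder inside the Spark containers (from this README).
LOCAL_DATA_DIR = PurePosixPath("/opt/spark/data")


def input_path(name: str, bucket: Optional[str] = None) -> str:
    """Return a path for spark.read: the local mount by default, MinIO if a bucket is given."""
    relative = name.lstrip("/")
    if bucket is None:
        return str(LOCAL_DATA_DIR / relative)
    # Assumes the S3A connector is set up to talk to MinIO.
    return f"s3a://{bucket}/{relative}"
```

A job could then call `spark.read.parquet(input_path("file.parquet"))` for a file in the `data` folder, or `spark.read.parquet(input_path("file.parquet", bucket="warehouse"))` for a file uploaded to the `warehouse` bucket.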