https://github.com/guptaakashdeep/spark-minio-project

Builds a Spark Standalone Cluster on Docker in local with MinIO integration

apache-spark minio open-table-format

Host: GitHub
URL: https://github.com/guptaakashdeep/spark-minio-project
Owner: guptaakashdeep
License: mit
Created: 2024-12-21T18:01:10.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-22T19:52:03.000Z (over 1 year ago)
Last Synced: 2025-06-14T04:41:21.859Z (12 months ago)
Topics: apache-spark, minio, open-table-format
Language: Jupyter Notebook
Homepage:
Size: 13.7 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# spark-minio-project
Builds a Spark Standalone cluster on Docker, running locally, with MinIO integration.

This can serve as a base project for building and experimenting with different integrations, with Spark as the compute engine and MinIO as the object storage.

A few integration examples include:
- Open table formats (Apache Iceberg, Apache Hudi, and Delta Lake)
- OLAP engines such as StarRocks, for analytics on tables populated from Spark with data stored in MinIO
- Airflow integration, to run Spark jobs from Airflow

> This project is based on the structure of this [Repo](https://github.com/mrn-aglic/spark-standalone-cluster/blob/main/Readme.md) for a Spark Standalone cluster, with modifications made on top of it.

## Getting Started
To make getting started easy, the project includes a [Makefile](Makefile).

If you are on Windows, `make` commands might not work. In that case, run the corresponding Docker commands from the Makefile directly.

### Makefile Commands

#### Downloading the Jars Used by the Spark Standalone Cluster
---
Download the required jars from the Maven repository for use in the Spark Standalone cluster:

```bash
make jars-download
```

> The jars to be downloaded are listed in the [Jar Downloader Bash Script](jar-downloader.sh).

This creates a `jars` folder in the repo and downloads all the listed jars into it.

The list currently includes the jars needed for MinIO integration and for reading from and writing to Iceberg tables.

**Note:** *To download additional jars, add them to `JAR_MAPPING` in jars-downloader.sh, using the format `jar_name|maven_path`.*
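The `jar_name|maven_path` format can be illustrated with a short Python sketch. It only shows how such an entry splits into its two fields. The base URL (Maven Central) and the sample entry are assumptions for illustration, and the downloader script may resolve paths differently.

```python
from typing import NamedTuple

# Assumption: Maven Central as the base URL; the script may use another mirror.
MAVEN_BASE_URL = "https://repo1.maven.org/maven2"


class JarEntry(NamedTuple):
    jar_name: str
    maven_path: str


def parse_jar_mapping(entry: str) -> JarEntry:
    """Split one `jar_name|maven_path` entry into its two fields."""
    jar_name, sep, maven_path = entry.strip().partition("|")
    if not sep or not jar_name.strip() or not maven_path.strip():
        raise ValueError(f"expected 'jar_name|maven_path', got {entry!r}")
    return JarEntry(jar_name.strip(), maven_path.strip())


def download_url(entry: JarEntry, base_url: str = MAVEN_BASE_URL) -> str:
    """Join the base URL and the Maven path (hypothetical layout)."""
    return f"{base_url.rstrip('/')}/{entry.maven_path.lstrip('/')}"


# Hypothetical entry, for illustration only.
example = parse_jar_mapping(
    "hadoop-aws-3.3.4.jar|org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar"
)
print(example.jar_name)
print(download_url(example))
```

An entry without the `|` separator raises `ValueError`, which catches the most likely formatting mistake when adding a new jar.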

#### Building the Docker Images
---
```bash
make build-nc
```
Builds the images defined in the [Dockerfile](Dockerfile), which include:
- Spark 3.5 (for the master, worker, and Spark History Server)
- MinIO
- JupyterLab

#### Running the containers
---
```bash
make run
```
Starts all the services once the images are built.

This creates a Spark Standalone Cluster with *1 worker node*.

**To create a multi-worker cluster**, run:
```bash
make run-scaled
```

This creates a Spark Standalone Cluster with *3 worker nodes*. To change the number of worker nodes, update the number directly in the `Makefile` command.

Once all the services are up and running, they can be accessed locally at the following URLs:

- Spark Master: `localhost:9090`
- Spark History Server: `localhost:18080`
- MinIO UI: `localhost:9001`
- Jupyter Notebook Server: `localhost:8888`

Spark Master starts at `spark://spark-master:7077`
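As a rough sketch of how a PySpark session might attach to this master and talk to MinIO over the S3A connector, the snippet below assembles the relevant settings. The `fs.s3a.*` keys are standard Hadoop S3A properties. The endpoint `http://minio:9000`, the helper names, and the credentials are assumptions for illustration. The cluster images may already configure some of this, and the `spark-master` hostname presumably resolves only inside the Docker network (for example, from the JupyterLab container).

```python
SPARK_MASTER_URL = "spark://spark-master:7077"  # master URL stated above


def minio_spark_conf(access_key, secret_key, endpoint="http://minio:9000"):
    """Return Hadoop S3A settings that point Spark at an S3-compatible store.

    Assumption: `endpoint` is a guess at the MinIO API address inside the
    Docker network; adjust it to the service name and port actually in use.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is usually addressed as host/bucket rather than bucket.host.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Enable TLS only when the endpoint is https.
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(
            endpoint.startswith("https://")
        ).lower(),
    }


def build_session(conf, app_name="minio-smoke-test"):
    """Create a SparkSession on the standalone master with `conf` applied."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.master(SPARK_MASTER_URL).appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Placeholder credentials, for illustration only.
conf = minio_spark_conf("example-user", "example-password")
for key in sorted(conf):
    if "secret" not in key:
        print(key, "=", conf[key])
```

Calling `build_session(conf)` from a notebook would then return a session bound to the cluster. Keeping the settings in a plain dictionary makes them easy to inspect before any connection is attempted.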

#### Integration Testing and Submitting PySpark Scripts
---
The `spark_apps` folder contains scripts to test whether the integration works once everything is up and running.

- To test the integration, run:
```bash
make submit app=spark_minio_test.py
```

Once this runs successfully, open the MinIO UI, log in with the `ROOT_USER` and `ROOT_PASSWORD` values from the [.env file](.env), and you will see the data in the `warehouse` bucket.
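If the same credentials are needed in a script or notebook, a small parser for `KEY=VALUE` lines is one way to read them. This is a minimal sketch that assumes the `.env` file uses plain `KEY=VALUE` lines with optional quotes. The sample values are placeholders, not the ones in the repo.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Placeholder contents; the real .env in the repo holds the actual values.
sample = """
# MinIO credentials (hypothetical values)
ROOT_USER=example-user
ROOT_PASSWORD="example-password"
"""
creds = parse_env(sample)
print(creds["ROOT_USER"])
```

To read the real file, pass its contents instead of the sample, for example `parse_env(open(".env").read())`.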

- To submit a Spark job:
```bash
make submit app=pyfilename.py # the file must be in the spark_apps folder
```

#### To shut down everything
---
```bash
make down
```

### Data Mapping
---
To persist the data written via Spark:

- MinIO bucket data: the `minio_data` folder in the repo is mapped to the MinIO container volume.

- To read data files in jobs running on the Spark cluster, either upload them directly to MinIO buckets from the MinIO UI at `localhost:9001`, or place them in the `data` folder in the repo.
  - The `data` folder is mapped to `/opt/spark/data` in the Spark cluster.
  - To read data from there in Spark jobs: `spark.read.parquet('/opt/spark/data/file.parquet')`

> It's recommended to upload files to MinIO directly, which keeps reading data simple from both Spark jobs and the Jupyter notebooks.
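The two read locations described above can be captured in a pair of small path helpers. This is a sketch under assumptions: the `s3a://` scheme presumes the S3A connector is what the cluster uses for MinIO, and the helper names and the `raw/file.parquet` key are hypothetical.

```python
LOCAL_DATA_DIR = "/opt/spark/data"  # mount point stated above
DEFAULT_BUCKET = "warehouse"        # bucket named in this README


def local_data_path(filename):
    """Path of a file from the repo's `data` folder as seen inside the cluster."""
    return f"{LOCAL_DATA_DIR}/{filename.lstrip('/')}"


def minio_data_path(key, bucket=DEFAULT_BUCKET):
    """S3A URI of an object in a MinIO bucket (assumes the S3A connector)."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


print(local_data_path("file.parquet"))
print(minio_data_path("raw/file.parquet"))
```

Either result can be passed to `spark.read.parquet(...)`. The mounted path is presumably only visible to containers that have the `data` folder mapped, which is one reason reading from MinIO tends to be simpler.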
