# PySpark + MongoDB + SingleStore

Use Docker Compose to start the setup:

```shell
docker compose up
```

This will start a setup consisting of

- Spark Master (at [localhost:8090](http://localhost:8090/))
- Spark Worker with 2 CPUs and 4 GB RAM (at [localhost:8081](http://localhost:8081/))
- Spark Worker with 4 CPUs and 4 GB RAM (at [localhost:8082](http://localhost:8082/))
- Spark History Server (at [localhost:18081](http://localhost:18081/))

and

- Jupyter Lab (at [localhost:8888](http://127.0.0.1:8888/lab?token=5f69150501c3c0c4f94f5d4ae38123e2f556777f794bf48b))

Open JupyterLab [here](http://127.0.0.1:8888/lab?token=5f69150501c3c0c4f94f5d4ae38123e2f556777f794bf48b)
or connect to the Jupyter server at `127.0.0.1:8888` and use the following token:

```
5f69150501c3c0c4f94f5d4ae38123e2f556777f794bf48b
```

Use the [Aggregation Pipelines](notebooks/AggregationPipelines.ipynb) notebook
as a starting point.
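
For orientation, a notebook cell along these lines connects a session to the cluster and reads a collection through an aggregation pipeline. This is a sketch rather than the repository's code: the master URL, connection string, database, and collection names are illustrative assumptions, while the `mongodb` source format and the `aggregation.pipeline` read option come from the MongoDB Connector for Spark 10.x.

```python
from pyspark.sql import SparkSession

# Assumed service names on the Compose network; adjust to match docker-compose.yml.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("aggregation-pipelines")
    .config("spark.mongodb.read.connection.uri", "mongodb://mongodb:27017")
    .getOrCreate()
)

# Push an aggregation pipeline down to MongoDB and load the result as a DataFrame.
pipeline = '[{"$match": {"status": "active"}}, {"$project": {"_id": 0, "name": 1}}]'
df = (
    spark.read.format("mongodb")
    .option("database", "testdb")          # hypothetical database name
    .option("collection", "people")        # hypothetical collection name
    .option("aggregation.pipeline", pipeline)
    .load()
)

df.show()
```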

## About the Dockerfile

The [Dockerfile](spark/Dockerfile) (as used in [docker-compose.yml](docker-compose.yml))
provides three different Docker targets, namely `master`, `worker` and `jupyter`.
All three targets share the same `base` image, which consists of:

- [Spark 3.4.1] (Scala 2.12 + Hadoop 3.3) + PySpark 3.4.1 + [MongoDB Connector for Spark 10.2] + [SingleStore JDBC 1.1.9]
- Ubuntu 23.04 with Java/OpenJDK 17 and Python 3.11

Using the same base image for Jupyter Lab and Spark was the only way to get
this setup working: combining dedicated `master` and `worker` images with a
prebuilt PySpark image would consistently fail, either because JARs could not
be found or because serialization errors occurred when running PySpark programs.
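
Since the SingleStore JDBC driver is baked into that shared base image, Spark's generic JDBC source can reach SingleStore from any of the three targets without extra `--jars` or `--packages` flags. Below is a minimal sketch with a placeholder host, credentials, and table; the driver class `com.singlestore.jdbc.Driver` and the `jdbc:singlestore://` URL scheme come from the SingleStore JDBC client, so verify them against the driver version actually installed.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")   # assumed master URL, as above
    .appName("singlestore-jdbc")
    .getOrCreate()
)

# Generic Spark JDBC read; the SingleStore driver JAR is already on the classpath.
df = (
    spark.read.format("jdbc")
    .option("driver", "com.singlestore.jdbc.Driver")
    .option("url", "jdbc:singlestore://singlestore:3306/testdb")  # hypothetical host/database
    .option("dbtable", "people")                                  # hypothetical table
    .option("user", "root")
    .option("password", "password")
    .load()
)

df.show()
```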

[Spark 3.4.1]: https://spark.apache.org/downloads.html
[MongoDB Connector for Spark 10.2]: https://www.mongodb.com/docs/spark-connector/v10.2/
[SingleStore JDBC 1.1.9]: https://github.com/memsql/S2-JDBC-Connector/releases/tag/v1.1.9