Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Datapark: a self-hosted data platform
https://github.com/msmenegol/datapark
airflow data data-engineering data-science jupyter-notebook machine-learning minio mlflow postgresql spark
Last synced: 29 days ago
JSON representation
Datapark: a self-hosted data platform
- Host: GitHub
- URL: https://github.com/msmenegol/datapark
- Owner: msmenegol
- License: mit
- Created: 2024-09-19T06:01:26.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-12-10T11:18:27.000Z (about 1 month ago)
- Last Synced: 2024-12-10T12:28:32.636Z (about 1 month ago)
- Topics: airflow, data, data-engineering, data-science, jupyter-notebook, machine-learning, minio, mlflow, postgresql, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 65.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# DATAPARK
Datapark is a self-hosted data platform for educational purposes. It consists of a collection of containerized services that allow the user to build solutions for data-related problems. To use them, you'll need to have [docker](https://docs.docker.com/) installed. In the [docker-compose](docker-compose.yaml) file you can find the following services:
- [jupyterlab](https://jupyter.org/): a JupyterLab server. This is where a developer can use notebooks to handle their data and prototype their solutions.
- [postgresql](https://www.postgresql.org/): a PostgreSQL database. It can be used for storing data, and other services, such as MinIO and MLflow, use it to store their metadata.
- [minio](https://min.io/): a MinIO storage service. It behaves similarly to S3 (AWS) and is the intended place for storing data.
- [mlflow](https://mlflow.org/): an MLflow tracking server to support machine learning tasks and applications.
- [spark](https://spark.apache.org/): the three Spark containers (one master and two workers) provide a Spark cluster that can be used for computing tasks.
- [airflow](https://airflow.apache.org/): the three Airflow containers (one for setting up, one for the web UI, and one for the scheduler) allow for the scheduling and monitoring of data workflows.

To use, simply clone this repository.
To run everything (on a Unix/WSL terminal):
```shell
docker compose up -d
```

To shut it down:
```shell
docker compose down
```

To access the different services in the browser:
- jupyterlab: http://localhost:8888
- minio: http://localhost:9001
- mlflow: http://localhost:8080
- airflow: http://localhost:8081
- spark: http://localhost:9090

You can find usernames and passwords for the different services in the [.env](.env) file. Please make sure you change those before use.
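As a rough illustration, a notebook could point the MLflow client at the tracking server listed above and log a run. This is only a minimal sketch, not code from this repository: the experiment name is hypothetical, and the tracking URI is the host-side URL; from inside another container on the Docker network you would use the MLflow service name instead.

```python
# Minimal sketch: logging a run to the MLflow tracking server.
# The tracking URI matches the host-side URL listed above; inside the
# Docker network the MLflow service name would be used instead.
import mlflow

mlflow.set_tracking_uri("http://localhost:8080")
mlflow.set_experiment("datapark-demo")  # hypothetical experiment name

with mlflow.start_run():
    # Record an example parameter and metric for this run.
    mlflow.log_param("model", "baseline")
    mlflow.log_metric("accuracy", 0.87)
```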
The platform has examples to help you use the different services from notebooks. There is also an example of how to build Airflow DAGs that run on Spark.
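For reference, a DAG that submits a job to Spark typically looks something like the sketch below. This is not the example shipped with this repository: the application path, connection id, and schedule are assumptions, and it presumes the `apache-airflow-providers-apache-spark` package is installed and a Spark connection is configured in Airflow.

```python
# Minimal sketch of an Airflow DAG that submits a Spark job.
# The application path, connection id, and schedule are assumptions,
# not values taken from this repository.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # trigger manually; assumes Airflow 2.4+
    catchup=False,
) as dag:
    submit_spark_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        application="/opt/airflow/dags/jobs/example_job.py",  # hypothetical path
        conn_id="spark_default",                               # assumed connection id
    )
```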
By default, notebooks are stored in `platform/jupyterlab/notebooks/` and DAGs can be found in `platform/airflow/dags`.
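As an example of what such a notebook might contain, the snippet below opens a Spark session against the cluster. This is a sketch under assumptions: the master URL (`spark://spark-master:7077`) and app name are placeholders, so check the docker-compose file for the actual service name and port.

```python
# Minimal sketch: connecting a JupyterLab notebook to the Spark cluster.
# The master URL below is an assumption; the real hostname and port are
# defined in this repository's docker-compose file.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datapark-example")          # hypothetical app name
    .master("spark://spark-master:7077")  # assumed service name and port
    .getOrCreate()
)

# Quick check that the session works: build a tiny DataFrame and show it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
```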