Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Datapark: a self-hosted data platform
https://github.com/msmenegol/datapark
airflow data data-engineering data-science jupyter-notebook machine-learning minio mlflow postgresql spark
Last synced: 29 days ago
JSON representation
Datapark: a self-hosted data platform
- Host: GitHub
- URL: https://github.com/msmenegol/datapark
- Owner: msmenegol
- License: mit
- Created: 2024-09-19T06:01:26.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-12-10T11:18:27.000Z (about 1 month ago)
- Last Synced: 2024-12-10T12:28:32.636Z (about 1 month ago)
- Topics: airflow, data, data-engineering, data-science, jupyter-notebook, machine-learning, minio, mlflow, postgresql, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 65.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# DATAPARK
Datapark is a self-hosted data platform for educational purposes. It consists of a collection of containerized services that allow the user to build solutions for data-related problems. To use them, you'll need to have [docker](https://docs.docker.com/) installed. In the [docker-compose](docker-compose.yaml) file you can find the following services:
- [jupyterlab](https://jupyter.org/): a JupyterLab server. This is where a developer can use notebooks to handle their data and prototype their solutions.
- [postgresql](https://www.postgresql.org/): a PostgreSQL database. It can be used for storing data, and other services, such as MinIO and MLflow, use it to store their metadata.
- [minio](https://min.io/): a MinIO storage service. It behaves similarly to S3 (AWS) and is the intended place for storing data.
- [mlflow](https://mlflow.org/): an MLflow tracking server to support machine learning tasks and applications.
- [spark](https://spark.apache.org/): the three Spark containers (one master and two workers) provide a Spark cluster that can be used for computing tasks.
- [airflow](https://airflow.apache.org/): the three Airflow containers (one for setting up, one for the web UI, and one for the scheduler) allow for the scheduling and monitoring of data workflows.

To use, simply clone this repository.
To run everything (on a Unix/WSL terminal):
```shell
docker compose up -d
```

To shut it down:
```shell
docker compose down
```

To access the different services in the browser:
- jupyterlab: http://localhost:8888
- minio: http://localhost:9001
- mlflow: http://localhost:8080
- airflow: http://localhost:8081
- spark: http://localhost:9090

You can find usernames and passwords for the different services in the [.env](.env) file. Please make sure you change those before use.
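As a rough illustration, a notebook could point the MLflow client at the tracking server listed above and log a run. This is only a minimal sketch, not code from this repository: the experiment name is hypothetical, and the tracking URI is the host-side URL; from inside another container on the Docker network you would use the MLflow service name instead.

```python
# Minimal sketch: logging a run to the MLflow tracking server.
# The tracking URI matches the host-side URL listed above; inside the
# Docker network the MLflow service name would be used instead.
import mlflow

mlflow.set_tracking_uri("http://localhost:8080")
mlflow.set_experiment("datapark-demo")  # hypothetical experiment name

with mlflow.start_run():
    # Record an example parameter and metric for this run.
    mlflow.log_param("model", "baseline")
    mlflow.log_metric("accuracy", 0.87)
```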
The platform has examples to help you use the different services from notebooks. There is also an example of how to build Airflow DAGs that run on Spark.
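For reference, a DAG that submits a job to Spark typically looks something like the sketch below. This is not the example shipped with this repository: the application path, connection id, and schedule are assumptions, and it presumes the `apache-airflow-providers-apache-spark` package is installed and a Spark connection is configured in Airflow.

```python
# Minimal sketch of an Airflow DAG that submits a Spark job.
# The application path, connection id, and schedule are assumptions,
# not values taken from this repository.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # trigger manually; assumes Airflow 2.4+
    catchup=False,
) as dag:
    submit_spark_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        application="/opt/airflow/dags/jobs/example_job.py",  # hypothetical path
        conn_id="spark_default",                               # assumed connection id
    )
```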
By default, notebooks are stored in `platform/jupyterlab/notebooks/` and DAGs can be found in `platform/airflow/dags`.
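As an example of what such a notebook might contain, the snippet below opens a Spark session against the cluster. This is a sketch under assumptions: the master URL (`spark://spark-master:7077`) and app name are placeholders, so check the docker-compose file for the actual service name and port.

```python
# Minimal sketch: connecting a JupyterLab notebook to the Spark cluster.
# The master URL below is an assumption; the real hostname and port are
# defined in this repository's docker-compose file.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datapark-example")          # hypothetical app name
    .master("spark://spark-master:7077")  # assumed service name and port
    .getOrCreate()
)

# Quick check that the session works: build a tiny DataFrame and show it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
```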