{"id":22845199,"url":"https://github.com/msmenegol/datapark","last_synced_at":"2026-02-19T13:01:03.125Z","repository":{"id":267447282,"uuid":"859697284","full_name":"msmenegol/datapark","owner":"msmenegol","description":"Datapark: a self-hosted data platform","archived":false,"fork":false,"pushed_at":"2025-02-06T13:07:33.000Z","size":69,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-10T14:45:53.421Z","etag":null,"topics":["airflow","data","data-engineering","data-science","jupyter-notebook","machine-learning","minio","mlflow","postgresql","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msmenegol.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-09-19T06:01:26.000Z","updated_at":"2025-02-06T13:07:37.000Z","dependencies_parsed_at":"2025-05-07T02:52:11.240Z","dependency_job_id":null,"html_url":"https://github.com/msmenegol/datapark","commit_stats":null,"previous_names":["msmenegol/datapark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/msmenegol/datapark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msmenegol%2Fdatapark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msmenegol%2Fdatapark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msmenegol%2Fdatapark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
msmenegol%2Fdatapark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msmenegol","download_url":"https://codeload.github.com/msmenegol/datapark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msmenegol%2Fdatapark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29614585,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-19T10:52:55.328Z","status":"ssl_error","status_checked_at":"2026-02-19T10:52:26.323Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","data","data-engineering","data-science","jupyter-notebook","machine-learning","minio","mlflow","postgresql","spark"],"created_at":"2024-12-13T03:16:09.433Z","updated_at":"2026-02-19T13:01:03.105Z","avatar_url":"https://github.com/msmenegol.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DATAPARK\n\nDatapark is a self-hosted data platform for educational purposes. It consists of a collection of containerized services that allow the user to build solutions for data-related problems. To use them, you'll need to have [docker](https://docs.docker.com/) installed. 
In the [docker-compose](docker-compose.yaml) file you can find the following services:\n\n- [jupyterlab](https://jupyter.org/): a JupyterLab server. This is where a developer should be able to use notebooks for handling their data and prototyping their solution.\n- [postgresql](https://www.postgresql.org/): a PostgreSQL database. This can be used for storing data. It's also used by other services, such as Minio and MLFlow, to store their metadata.\n- [minio](https://min.io/): a Minio storage service. It behaves similarly to S3 (AWS). This is the intended place for storing data.\n- [mlflow](https://mlflow.org/): an MLFlow tracking server to support machine learning tasks and applications.\n- [spark](https://spark.apache.org/): the 3 Spark containers (one master and two workers) provide a Spark cluster that can be used for computing tasks.\n- [airflow](https://airflow.apache.org/): the 3 Airflow containers (one for setting up, one for the web-ui, and one for the scheduler) allow for the scheduling and monitoring of data workflows.\n\nTo use, simply clone this repository.\nTo run everything (on a Unix/WSL terminal):\n```shell\ndocker compose up -d\n```\n\nTo shut it down:\n```shell\ndocker compose down\n```\n\nTo access the different services in the browser:\n- jupyterlab: http://localhost:8888\n- minio: http://localhost:9001\n- mlflow: http://localhost:8080\n- airflow: http://localhost:8081\n- spark: http://localhost:9090\n\nYou can find usernames and passwords for the different services in the [.env](.env) file. Please make sure you change them before use, especially the passwords. I've had some issues with the terminal loading the variables from the .env file and passing them on to `docker compose`, which then doesn't reload the contents of the file. Therefore, if you make changes to it, I suggest restarting the services from a fresh terminal.\nThe platform has examples to help you use the different services from notebooks. 
There is also an example on how to build Airflow DAGs that run on Spark.\nBy default, notebooks are stored in `platform/jupyterlab/notebooks/` and DAGs can be found in `platform/airflow/dags`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsmenegol%2Fdatapark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsmenegol%2Fdatapark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsmenegol%2Fdatapark/lists"}