{"id":24750811,"url":"https://github.com/fabioba/heart-disease-analysis","last_synced_at":"2026-05-10T03:12:22.477Z","repository":{"id":231612415,"uuid":"555009500","full_name":"fabioba/heart-disease-analysis","owner":"fabioba","description":"This project includes dwh development and ml pipeline to predict heart diseases.","archived":false,"fork":false,"pushed_at":"2023-02-26T21:26:06.000Z","size":1381,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-23T03:41:25.533Z","etag":null,"topics":["airflow","data-pipeline","data-warehouse","etl","machine-learning","mlflow","mlops","python","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fabioba.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-10-20T19:43:26.000Z","updated_at":"2022-12-13T22:29:46.000Z","dependencies_parsed_at":"2024-04-04T22:45:22.146Z","dependency_job_id":null,"html_url":"https://github.com/fabioba/heart-disease-analysis","commit_stats":null,"previous_names":["fabioba/heart-disease-analysis"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/fabioba/heart-disease-analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fabioba%2Fheart-disease-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fabioba%2Fheart-disease-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fabioba%2Fheart-disease-analysis/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/fabioba%2Fheart-disease-analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fabioba","download_url":"https://codeload.github.com/fabioba/heart-disease-analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fabioba%2Fheart-disease-analysis/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260301995,"owners_count":22988719,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","data-pipeline","data-warehouse","etl","machine-learning","mlflow","mlops","python","sql"],"created_at":"2025-01-28T09:09:01.527Z","updated_at":"2026-05-10T03:12:22.432Z","avatar_url":"https://github.com/fabioba.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HEART-DISEASE-ANALYSIS\n\n\n## Table of contents\n- [Business Context](#business_context)\n- [Data Sources](#data_sources)\n- [System Design](#system_design)\n    * [System Design - Data Warehouse development](#system_design_dwh)\n        - [System Design - Data Source](#system_design_data_source)\n        - [System Design - Data Transformation/Loading](#system_design_data_transformation)\n        - [System Design - Data Warehouse](#system_design_data_warehouse)\n    * [System Design - Machine Learning Pipeline](#ml_pipeline)\n- [Tech Stack](#tech_stack)\n    * [Docker](#docker)\n    * [Airflow](#airflow)\n    * [MLFlow](#mlflow)\n    * [PostgreSQL](#postgresql)\n- [References](#references)\n\n\n\u003ca name=\"business_context\"/\u003e\n\n## Business Context\nThe goal of this project is to analyze heart data in order to predict potential future heart disease.\n\n\u003ca name=\"data_sources\"/\u003e\n\n## Data Sources\nThe source data for this project is available [on Kaggle](https://www.kaggle.com/code/nairkarthik16/eda-and-prediction/data).\n\n\u003ca name=\"system_design\"/\u003e\n\n## System Design\nThe project consists of two parts:\n* `Data Warehouse` development\n* `Machine Learning` pipeline\n\nBoth parts are implemented as `Airflow` DAGs, so each one is a sequence of tasks that together accomplish a goal.\n\n\u003ca name=\"system_design_dwh\"/\u003e\n\n## System Design - Data Warehouse development\n![img](docs/imgs/etl_workflow.drawio.png)\n\n\u003ca name=\"system_design_data_source\"/\u003e\n\n### System Design - Data Source\nThe input data is stored locally so that it is accessible from the Docker containers.\n\n\u003ca name=\"system_design_data_transformation\"/\u003e\n\n### System Design - Data Transformation/Loading\nThe transformation and loading operations are performed by the `etl_dag` script, which runs on Airflow.\n\nThis DAG extracts the data from local storage, transforms it, and loads it into `PostgreSQL`.\n\nThe `PostgreSQL` tables can be reviewed from `PgAdmin`.\nThe `ETL` workflow on `Airflow` is shown below:\n![img](docs/imgs/etl_dag.png)\n\n\n\u003ca name=\"system_design_data_warehouse\"/\u003e\n\n### System Design - Data Warehouse\nThe project's Data Warehouse is stored in PostgreSQL.\n\nThe schemas of `heart_fact`, `heart_disease_dim` and `account_dim` are shown below.\n```\nCREATE TABLE IF NOT EXISTS heart_analysis.heart_fact(\n    \"account_id\" varchar,\n    \"age\" int,\n    \"sex\" int,\n    \"cp\" int,\n    \"trestbps\" int,\n    \"chol\" int,\n    \"fbs\" int,\n    \"restecg\" int,\n    \"thalach\" int,\n    \"exang\" int,\n    \"oldpeak\" float,\n    \"slope\" int,\n    \"ca\" int,\n    \"thal\" int,\n    
\"target\" int,\n    PRIMARY KEY(\"account_id\")\n);\n\nCREATE TABLE IF NOT EXISTS heart_analysis.heart_disease_dim(\n    \"account_id\" varchar,\n    \"cp\" int,\n    \"trestbps\" int,\n    \"chol\" int,\n    \"fbs\" int,\n    \"restecg\" int,\n    \"thalach\" int,\n    \"exang\" int,\n    \"oldpeak\" float,\n    \"slope\" int,\n    \"ca\" int,\n    \"thal\" int,\n    \"target\" int,\n    PRIMARY KEY(\"account_id\")\n);\n\nCREATE TABLE IF NOT EXISTS heart_analysis.account_dim(\n    \"account_id\" varchar,\n    \"age\" int,\n    \"sex\" int,\n    PRIMARY KEY(\"account_id\")\n);\n```\n![alt](docs/imgs/er.drawio.png)\n\n\n\u003ca name=\"ml_pipeline\"/\u003e\n\n## System Design - Machine Learning Pipeline\n\nThe `ML pipeline` on `Airflow` is shown below:\n![img](docs/imgs/ml_workflow.png)\n\n\u003ca name=\"tech_stack\"/\u003e\n\n## Tech Stack\n\n\u003ca name=\"docker\"/\u003e\n\n### Docker\nCreate a `docker-compose.yaml` file that runs the `Airflow` components and the supporting services, each in its own container:\n* airflow-webserver\n* airflow-scheduler\n* airflow-worker\n* airflow-triggerer\n* mlflow server\n* postgresql\n* pgadmin\n\nFrom a terminal, run the following command to start the containers, with Airflow served on port 8080:\n```\ndocker compose up -d\n```\n\n\u003ca name=\"airflow\"/\u003e\n\n### Airflow\nOnce the containers are running, open `localhost:8080`:\n![img](docs/imgs/airflow_home.png)\n\nThen log in to the Airflow world!\n\nPopulate the `dags` folder with all the DAGs needed for the project.\nBefore running any DAG, create a connection to PostgreSQL.\n\n\u003ca name=\"mlflow\"/\u003e\n\n### MLFlow\nThe `docker-compose.yaml` file includes the `mlflow` container in its `services` section.\nThis container runs the `MLFlow server`, which is exposed at `localhost:600`.\n![img](docs/imgs/mlflow_home.png)\n\nOpen `example_dag.py` and set the tracking URI of the MLFlow server (`localhost:600` from the host, `mlflow:600` from the other containers):\n```\nmlflow.set_tracking_uri('http://mlflow:600')\n```\n\nAfter updating the URI of the MLFlow server, create a new connection in `Airflow`.\nThe experiments section of `MLFlow` provides a table for comparing experiments:\n![img](docs/imgs/mlflow_experiment.png)\n\n\u003ca name=\"postgresql\"/\u003e\n\n### PostgreSQL\nThe `docker-compose.yaml` file includes the `postgres` and `pgadmin` containers in its `services` section.\nFirst, open `localhost:5050` and create a connection to `postgres`:\n![img](docs/imgs/pg_admin.png)\n\nThen, from the servers section, you can monitor and query the tables.\n\n\u003ca name=\"references\"/\u003e\n\n## References\n* [Airflow Docker](https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html)\n* [MLOps deployment](https://towardsdatascience.com/ml-model-deployment-strategies-72044b3c1410)\n* [Integrate MLFlow](https://medium.com/@kaanboke/step-by-step-mlflow-implementations-a9872dd32d9b)\n* [Productionize on Docker](https://medium.com/cometheartbeat/create-an-mlops-pipeline-with-github-and-docker-hub-in-minutes-4a1515b6a551)\n* [Setup PGAdmin](https://hevodata.com/learn/pgadmin-docker/)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffabioba%2Fheart-disease-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffabioba%2Fheart-disease-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffabioba%2Fheart-disease-analysis/lists"}