https://github.com/fabioba/heart-disease-analysis

This project includes dwh development and ml pipeline to predict heart diseases.
airflow data-pipeline data-warehouse etl machine-learning mlflow mlops python sql

Host: GitHub
URL: https://github.com/fabioba/heart-disease-analysis
Owner: fabioba
Created: 2022-10-20T19:43:26.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-02-26T21:26:06.000Z (over 3 years ago)
Last Synced: 2025-03-23T03:41:25.533Z (over 1 year ago)
Topics: airflow, data-pipeline, data-warehouse, etl, machine-learning, mlflow, mlops, python, sql
Language: Python
Homepage:
Size: 1.32 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md

# HEART-DISEASE-ANALYSIS

## Table of contents

- [Business Context](#business_context)

- [Data Sources](#data_sources)

- [System Design](#system_design)

    * [System Design - Data Warehouse development](#system_design_dwh)

        - [System Design - Data Source](#system_design_data_source)

        - [System Design - Data Transformation/Loading](#system_design_data_transformation)

        - [System Design - Data Warehouse](#system_design_data_warehouse)

    * [System Design - Machine Learning Pipeline](#ml_pipeline)

- [Tech Stack](#tech_stack)

    * [Docker](#docker)

    * [Airflow](#airflow)

    * [MLflow](#mlflow)

    * [PostgreSQL](#postgresql)

- [References](#references)



## Business Context

The goal of this project is to analyze patients' heart data in order to predict whether they are likely to develop heart disease.



## Data Sources

[Here](https://www.kaggle.com/code/nairkarthik16/eda-and-prediction/data) you can find the source data of this project.



## System Design

The project is composed of two parts:

* `Data Warehouse` development

* `Machine Learning` pipeline

Both parts are implemented as `Airflow` DAGs, so each one is a sequence of tasks that together accomplish a goal.



## System Design - Data Warehouse development

![img](docs/imgs/etl_workflow.drawio.png)



### System Design - Data Source

The input data is stored locally, in a location that the Docker containers can access.



### System Design - Data Transformation/Loading

The transformation and loading operations are performed by the `etl_dag` script, which runs on `Airflow`.

This DAG extracts the data from local storage, transforms it, and loads it into `PostgreSQL`.

The `PostgreSQL` tables can be inspected from `PgAdmin`.

The `ETL` workflow on `Airflow` is shown below:

![img](docs/imgs/etl_dag.png)
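
The transformation step can be pictured as splitting each raw record into the rows of the three warehouse tables. The following is a minimal sketch of that idea, not the code of `etl_dag`: the `transform` function, the `acc_00001`-style surrogate key, and the two sample rows are assumptions made for illustration, since the README does not say how `account_id` is generated.

```python
import csv
import io

# Column groups, following the table schemas of the data warehouse.
ACCOUNT_COLS = ["age", "sex"]
DISEASE_COLS = ["cp", "trestbps", "chol", "fbs", "restecg", "thalach",
                "exang", "oldpeak", "slope", "ca", "thal", "target"]


def to_number(column, raw):
    """Cast a raw CSV string: `oldpeak` is a float, every other column is an int."""
    return float(raw) if column == "oldpeak" else int(raw)


def transform(rows):
    """Split raw rows into heart_fact, account_dim and heart_disease_dim records."""
    fact, account_dim, disease_dim = [], [], []
    for index, row in enumerate(rows, start=1):
        # Assumption: the raw file has no identifier, so a surrogate key is built here.
        account_id = f"acc_{index:05d}"
        values = {col: to_number(col, row[col]) for col in ACCOUNT_COLS + DISEASE_COLS}
        fact.append({"account_id": account_id, **values})
        account_dim.append(
            {"account_id": account_id, **{c: values[c] for c in ACCOUNT_COLS}})
        disease_dim.append(
            {"account_id": account_id, **{c: values[c] for c in DISEASE_COLS}})
    return fact, account_dim, disease_dim


# Illustrative sample rows (made up for this sketch, not taken from the dataset).
SAMPLE_CSV = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
fact, account_dim, disease_dim = transform(rows)
print(account_dim[0])
```

In a real DAG, a function like this would sit between the extract task and the load task, with the load task inserting each list into its table.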



### System Design - Data Warehouse

The project's Data Warehouse is stored in `PostgreSQL`.

The schemas of `heart_fact`, `heart_disease_dim` and `account_dim` are shown below.

```
-- Fact table: one row per account with demographics, measurements and the target label
CREATE TABLE IF NOT EXISTS heart_analysis.heart_fact(
    "account_id" varchar,
    "age" int,
    "sex" int,
    "cp" int,
    "trestbps" int,
    "chol" int,
    "fbs" int,
    "restecg" int,
    "thalach" int,
    "exang" int,
    "oldpeak" float,
    "slope" int,
    "ca" int,
    "thal" int,
    "target" int,
    PRIMARY KEY("account_id")
);

-- Dimension table: clinical measurements and the target label per account
CREATE TABLE IF NOT EXISTS heart_analysis.heart_disease_dim(
    "account_id" varchar,
    "cp" int,
    "trestbps" int,
    "chol" int,
    "fbs" int,
    "restecg" int,
    "thalach" int,
    "exang" int,
    "oldpeak" float,
    "slope" int,
    "ca" int,
    "thal" int,
    "target" int,
    PRIMARY KEY("account_id")
);

-- Dimension table: demographic attributes per account
CREATE TABLE IF NOT EXISTS heart_analysis.account_dim(
    "account_id" varchar,
    "age" int,
    "sex" int,
    PRIMARY KEY("account_id")
);
```

![alt](docs/imgs/er.drawio.png)
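
Since the two dimension tables share `account_id` as their key, they can be joined back into a single view of each account. The sketch below illustrates that join, with some loud assumptions: it uses Python's built-in `sqlite3` as a stand-in for `PostgreSQL` so that it runs without a database server, it keeps only a few columns of `heart_disease_dim` for brevity, and the inserted rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the PostgreSQL schema: attaching a second in-memory database
# under the name "heart_analysis" lets the schema-qualified table names work.
conn.execute("ATTACH DATABASE ':memory:' AS heart_analysis")

conn.executescript("""
CREATE TABLE IF NOT EXISTS heart_analysis.account_dim(
    "account_id" varchar,
    "age" int,
    "sex" int,
    PRIMARY KEY("account_id")
);
-- Reduced version of heart_disease_dim (only three measurement columns).
CREATE TABLE IF NOT EXISTS heart_analysis.heart_disease_dim(
    "account_id" varchar,
    "cp" int,
    "chol" int,
    "target" int,
    PRIMARY KEY("account_id")
);
""")

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO heart_analysis.account_dim VALUES (?, ?, ?)",
    [("acc_00001", 63, 1), ("acc_00002", 37, 1)],
)
conn.executemany(
    "INSERT INTO heart_analysis.heart_disease_dim VALUES (?, ?, ?, ?)",
    [("acc_00001", 3, 233, 1), ("acc_00002", 2, 250, 1)],
)

# Join the two dimensions on their shared key.
query = """
SELECT a.account_id, a.age, a.sex, d.chol, d.target
FROM heart_analysis.account_dim AS a
JOIN heart_analysis.heart_disease_dim AS d
  ON a.account_id = d.account_id
ORDER BY a.account_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Against the real warehouse, the same `SELECT` could be run from `PgAdmin` or through a PostgreSQL client library.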



## System Design - Machine Learning Pipeline

The `ML pipeline` on `Airflow` is shown below:

![img](docs/imgs/ml_workflow.png)



## Tech Stack



### Docker

Create a `docker-compose.yaml` file that runs `Airflow` and the supporting services, each in its own container:

* airflow-webserver

* airflow-scheduler

* airflow-worker

* airflow-triggerer

* mlflow server

* postgresql

* pgadmin

From a terminal, run the following command to start Airflow on port 8080:

```
docker compose up -d
```



### Airflow

After starting the Docker containers, open `localhost:8080` in a browser:

![img](docs/imgs/airflow_home.png)

Then log in to Airflow.

Populate the `dags` folder with all the DAGs needed for the project.

Before running any DAGs, establish a connection with PostgreSQL.
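
The README does not say how this connection is created. One option that Airflow supports is to define a connection as a URI in an `AIRFLOW_CONN_<CONN_ID>` environment variable, where the login and password must be URL-encoded. The sketch below builds such a URI. The connection id `heart_dwh`, the credentials, the database name and the `postgres` host name are all hypothetical placeholders, not values taken from the project.

```python
from urllib.parse import quote


def airflow_postgres_uri(user, password, host, port, database):
    """Build an Airflow connection URI, URL-encoding the parts that may hold special characters."""
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


# Hypothetical values: replace them with the ones from your docker-compose.yaml.
uri = airflow_postgres_uri("airflow", "p@ss/word", "postgres", 5432, "heart_db")

# Airflow reads connections from variables named AIRFLOW_CONN_<CONN_ID in upper case>.
print(f"AIRFLOW_CONN_HEART_DWH={uri}")
```

The same connection can also be created by hand from the Airflow web UI.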



### MLflow

The `docker-compose.yaml` includes the `mlflow` container in its `services` section.

This container runs the `MLflow server`, which is exposed on `localhost:600`.

![img](docs/imgs/mlflow_home.png)

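Open `example_dag.py` and set the URI of the MLflow server. From the host the server is reachable at `localhost:600`, while code running inside the containers reaches it through the `mlflow` host name:

```
import mlflow

mlflow.set_tracking_uri('http://mlflow:600')
```

After updating the URI of the MLflow server, create a new connection in `Airflow`.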

The experiment section of `MLflow` provides a table for comparing experiment runs:

![img](docs/imgs/mlflow_experiment.png)



### PostgreSQL

The `docker-compose.yaml` includes the `postgres` and `pgadmin` containers in its `services` section.

First, open `localhost:5050` and create a connection to `postgres`:

![img](docs/imgs/pg_admin.png)

Then, from the server section, you can monitor and query the tables.



## References

* [Airflow Docker](https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html)

* [MLOps deployment](https://towardsdatascience.com/ml-model-deployment-strategies-72044b3c1410)

* [Integrate MLFlow](https://medium.com/@kaanboke/step-by-step-mlflow-implementations-a9872dd32d9b)

* [Productionize on Docker](https://medium.com/cometheartbeat/create-an-mlops-pipeline-with-github-and-docker-hub-in-minutes-4a1515b6a551)

* [Setup PGAdmin](https://hevodata.com/learn/pgadmin-docker/)
