Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/KimaruThagna/ml-pipelines-airflow

Demonstrating and Building ML pipelines in Airflow
https://github.com/KimaruThagna/ml-pipelines-airflow

Last synced: 15 days ago
JSON representation

Demonstrating and Building ML pipelines in Airflow

Lists

README

        

# ETL-pipelines-airflow
Demonstrating and Building ETL pipelines in Airflow
This repo demonstrates a use case for n Ecommerce business that has a platform that generates transaction data each time a purchase is made. With this transaction data, the functions in the pipeline seek to answer 3 business questions

1. Who is our platinum customer? Anyone with purchase value equal to or more than 5000
2. What is the purchase history like for each user? This builds a dataset that can be used for a *recommendation engine* downstream
3. What items are commonly purchased together? This builds a dataset that can be used for *Basket Analysis* downstream

## Analysis Implementation
The code can be found in `etl_utils.py` file.
Question 1 is implemented using `pd.merge()` to get the combined dataset and `df.groupby().sum()` to get total purchases.

To get the platinum customer, we apply a filter

`final_df = df.loc[df['total_purchase_value']>=10000]`

Both question 2 and 3 are achieved using Pandas **Pivot Tables** `pd.pivtot_table()`

# Generated Data
The sample CSVs generated by the above functions can be found in the `samples/` folder.

# Airflow Connections
I have used the PostgresOperator which requires a postgres connection and a SimpleHttpOperator which requires a http connection. This is set in the Admin UI in the connections tab.