https://github.com/KimaruThagna/ml-pipelines-airflow
Demonstrating and Building ML pipelines in Airflow
- Host: GitHub
- URL: https://github.com/KimaruThagna/ml-pipelines-airflow
- Owner: KimaruThagna
- Created: 2021-05-29T16:54:41.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-06-12T19:04:21.000Z (over 3 years ago)
- Last Synced: 2024-08-03T02:03:37.484Z (6 months ago)
- Language: Python
- Size: 72.3 KB
- Stars: 10
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-apache-airflow - ETL with Apache Airflow for Data Analysis on Transaction Data covers a practical case of running an ETL process with Apache Airflow on a dummy ecommerce store's transactional, user, and product data. The data is served via a Flask API. (Introductions and tutorials)
README
# ETL-pipelines-airflow
Demonstrating and Building ETL pipelines in Airflow
This repo demonstrates a use case for an ecommerce business whose platform generates transaction data each time a purchase is made. Using this transaction data, the functions in the pipeline seek to answer 3 business questions:
1. Who is our platinum customer? Anyone with a purchase value equal to or more than 5000
2. What is the purchase history like for each user? This builds a dataset that can be used for a *recommendation engine* downstream
3. What items are commonly purchased together? This builds a dataset that can be used for *Basket Analysis* downstream

## Analysis Implementation
The code can be found in the `etl_utils.py` file.
Question 1 is implemented using `pd.merge()` to build the combined dataset and `df.groupby().sum()` to compute total purchases per customer. To get the platinum customers, we apply a filter:
`final_df = df.loc[df['total_purchase_value']>=10000]`
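The merge, group, and filter steps for question 1 can be sketched as follows. This is a minimal illustration, not the code from `etl_utils.py`: the sample data, the `user_id` and `name` columns, and the `find_platinum_customers` helper are assumptions made for the example. The threshold is passed as a parameter, since the text above cites both 5000 and 10000.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions, not the repo's schema.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
transactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "total_purchase_value": [3000, 2500, 1200, 4000, 1000],
})


def find_platinum_customers(users, transactions, threshold):
    """Return users whose summed purchase value is at least `threshold`."""
    # Combine user details with their transactions.
    combined = pd.merge(users, transactions, on="user_id")
    # Sum purchase value per user.
    totals = combined.groupby(["user_id", "name"], as_index=False)[
        "total_purchase_value"
    ].sum()
    # Keep only users at or above the threshold.
    return totals.loc[totals["total_purchase_value"] >= threshold]


platinum = find_platinum_customers(users, transactions, threshold=5000)
print(platinum)
```

With this sample data, users 1 and 3 qualify at a threshold of 5000, and nobody qualifies at 10000.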
Both questions 2 and 3 are answered using Pandas **Pivot Tables** via `pd.pivot_table()`.
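As a rough sketch of how pivot tables can produce both datasets, the example below builds a user-by-product matrix for question 2 and a transaction-by-product matrix for question 3. The `purchases` frame and its column names are hypothetical, and the repo's actual functions may shape the output differently.

```python
import pandas as pd

# Hypothetical line-item data; column names are assumptions for illustration.
purchases = pd.DataFrame({
    "transaction_id": [100, 100, 101, 102, 102],
    "user_id": [1, 1, 2, 1, 1],
    "product": ["milk", "bread", "milk", "bread", "eggs"],
    "quantity": [1, 2, 1, 1, 6],
})

# Question 2: one row per user, one column per product, total quantity bought.
purchase_history = pd.pivot_table(
    purchases,
    index="user_id",
    columns="product",
    values="quantity",
    aggfunc="sum",
    fill_value=0,
)

# Question 3: one row per transaction, one column per product,
# with 1 if the product appears in that basket and 0 otherwise.
baskets = pd.pivot_table(
    purchases,
    index="transaction_id",
    columns="product",
    values="quantity",
    aggfunc="count",
    fill_value=0,
)
baskets = (baskets > 0).astype(int)

print(purchase_history)
print(baskets)
```

The 0/1 basket matrix is the shape that basket-analysis tools typically expect as input, while the quantity matrix can serve as a simple user-item table for a recommender.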
# Generated Data
The sample CSVs generated by the above functions can be found in the `samples/` folder.

# Airflow Connections
I have used the `PostgresOperator`, which requires a Postgres connection, and the `SimpleHttpOperator`, which requires an HTTP connection. Both connections are configured in the Airflow UI under Admin > Connections.
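Besides the UI, Airflow can also read connections from environment variables named `AIRFLOW_CONN_<CONN_ID>` whose values are connection URIs. The sketch below only builds such URIs with the standard library. The connection ids (`ecommerce_db`, `ecommerce_api`), hosts, ports, and credentials are hypothetical placeholders, not values from this repo.

```python
from urllib.parse import quote


def connection_uri(scheme, host, port=None, login=None, password=None, schema=""):
    """Build a connection URI of the form scheme://login:password@host:port/schema."""
    auth = ""
    if login:
        # Percent-encode credentials so characters such as '@' stay unambiguous.
        auth = quote(login, safe="")
        if password:
            auth += ":" + quote(password, safe="")
        auth += "@"
    netloc = host if port is None else f"{host}:{port}"
    return f"{scheme}://{auth}{netloc}/{schema}"


# Hypothetical connection ids and credentials, for illustration only.
env = {
    "AIRFLOW_CONN_ECOMMERCE_DB": connection_uri(
        "postgres", "localhost", 5432, "airflow", "p@ss", "ecommerce"
    ),
    "AIRFLOW_CONN_ECOMMERCE_API": connection_uri("http", "localhost", 5000),
}

for name, uri in env.items():
    # Print shell export lines to set where the scheduler and workers run.
    print(f"export {name}='{uri}'")
```

Tasks would then refer to these connections by their lowercase ids, for example through the `postgres_conn_id` and `http_conn_id` arguments of the two operators.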