## MALL SHOPPING DATA ANALYSIS DATA PIPELINE

![dbt](https://img.shields.io/badge/dbt-FF694B?logo=dbt&logoColor=fff&style=for-the-badge)
![aws](https://img.shields.io/badge/Amazon%20AWS-232F3E?logo=amazonaws&logoColor=fff&style=for-the-badge)
![awsS3](https://img.shields.io/badge/Amazon%20S3-569A31?logo=amazons3&logoColor=fff&style=for-the-badge)
![streamlit](https://img.shields.io/badge/Streamlit-FF4B4B?logo=streamlit&logoColor=fff&style=for-the-badge)
![Duckdb](https://img.shields.io/badge/DuckDB-FFF000?logo=duckdb&logoColor=000&style=for-the-badge)
![sql](https://img.shields.io/badge/SQLite-003B57?logo=sqlite&logoColor=fff&style=for-the-badge)
![Docker](https://img.shields.io/badge/Docker-2496ED?logo=docker&logoColor=fff&style=for-the-badge)

### Project Overview

ETL and ELT data pipelines offer simplicity and automation for organizations that use cloud storage, where data scales exponentially. They make data readily available for analysts and for modelling, and automating them enables real-time tracking of a business's important `Key Performance Indicators`. Exponentially growing data can also bring storage constraints, which embedded databases solve efficiently.

### Objectives

1. Develop and deploy an efficient `ETL Data Pipeline` that extracts data from an `AWS S3` bucket using the [data load tool](https://dlthub.com/), `transforms` it, and finally `loads` it into an embedded [DuckDB](https://duckdb.org/docs/index) database.

2. Prepare data for analysis using queries, which can be found [here](customersales/models). The queries seek to find:

* `Average age of all shoppers across various shopping malls.`

* `Average expenditure of each category of items by gender.`

* `Most profitable shopping malls by year and month.`

* `Shopping patterns across various malls.`

The extracted data from the queries can be found [here](customersales/seeds).
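
As an illustration, the first of these aggregations can be run directly against the DuckDB file produced by the pipeline. This is a minimal sketch: the database file, table, and column names below are assumptions for illustration, not the project's actual identifiers.

```python
import duckdb

# Connect to the DuckDB file produced by the pipeline
# (file, table, and column names here are illustrative assumptions).
con = duckdb.connect("mall.duckdb")

# Average age of all shoppers across the various shopping malls.
avg_age = con.sql("""
    SELECT shopping_mall, ROUND(AVG(age), 1) AS average_age
    FROM customer_shopping_data
    GROUP BY shopping_mall
    ORDER BY average_age DESC
""").df()
print(avg_age)
```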

### Data Extraction

The `Data Load Tool (dlt)` open-source library offers very efficient data pipelines that can consume from various data sources, including `Google Sheets`, `cloud storage services`, `APIs` and `scraping scripts`. The tool comes in handy for `e-commerce` organizations because it can track streaming data in real time, and it offers scalability and customization depending on the needs at hand.

Data for this project was ingested from an `AWS S3` bucket. To get started with setting up the credentials that allow third-party applications to access your data, the [dlt documentation](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#aws-s3) provides a clear step-by-step process for the setup.

After setting up everything, run the following commands in order to get started with `dlt`:

1. `pip install -r requirements.txt`

2. `dlt init filesystem duckdb`

3. To run your pipeline, run `python filesystem.py` in the terminal. This creates a DuckDB file in the working directory which contains the data ingested from the source (a sketch of such a script appears below).

4. To interact with the data, run `dlt pipeline filesystem show`, which opens Streamlit in the local browser.

For a more complete and extensive approach, the [data sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) documentation provides more information on how to connect to several data sources.
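
As a rough sketch, a trimmed-down `filesystem.py` might look like the following. It assumes dlt's built-in filesystem source; the bucket URL, dataset name, and resource name are placeholders, not this project's actual values.

```python
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# Read CSV files from the S3 bucket; the bucket URL is a placeholder
# and credentials are read from .dlt/secrets.toml as set up above.
files = filesystem(bucket_url="s3://my-mall-bucket/", file_glob="*.csv")
reader = (files | read_csv()).with_name("customer_shopping_data")

# Load the extracted rows into an embedded DuckDB database.
pipeline = dlt.pipeline(
    pipeline_name="filesystem",
    destination="duckdb",
    dataset_name="mall_shopping",
)
info = pipeline.run(reader)
print(info)
```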

### Data Transformation

Transformation is necessary because data needs to be error-free for the data team to extract meaningful insights from it, faster than the old traditional formats allowed. Transformation is done using the `Data Build Tool (dbt)`, an open-source platform that offers great flexibility for transforming data within data pipelines.

To open the `Streamlit dashboard` for a sneak peek at the data and to run simple queries, run `dlt pipeline filesystem show`.

dbt leverages `Jinja` templating to simplify the process of writing SQL queries that handle the various data objectives. It implements the `DRY` (`Don't Repeat Yourself`) approach, which allows `sql tables and queries` to be inherited and reused across models.
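
For instance, the project's dbt models could be run programmatically; dbt-core 1.5+ exposes a Python runner for this, and the project directory below is assumed to be `customersales`.

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult

# Invoke `dbt run` on the project's models programmatically.
dbt = dbtRunner()
res: dbtRunnerResult = dbt.invoke(["run", "--project-dir", "customersales"])

# Inspect the status of each model that was run.
for r in res.result:
    print(f"{r.node.name}: {r.status}")
```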

### Data Loading

`DuckDB` is an embedded database with strong computational power for handling big datasets, and it can create them on the fly without necessarily loading them into an `RDBMS`. It also offers compression of big files, which comes in handy in the event of storage constraints.

It can also be connected to visualization and dashboarding tools after the required driver connections are installed.
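
For example, DuckDB can build a table straight from a raw file without a database server; the file and table names here are assumptions for illustration.

```python
import duckdb

# Create a table on the fly from a raw CSV file, no RDBMS required
# (file and table names are illustrative assumptions).
con = duckdb.connect("mall.duckdb")
con.sql("""
    CREATE OR REPLACE TABLE sales AS
    SELECT * FROM read_csv_auto('customer_shopping_data.csv')
""")

# The table lives in a single compressed database file on disk.
print(con.sql("SELECT COUNT(*) AS row_count FROM sales").fetchone())
```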

### Streamlit dashboard

To create an interactive end-to-end project, a Streamlit dashboard provides an efficient `User Interaction & Experience` thanks to its simple, navigable page. The dashboard has filters for:

1. `Shopping mall`

2. `Year`

3. `Month`

4. `Product category`

5. `Gender` & `Payment method`

With each filter, the visuals adjust accordingly. The application can be found [here](https://dtl-dbt-pipeline-gyz2smtjzi8bdoqqfptsi8.streamlit.app/).
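
A minimal sketch of one such filter wired to DuckDB might look like this; the table, column, and file names are assumptions, not the app's actual code.

```python
import duckdb
import streamlit as st

# Open the pipeline's DuckDB file read-only (all names are illustrative).
con = duckdb.connect("mall.duckdb", read_only=True)
malls = [m for (m,) in con.sql("SELECT DISTINCT shopping_mall FROM sales").fetchall()]

# One dashboard filter; the chart below reacts to the selection.
mall = st.selectbox("Shopping mall", malls)
revenue = con.execute(
    """
    SELECT category, SUM(price * quantity) AS revenue
    FROM sales
    WHERE shopping_mall = ?
    GROUP BY category
    """,
    [mall],
).df()
st.bar_chart(revenue.set_index("category"))
```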

### Future steps

Integrate `authorization` and a `login page` to ensure different users access different content and analyses within the application.