Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/longnguyen010203/ecommerce-elt-pipeline
A Data Engineering project that implements an ELT data pipeline using Dagster, Docker, dbt, Polars, Snowflake, and PostgreSQL. Data is sourced from the Kaggle website.
dagster data data-engineering dbt docker docker-compose dockerfile elt elt-pipeline extract kaggle load polars postgresql raw-data relational-databases snowflake transform
Last synced: 30 days ago
- Host: GitHub
- URL: https://github.com/longnguyen010203/ecommerce-elt-pipeline
- Owner: longNguyen010203
- License: mit
- Created: 2024-06-10T16:36:24.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-06-14T18:23:17.000Z (6 months ago)
- Last Synced: 2024-06-15T03:00:08.770Z (6 months ago)
- Topics: dagster, data, data-engineering, dbt, docker, docker-compose, dockerfile, elt, elt-pipeline, extract, kaggle, load, polars, postgresql, raw-data, relational-databases, snowflake, transform
- Language: Python
- Homepage:
- Size: 6.84 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# ECommerce-ELT-Pipeline
In this project, I build a simple data pipeline following the ELT (extract, load, transform) model using the Brazilian E-Commerce dataset. The data is processed and transformed to serve reporting, in-depth analysis, and the Data Analyst team.
## About Project
#### 1. Pipeline Design
- **Data Source**: The project uses the [Brazilian Ecommerce](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) public dataset by Olist, downloaded from [kaggle.com](https://www.kaggle.com) in `.csv` format.
  - Five of the CSV files are loaded into `PostgreSQL`, which is treated as a data source.
  - The remaining four CSV files are extracted directly.
- **Extract Data**: Data is extracted into `Polars` `DataFrame`s from both the `PostgreSQL` database and the `CSV` files (see the sketch after this list).
- **Load Data**: After extracting data from these two sources, the `Polars` `DataFrame`s are loaded into the `raw` layer in `Snowflake`.
- **Transform Data**: After loading, the data is transformed with `dbt` on `Snowflake` to create `dimension` and `fact` tables in the `staging` layer and to calculate aggregates in the `mart` layer.
- **Serving**: Data is served for `reporting`, `analysis`, and `decision support` using `Metabase` and `Apache Superset`.
- **Packaging and Orchestration**: The entire project is packaged and orchestrated with `Docker` and `Dagster`.
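
As a rough illustration of the extract step, here is a minimal Polars sketch. The connection URI, table name, and file path are assumptions, not the project's actual configuration, and depending on your Polars version the database read may be `pl.read_database` with a connection object instead.

```python
import polars as pl

# Hypothetical connection URI and paths -- adjust to your own environment.
POSTGRES_URI = "postgresql://admin:admin@localhost:5432/ecommerce"

# Extract a table that was seeded into PostgreSQL
# (requires a connector backend such as connectorx or ADBC).
orders = pl.read_database_uri(
    query="SELECT * FROM olist_orders_dataset",
    uri=POSTGRES_URI,
)

# Extract one of the CSV files that is read directly from disk.
reviews = pl.read_csv("data/olist_order_reviews_dataset.csv")

print(orders.shape, reviews.shape)
```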
#### 2. Data File Relationships
- **olist_geolocation_dataset**: This dataset contains Brazilian zip codes and their lat/lng coordinates.
- **olist_customers_dataset**: This dataset has information about the customer and their location.
- **olist_order_items_dataset**: This dataset includes data about the items purchased within each order.
- **olist_order_payments_dataset**: This dataset includes data about the orders' payment options.
- **olist_order_reviews_dataset**: This dataset includes data about the reviews made by the customers.
- **olist_orders_dataset**: This is the core dataset. From each order you can reach all of the other information (see the join sketch after this list).
- **olist_products_dataset**: This dataset includes data about the products sold by Olist.
- **olist_sellers_dataset**: This dataset includes data about the sellers that fulfilled orders made at Olist.
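
To make the key relationships concrete, here is a small Polars join sketch. The file paths are assumptions; the join keys (`order_id`, `customer_id`) are the shared columns in the Olist dataset.

```python
import polars as pl

# Hypothetical local paths to the Kaggle CSV files.
orders = pl.read_csv("data/olist_orders_dataset.csv")
items = pl.read_csv("data/olist_order_items_dataset.csv")
customers = pl.read_csv("data/olist_customers_dataset.csv")

# Starting from an order, the other datasets are reachable through shared keys.
order_details = (
    orders
    .join(items, on="order_id", how="inner")        # order -> purchased items
    .join(customers, on="customer_id", how="left")  # order -> customer / location
)

print(order_details.columns)
```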
#### 3. Data Lineage
The lineage graph (Dagster) in this project consists of four layers (a minimal asset sketch follows this list):
- **source layer**: This layer contains `assets` that `collect` data from `PostgreSQL` and `CSV` files using `Polars` `DataFrame`.
- **raw layer**: This layer contains `assets` that load data from `Polars` `DataFrame`s into the `Snowflake` warehouse in the `raw` schema.
- **staging layer**: This layer contains assets that handle data transformation from the `raw` schema, then the data is put into the `staging` schema.
- **mart layer**: This layer contains `assets` that compute aggregates from data in the `staging` schema and write the results into the `mart` schema.
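
Below is a minimal Dagster sketch of how the four layers can be expressed as dependent assets. The asset names and logic are illustrative assumptions; in the real project the staging and mart transformations are performed by dbt on Snowflake rather than in Polars.

```python
import polars as pl
from dagster import Definitions, asset


@asset(group_name="source")
def source_orders() -> pl.DataFrame:
    # Collect data from PostgreSQL / CSV into a Polars DataFrame.
    return pl.read_csv("data/olist_orders_dataset.csv")


@asset(group_name="raw")
def raw_orders(source_orders: pl.DataFrame) -> pl.DataFrame:
    # In the real pipeline this step writes the DataFrame to the Snowflake `raw` schema.
    return source_orders


@asset(group_name="staging")
def stg_orders(raw_orders: pl.DataFrame) -> pl.DataFrame:
    # Light cleaning before the mart layer (handled by dbt in the actual project).
    return raw_orders.drop_nulls(subset=["order_id"])


@asset(group_name="mart")
def fct_orders_by_status(stg_orders: pl.DataFrame) -> pl.DataFrame:
    # Aggregate for reporting.
    return stg_orders.group_by("order_status").agg(pl.len().alias("n_orders"))


defs = Definitions(assets=[source_orders, raw_orders, stg_orders, fct_orders_by_status])
```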
## Technologies
- `PostgreSQL`
- `Polars`
- `dbt`
- `Dagster`
- `Snowflake`
- `Docker`
- `Metabase`
- `Apache Superset`

## Features
Here's what you can do with ECommerce-ELT-Pipeline:
- You can completely change the logic or create new `assets` in the `data pipeline`, and perform `aggregate` `calculations` on the `assets` in the `pipeline` for your own purposes (see the sketch after this list).
- You can also create new `charts` or change the existing ones with the wide range of `chart types` available in `Metabase` and `Apache Superset`.
- You can also create new `dashboards` or change my existing ones as you like.
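
For example, a new aggregate `asset` could be added along these lines. This is a hypothetical sketch (the asset name, upstream dependencies, and the choice to aggregate in Polars rather than in a dbt model are assumptions):

```python
import polars as pl
from dagster import asset


@asset(group_name="mart")
def revenue_by_state(
    stg_orders: pl.DataFrame,
    stg_order_items: pl.DataFrame,
    stg_customers: pl.DataFrame,
) -> pl.DataFrame:
    # Hypothetical new mart asset: total item revenue per customer state.
    return (
        stg_orders
        .join(stg_order_items, on="order_id")
        .join(stg_customers, on="customer_id")
        .group_by("customer_state")
        .agg(pl.col("price").sum().alias("total_revenue"))
        .sort("total_revenue", descending=True)
    )
```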
## The Process

## What I Learned
## How can it be improved?
- Add more `data sources` to increase data richness.
- Consider other `data warehouses` besides `Snowflake`, such as `Amazon Redshift` or `Google BigQuery`.
- Perform more `cleaning` and `optimization` of the data.
- Perform more advanced `statistics`, `analysis` and `calculations`.
- Check out other popular `data orchestration` tools like `Apache Airflow`.
- Separate `dbt` into its own service (a separate `container`) in `Docker` when the project expands.
- Learn about `dbt packages` like `dbt-labs/dbt_utils` to help make the `transformation` process faster and more efficient.

## Running the Project
To run the project in your local environment, follow these steps:
1. Run `git clone https://github.com/longNguyen010203/ECommerce-ELT-Pipeline.git` to clone the repository to your local machine.
2. Run `make build` to build the images from the Dockerfile.
3. Run `make up` to pull images from Docker Hub and launch the services.
4. Run `make psql_create` to create the tables and schema in PostgreSQL.
5. Run `make psql_import` to load data from the CSV files into PostgreSQL (a conceptual sketch of this step follows the list).
6. Open [http://localhost:3001](http://localhost:3001) and click the `Materialize all` button to run the pipeline.
7. Open [https://app.snowflake.com](https://app.snowflake.com) and log in to check and monitor the updated data.
8. Open [http://localhost:3030](http://localhost:3030) to see the charts and dashboards.
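
For reference, steps 4-5 conceptually amount to creating a table and loading a CSV into PostgreSQL. The sketch below is not the project's Makefile code; the connection URI and file path are assumptions, and it requires a SQLAlchemy or ADBC driver to be installed.

```python
import polars as pl

# Hypothetical connection URI and file path -- the real values live in the
# project's Makefile and environment configuration.
POSTGRES_URI = "postgresql://admin:admin@localhost:5432/ecommerce"

df = pl.read_csv("data/olist_orders_dataset.csv")

# Create the table (replacing it if present) and load the CSV rows into PostgreSQL.
df.write_database(
    table_name="olist_orders_dataset",
    connection=POSTGRES_URI,
    if_table_exists="replace",
)
```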
## Video
[Demo.webm](https://github.com/longNguyen010203/ECommerce-ELT-Pipeline/assets/168116061/c22f700c-7b4f-4c05-bfac-7fb6943a0338)