Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/longnguyen010203/ecommerce-elt-pipeline

๐ŸŒ„๐Ÿ“ˆ๐Ÿ“‰ A Data Engineering Project ๐ŸŒˆ that implements an ELT data pipeline using Dagster, Docker, Dbt, Polars, Snowflake, PostgreSQL. Data from kaggle website ๐Ÿ”ฅ
https://github.com/longnguyen010203/ecommerce-elt-pipeline

dagster data data-engineering dbt docker docker-compose dockerfile elt elt-pipeline extract kaggle load polars postgresql raw-data relational-databases snowflake transform

Last synced: 2 months ago
JSON representation

๐ŸŒ„๐Ÿ“ˆ๐Ÿ“‰ A Data Engineering Project ๐ŸŒˆ that implements an ELT data pipeline using Dagster, Docker, Dbt, Polars, Snowflake, PostgreSQL. Data from kaggle website ๐Ÿ”ฅ

Awesome Lists containing this project

README

        

# ๐ŸŒ‡ ECommerce-ELT-Pipeline

In this project, I build a simple data pipeline following the ELT(extract - load - transform) model using the Brazilian-Ecommerce dataset, perform data processing and transformation, serve to create reports, in-depth analysis and support for the Data Analyst team

## ๐Ÿ”ฆ About Project

#### 1. Pipeline Design

- **Data Source**: The project uses the [Brazilian Ecommerce](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) public dataset by Olist, downloaded from [kaggle.com](https://www.kaggle.com) in `.csv` format.
- The 5 csv files are loaded into `PostgreSQL`, considering it a data source.
- The remaining 4 csv files are extracted directly.
- **Extract Data**: Data is extracted using `Polars` as a `DataFrame` from a `PostgreSQL` database and `CSV` file.
- **Load Data**: After extracting data from the above two data sources, we load it into `Snowflake` at `raw` layer from `Polars` `DataFrame`.
- **Tranform Data**: After loading the data, we perform `transform` with `dbt` on `Snowflake` to create `dimension` and `fact` tables in the `staging` layer and calculate aggregates in the `mart` layer.
- **Serving**: Data is served for `reporting`, `analysis`, and `decision support` using `Metabase` and `Apache Superset`.
- **package and orchestrator**: The entire project is packaged and orchestrated by `Docker` and `Dagster`.

#### 2. Data File Relationships

- **olist_geolocation_dataset**: This dataset has information Brazilian zip codes and its lat/lng coordinates.
- **olist_customers_dataset**: This dataset has information about the customer and its location.
- **olist_order_items_dataset**: This dataset includes data about the items purchased within each order.
- **olist_order_payments_dataset**: This dataset includes data about the orders payment options.
- **olist_order_reviews_dataset**: This dataset includes data about the reviews made by the customers.
- **olist_orders_dataset**: This is the core dataset. From each order you might find all other information.
- **olist_products_dataset**: This dataset includes data about the products sold by Olist.
- **olist_sellers_dataset**: This dataset includes data about the sellers that fulfilled orders made at Olist.

#### 3. Data Lineage

Graph Lineage (dagster) trong dแปฑ รกn nร y bao gแป“m 4 layer:
- **source layer**: This layer contains `assets` that `collect` data from `PostgreSQL` and `CSV` files using `Polars` `DataFrame`.
- **raw layer**: This layer contains `assets` that perform the task of loading data from `Polars` `DataFrame` into `Snowflake` warehouse in `raw` schema.
- **staging layer**: This layer contains assets that handle data transformation from the `raw` schema, then the data is put into the `staging` schema.
- **mart layer**: This layer contains `assets` that are responsible for synthesizing calculations from data in the `staging` schema and then putting the data into the `mart` schema.

## ๐Ÿ“ฆ Technologies

- `PostgreSQL`
- `Polars`
- `Dbt`
- `Dagster`
- `Snowflake`
- `Docker`
- `Metabase`
- `Apache Superset`

## ๐Ÿฆ„ Features

Here's what you can do with NinjaSketch:
- You can completely change the logic or create new `assets` in the `data pipeline` as you wish, perform `aggregate` `calculations` on the `assets` in the `pipeline` according to your purposes.
- You can also create new `data charts` as well as change existing `charts` as you like with extremely diverse `chart types` on `Metabase` and `Apache Superset`.
- You can also create new or change my existing `dashboards` as you like

## ๐Ÿ‘ฉ๐Ÿฝโ€๐Ÿณ The Process

## ๐Ÿ“š What I Learned

## ๐Ÿ’ญ How can it be improved?

- Add more `data sources` to increase data richness.
- Refer to other `data warehouses` besides `Snowflake` such as `Amazon Redshift` or `Google Bigquery`.
- Perform more `cleaning` and `optimization` `processing` of the data.
- Perform more advanced `statistics`, `analysis` and `calculations`.
- Check out other popular and popular `data orchestration` tools like `Apache Airflow`.
- Separate `dbt` into a separate service (separate `container`) in `docker` when the project expands
- Learn about `dbt packages` like `dbt-labs/dbt_utils` to help make the `transformation` process faster and more optimal.

## ๐Ÿšฆ Running the Project

To run the project in your local environment, follow these steps:
1. Run `git clone https://github.com/longNguyen010203/ECommerce-ELT-Pipeline.git` to clone the repository to your local machine.
2. run `make build` to build the images from the Dockerfile
3. run `make up` to pull images from docker hub and launch services
4. run `make psql_create` to create tables with schema for PostgreSQL
5. run `make psql_import` to load data from CSV file to PostgreSQL
6. Open [http://localhost:3001](http://localhost:3001) and click `Materialize all` button to run the Pipeline
7. Open [https://app.snowflake.com](https://app.snowflake.com) and login to check and monitor updated data
8. Open [http://localhost:3030](http://localhost:3030) to see charts and dashboards

## ๐Ÿฟ Video
[Demo.webm](https://github.com/longNguyen010203/ECommerce-ELT-Pipeline/assets/168116061/c22f700c-7b4f-4c05-bfac-7fb6943a0338)