Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/longnguyen010203/ecommerce-elt-pipeline
A Data Engineering project that implements an ELT data pipeline using Dagster, Docker, dbt, Polars, Snowflake, and PostgreSQL. Data is sourced from the Kaggle website.
dagster data data-engineering dbt docker docker-compose dockerfile elt elt-pipeline extract kaggle load polars postgresql raw-data relational-databases snowflake transform
Last synced: 30 days ago
- Host: GitHub
- URL: https://github.com/longnguyen010203/ecommerce-elt-pipeline
- Owner: longNguyen010203
- License: mit
- Created: 2024-06-10T16:36:24.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-06-14T18:23:17.000Z (6 months ago)
- Last Synced: 2024-06-15T03:00:08.770Z (6 months ago)
- Topics: dagster, data, data-engineering, dbt, docker, docker-compose, dockerfile, elt, elt-pipeline, extract, kaggle, load, polars, postgresql, raw-data, relational-databases, snowflake, transform
- Language: Python
- Homepage:
- Size: 6.84 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# ECommerce-ELT-Pipeline
In this project, I build a simple data pipeline following the ELT (extract, load, transform) model using the Brazilian E-Commerce dataset. The data is processed and transformed to serve reporting, in-depth analysis, and the Data Analyst team.
## About Project
#### 1. Pipeline Design
- **Data Source**: The project uses the [Brazilian Ecommerce](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) public dataset by Olist, downloaded from [kaggle.com](https://www.kaggle.com) in `.csv` format.
  - Five of the CSV files are loaded into `PostgreSQL`, which is treated as a data source.
  - The remaining four CSV files are extracted directly.
- **Extract Data**: Data is extracted into `Polars` `DataFrame`s from both the `PostgreSQL` database and the `CSV` files (see the sketch after this list).
- **Load Data**: After extracting data from these two sources, the `Polars` `DataFrame`s are loaded into the `raw` layer in `Snowflake`.
- **Transform Data**: After loading, the data is transformed with `dbt` on `Snowflake` to create `dimension` and `fact` tables in the `staging` layer and to calculate aggregates in the `mart` layer.
- **Serving**: Data is served for `reporting`, `analysis`, and `decision support` using `Metabase` and `Apache Superset`.
- **Packaging and Orchestration**: The entire project is packaged and orchestrated with `Docker` and `Dagster`.
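
As a rough illustration of the extract step, here is a minimal Polars sketch. The connection URI, table name, and file path are assumptions, not the project's actual configuration, and depending on your Polars version the database read may be `pl.read_database` with a connection object instead.

```python
import polars as pl

# Hypothetical connection URI and paths -- adjust to your own environment.
POSTGRES_URI = "postgresql://admin:admin@localhost:5432/ecommerce"

# Extract a table that was seeded into PostgreSQL
# (requires a connector backend such as connectorx or ADBC).
orders = pl.read_database_uri(
    query="SELECT * FROM olist_orders_dataset",
    uri=POSTGRES_URI,
)

# Extract one of the CSV files that is read directly from disk.
reviews = pl.read_csv("data/olist_order_reviews_dataset.csv")

print(orders.shape, reviews.shape)
```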
#### 2. Data File Relationships
- **olist_geolocation_dataset**: This dataset contains Brazilian zip codes and their lat/lng coordinates.
- **olist_customers_dataset**: This dataset has information about the customer and their location.
- **olist_order_items_dataset**: This dataset includes data about the items purchased within each order.
- **olist_order_payments_dataset**: This dataset includes data about the orders' payment options.
- **olist_order_reviews_dataset**: This dataset includes data about the reviews made by the customers.
- **olist_orders_dataset**: This is the core dataset. From each order you can reach all of the other information (see the join sketch after this list).
- **olist_products_dataset**: This dataset includes data about the products sold by Olist.
- **olist_sellers_dataset**: This dataset includes data about the sellers that fulfilled orders made at Olist.
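
To make the key relationships concrete, here is a small Polars join sketch. The file paths are assumptions; the join keys (`order_id`, `customer_id`) are the shared columns in the Olist dataset.

```python
import polars as pl

# Hypothetical local paths to the Kaggle CSV files.
orders = pl.read_csv("data/olist_orders_dataset.csv")
items = pl.read_csv("data/olist_order_items_dataset.csv")
customers = pl.read_csv("data/olist_customers_dataset.csv")

# Starting from an order, the other datasets are reachable through shared keys.
order_details = (
    orders
    .join(items, on="order_id", how="inner")        # order -> purchased items
    .join(customers, on="customer_id", how="left")  # order -> customer / location
)

print(order_details.columns)
```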
#### 3. Data Lineage
The lineage graph (Dagster) in this project consists of four layers (a minimal asset sketch follows this list):
- **source layer**: This layer contains `assets` that `collect` data from `PostgreSQL` and `CSV` files using `Polars` `DataFrame`.
- **raw layer**: This layer contains `assets` that load data from `Polars` `DataFrame`s into the `Snowflake` warehouse in the `raw` schema.
- **staging layer**: This layer contains assets that handle data transformation from the `raw` schema, then the data is put into the `staging` schema.
- **mart layer**: This layer contains `assets` that compute aggregates from data in the `staging` schema and write the results into the `mart` schema.
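
Below is a minimal Dagster sketch of how the four layers can be expressed as dependent assets. The asset names and logic are illustrative assumptions; in the real project the staging and mart transformations are performed by dbt on Snowflake rather than in Polars.

```python
import polars as pl
from dagster import Definitions, asset


@asset(group_name="source")
def source_orders() -> pl.DataFrame:
    # Collect data from PostgreSQL / CSV into a Polars DataFrame.
    return pl.read_csv("data/olist_orders_dataset.csv")


@asset(group_name="raw")
def raw_orders(source_orders: pl.DataFrame) -> pl.DataFrame:
    # In the real pipeline this step writes the DataFrame to the Snowflake `raw` schema.
    return source_orders


@asset(group_name="staging")
def stg_orders(raw_orders: pl.DataFrame) -> pl.DataFrame:
    # Light cleaning before the mart layer (handled by dbt in the actual project).
    return raw_orders.drop_nulls(subset=["order_id"])


@asset(group_name="mart")
def fct_orders_by_status(stg_orders: pl.DataFrame) -> pl.DataFrame:
    # Aggregate for reporting.
    return stg_orders.group_by("order_status").agg(pl.len().alias("n_orders"))


defs = Definitions(assets=[source_orders, raw_orders, stg_orders, fct_orders_by_status])
```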
## Technologies
- `PostgreSQL`
- `Polars`
- `dbt`
- `Dagster`
- `Snowflake`
- `Docker`
- `Metabase`
- `Apache Superset`

## Features
Here's what you can do with ECommerce-ELT-Pipeline:
- You can completely change the logic or create new `assets` in the `data pipeline`, and perform `aggregate` `calculations` on the `assets` in the `pipeline` for your own purposes (see the sketch after this list).
- You can also create new `charts` or change the existing ones with the wide range of `chart types` available in `Metabase` and `Apache Superset`.
- You can also create new `dashboards` or change my existing ones as you like.
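
For example, a new aggregate `asset` could be added along these lines. This is a hypothetical sketch (the asset name, upstream dependencies, and the choice to aggregate in Polars rather than in a dbt model are assumptions):

```python
import polars as pl
from dagster import asset


@asset(group_name="mart")
def revenue_by_state(
    stg_orders: pl.DataFrame,
    stg_order_items: pl.DataFrame,
    stg_customers: pl.DataFrame,
) -> pl.DataFrame:
    # Hypothetical new mart asset: total item revenue per customer state.
    return (
        stg_orders
        .join(stg_order_items, on="order_id")
        .join(stg_customers, on="customer_id")
        .group_by("customer_state")
        .agg(pl.col("price").sum().alias("total_revenue"))
        .sort("total_revenue", descending=True)
    )
```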
## The Process

## What I Learned
## How can it be improved?
- Add more `data sources` to increase data richness.
- Consider other `data warehouses` besides `Snowflake`, such as `Amazon Redshift` or `Google BigQuery`.
- Perform more `cleaning` and `optimization` of the data.
- Perform more advanced `statistics`, `analysis` and `calculations`.
- Check out other popular `data orchestration` tools like `Apache Airflow`.
- Separate `dbt` into its own service (a separate `container`) in `Docker` when the project expands.
- Learn about `dbt packages` like `dbt-labs/dbt_utils` to help make the `transformation` process faster and more efficient.

## Running the Project
To run the project in your local environment, follow these steps:
1. Run `git clone https://github.com/longNguyen010203/ECommerce-ELT-Pipeline.git` to clone the repository to your local machine.
2. Run `make build` to build the images from the Dockerfile.
3. Run `make up` to pull images from Docker Hub and launch the services.
4. Run `make psql_create` to create the tables and schema in PostgreSQL.
5. Run `make psql_import` to load data from the CSV files into PostgreSQL (a conceptual sketch of this step follows the list).
6. Open [http://localhost:3001](http://localhost:3001) and click the `Materialize all` button to run the pipeline.
7. Open [https://app.snowflake.com](https://app.snowflake.com) and log in to check and monitor the updated data.
8. Open [http://localhost:3030](http://localhost:3030) to see the charts and dashboards.
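
For reference, steps 4-5 conceptually amount to creating a table and loading a CSV into PostgreSQL. The sketch below is not the project's Makefile code; the connection URI and file path are assumptions, and it requires a SQLAlchemy or ADBC driver to be installed.

```python
import polars as pl

# Hypothetical connection URI and file path -- the real values live in the
# project's Makefile and environment configuration.
POSTGRES_URI = "postgresql://admin:admin@localhost:5432/ecommerce"

df = pl.read_csv("data/olist_orders_dataset.csv")

# Create the table (replacing it if present) and load the CSV rows into PostgreSQL.
df.write_database(
    table_name="olist_orders_dataset",
    connection=POSTGRES_URI,
    if_table_exists="replace",
)
```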
## Video
[Demo.webm](https://github.com/longNguyen010203/ECommerce-ELT-Pipeline/assets/168116061/c22f700c-7b4f-4c05-bfac-7fb6943a0338)