https://github.com/alexuscr-27/amazon-data-etl
ETL pipeline for Amazon product sales data, using Apache Airflow for data orchestration and Supabase for storage. By containerizing the environment with Docker, the setup is scalable and easily deployable, supporting data-driven decision-making.
airflow airflow-docker data-engineering data-visualization docker postgresql powerbi python3 supabase
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/alexuscr-27/amazon-data-etl
- Owner: ALEXUSCR-27
- License: apache-2.0
- Created: 2024-09-27T05:12:03.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-03T19:55:05.000Z (over 1 year ago)
- Last Synced: 2025-04-03T17:52:46.379Z (about 1 year ago)
- Topics: airflow, airflow-docker, data-engineering, data-visualization, docker, postgresql, powerbi, python3, supabase
- Language: Python
- Homepage:
- Size: 4.49 MB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
## About the project
This project builds an ETL (Extract, Transform, Load) pipeline to process and analyze Amazon product data from a CSV source file. It uses Apache Airflow to orchestrate and automate the operations and Supabase for database management. The pipeline organizes the data for Business Intelligence (BI) and analysis; Google Looker Studio then generates visualizations and charts based on the processed data, using attributes such as `Product name`, `Categories`, `Subcategories`, `Prices`, `Ratings` and `Reviews`.
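The three stages above can be sketched as plain Python functions over inline sample data (a minimal illustration only: the actual project wires these steps into Airflow tasks, and the column names and raw formats here are assumptions, not taken from the repository):

```python
import csv
import io

# Inline rows standing in for the raw Amazon products CSV (assumed layout).
RAW_CSV = """product_name,category,discounted_price,rating
Wayona Nylon Braided Cable,Computers&Accessories|Cables,₹399,4.2
Ambrane USB Charger,Electronics|Chargers,₹329,4.0
"""

def extract(source: str) -> list[dict]:
    """Extract: read raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: parse the price string and split the pipe-delimited category."""
    out = []
    for row in rows:
        top, _, sub = row["category"].partition("|")
        out.append({
            "product_name": row["product_name"],
            "category": top,
            "subcategory": sub,
            "price": float(row["discounted_price"].lstrip("₹").replace(",", "")),
            "rating": float(row["rating"]),
        })
    return out

def load(rows: list[dict]) -> None:
    """Load: in the real pipeline this step would insert into Supabase PostgreSQL."""
    for row in rows:
        print(row)

load(transform(extract(RAW_CSV)))
```

In the repository these stages run as scheduled Airflow tasks rather than a direct function chain.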
## Features
- Dataset: The dataset can be found on Kaggle [here](https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset); it is also included in the `data/raw` folder of the repository.
- ETL Pipeline Automation: Orchestrate data extraction, transformation and loading processes using Apache Airflow, containerized with Docker to simplify deployment and ensure consistency across environments.
- Data Cleansing and Transformation: Processes raw CSV data into a structured format, cleaning and preparing different attributes like `Prices`, `Discount percentage`, `Ratings`, `Ratings count`, `Category` and creating new columns like `Sub-categories`.
- Database Management: Manages storage and retrieval of transformed data, optimizing it for Business Intelligence (BI) using Supabase services with PostgreSQL as the database.
- Business Intelligence Visualizations: Generates dynamic charts and reports in Google Looker Studio for deeper insights, including visualizations of popular product categories, rating trends, pricing distributions, and discount analytics.
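As an illustration of the cleansing and transformation step described above, here is a hedged pandas sketch; the column names and raw value formats (currency symbols, percent signs, pipe-delimited categories) are assumptions about the Kaggle dataset's layout, not code from this repository:

```python
import pandas as pd

# Inline sample mimicking the assumed raw CSV formats.
raw = pd.DataFrame({
    "discounted_price": ["₹399", "₹1,099"],
    "discount_percentage": ["64%", "49%"],
    "rating_count": ["24,269", "1,040"],
    "category": ["Computers&Accessories|Cables|USBCables",
                 "Electronics|Mobiles|Smartphones"],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Strip currency symbols and thousands separators, then cast to numeric.
    df["price"] = (df["discounted_price"].str.lstrip("₹")
                   .str.replace(",", "").astype(float))
    df["discount_pct"] = df["discount_percentage"].str.rstrip("%").astype(float)
    df["rating_count"] = df["rating_count"].str.replace(",", "").astype(int)
    # Split the pipe-delimited category path into category / sub-category columns.
    parts = df["category"].str.split("|", expand=True)
    df["main_category"] = parts[0]
    df["sub_category"] = parts[1]
    return df

cleaned = clean(raw)
print(cleaned[["price", "discount_pct", "rating_count",
               "main_category", "sub_category"]])
```

The cleaned frame is what the load step would then write to the Supabase-hosted database.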
## Getting started
- Clone the repository
```sh
git clone https://github.com/ALEXUSCR-27/Amazon-Data-ETL.git
cd Amazon-Data-ETL
```
- Configure your Airflow credentials under the `airflow-init` service in the `docker-compose` file
```yaml
_AIRFLOW_WWW_USER_USERNAME: your_username
_AIRFLOW_WWW_USER_PASSWORD: your_password
```
- Build and start the containers
```sh
docker-compose up --build
```
- Access Airflow
Open your browser, go to `http://localhost:8080`, and sign in with your credentials.
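Once the DAG runs, the load stage writes the transformed rows into the Supabase-hosted PostgreSQL database. A minimal sketch of that step follows, using the standard-library `sqlite3` module as a stand-in for a real PostgreSQL connection (the table and column names are assumptions; the actual project would connect to Supabase, e.g. via a PostgreSQL driver and the project's connection string):

```python
import sqlite3

# Stand-in in-memory database; the real pipeline targets Supabase PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS products (
        product_name TEXT,
        main_category TEXT,
        sub_category TEXT,
        price REAL,
        rating REAL
    )
""")

rows = [
    ("Wayona Nylon Braided Cable", "Computers&Accessories", "Cables", 399.0, 4.2),
    ("Ambrane USB Charger", "Electronics", "Chargers", 329.0, 4.0),
]
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])  # prints 2
```

With the data in PostgreSQL, BI tools such as Google Looker Studio can query it directly for the visualizations described above.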