Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/fa3001/retail-data-pipeline
- Host: GitHub
- URL: https://github.com/fa3001/retail-data-pipeline
- Owner: FA3001
- Created: 2024-07-11T07:04:47.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-07-12T14:25:24.000Z (6 months ago)
- Last Synced: 2024-07-13T14:50:27.986Z (6 months ago)
- Language: Python
- Size: 7.05 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Retail Data Pipeline with Airflow, Postgres, dbt, and Soda
## Objective
The goal of this project is to create an end-to-end data pipeline from a Kaggle retail dataset. This involves modeling the data into fact and dimension tables, implementing data quality checks, using modern data stack technologies (dbt, Soda, and Airflow), and storing the data in Postgres. The project is containerized with Docker and versioned on GitHub.
- Data Ingestion: We’ll start by extracting data from a CSV file and loading it into a PostgreSQL database (rather than Snowflake).
- Quality Checks: Using Soda, we’ll perform data quality checks to ensure data integrity and completeness.
- Data Transformation: We’ll use dbt (Data Build Tool) to transform the raw data into structured, analysis-ready data models.
- Orchestration: Apache Airflow will be used to automate and manage our data workflows, ensuring smooth and timely execution; a rough DAG sketch follows this list.
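As a rough sketch of how these four steps could be wired together in an Airflow DAG (the DAG name, connection string, file path, and task bodies below are illustrative assumptions, not this repository's actual code):

```python
# Illustrative sketch only: names, paths, and connection details are assumptions.
from datetime import datetime

import pandas as pd
from airflow.decorators import dag, task
from sqlalchemy import create_engine


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def retail_pipeline():
    @task
    def ingest_csv_to_postgres():
        # Load the raw CSV into a Postgres staging table (placeholder connection string).
        engine = create_engine("postgresql+psycopg2://airflow:airflow@postgres:5432/retail")
        df = pd.read_csv("include/dataset/online_retail.csv")
        df.to_sql("raw_invoices", engine, if_exists="replace", index=False)

    @task
    def check_raw_data():
        # Placeholder for the Soda scan over the raw table (see include/soda/check_function.py).
        ...

    @task
    def transform_with_dbt():
        # Placeholder for the dbt run (include/dbt appears to be wired in via Cosmos, per cosmos_config.py).
        ...

    ingest_csv_to_postgres() >> check_raw_data() >> transform_with_dbt()


retail_pipeline()
```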
## 🌟 System Architecture

![image](https://github.com/user-attachments/assets/c8f8ef68-063a-487e-a65b-aef19b17cfb6)

## 📁 Repository Structure
```shell
├── dags
│ ├── creation_table_country.sql
│ ├── postgres_ingest_data.py
├── include
│ ├── dataset
│ │ └── online_retail.csv
│ ├── dbt
│ │ ├── dbt_packages
│ │ ├── models
│ │ │ ├── report
│ │ │ ├── sources
│ │ │ └── transform
│ │ ├── cosmos_config.py
│ │ ├── dbt_project.yml
│ │ ├── packages.yml
│ │ └── profiles.yml
│ ├── soda
│ │ ├── checks
│ │ │ ├── report
│ │ │ ├── sources
│ │ │ └── transform
│ │ ├── check_function.py
│ │ └── configuration.yml
├── logs
├── plugins
├── .env
├── airflow.cfg
├── docker-compose.yml
└── Dockerfile
```

## Data modeling
![image](https://github.com/user-attachments/assets/c4b5421f-84ad-42bd-9edf-103b642034f0)
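As a rough illustration of what one of the models under include/dbt/models/transform might look like (the source reference and column names below are assumptions based on the common online-retail dataset, not this repository's actual models):

```sql
-- Hypothetical dimension model; source and column names are illustrative only.
with source as (
    select * from {{ source('retail', 'raw_invoices') }}
)

select distinct
    customerid as customer_id,
    country
from source
where customerid is not null
```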
## 🚀 Getting Started

1. **Clone the repository**
```bash
git clone https://github.com/FA3001/retail-data-pipeline
```
2. **Start the Docker containers**
```bash
docker-compose up -d
```
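Optionally, confirm that the Airflow and Postgres containers are up:
```bash
docker-compose ps
```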
3. **Download the dataset**
Download the Kaggle online retail CSV and place it at:
```bash
/include/datasets/online_retail.csv
```
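For example, assuming the CSV has been downloaded to the project root (the target folder follows the repository tree above; adjust the path if your checkout differs):
```bash
mv online_retail.csv include/dataset/online_retail.csv
```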
4. **API key for Soda Cloud**
Create an account on soda.io, generate an API key and its associated secret, and add them to
```bash
include/soda/configuration.yml
```
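A minimal sketch of what that file might contain (the data source name, connection values, and environment-variable names are placeholders, and the exact keys can differ between Soda Core versions):
```yaml
# Illustrative only: replace the placeholder values with your own settings.
data_source retail:
  type: postgres
  host: postgres
  port: 5432
  username: airflow
  password: airflow
  database: retail
  schema: public

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_CLOUD_API_KEY_ID}
  api_key_secret: ${SODA_CLOUD_API_KEY_SECRET}
```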
5. **Initialize the dbt configuration within the scheduler container** (steps 6-8 below run inside this shell)
```bash
docker ps   # find the ID of the Airflow scheduler container
docker exec -it <scheduler-container-id> bash
```
6. **Activate the dbt_venv**
```bash
source dbt_venv/bin/activate
```
7. **Go to include/dbt**
```bash
cd include/dbt
```
8. **Install dbt dependencies**
```bash
dbt deps
```
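Optionally, verify the project and database connection from the same directory (assuming the profiles.yml shown in the repository tree is used):
```bash
dbt debug --profiles-dir .
```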
9. **Access the Airflow UI** (username and password: `airflow`)
```bash
http://localhost:8080/
```
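From the UI you can unpause and trigger the pipeline's DAG. It can also be triggered from inside the scheduler container with the Airflow CLI (the DAG id below is a placeholder; use the id shown in the UI):
```bash
airflow dags trigger <dag_id>
```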