# Jaffle Stripe Transformation Pipeline

A modern data transformation pipeline that integrates and transforms sample data from **Jaffle Shop** (a fictional e-commerce dataset) and **Stripe** (payment processing data) for analytics. Built with ❤️ using `dbt` (data build tool) and Airflow.

---
![diagram (1)](https://github.com/user-attachments/assets/5318a375-055b-4f1c-8db3-fa99277cd58c)

## 📖 Overview

This pipeline demonstrates how to:
1. **Extract** raw transactional data from Jaffle Shop (sample e-commerce data) and Stripe (payment gateway).
2. **Transform** the data into clean, analysis-ready datasets (e.g., customer lifetime value, payment analytics).
3. **Load** structured data into a data warehouse (e.g., Snowflake).

Ideal for learning data transformation patterns, idempotent modeling, and data quality testing with `dbt`.

## 🔄 Data Flow
![image](https://github.com/user-attachments/assets/9ee49239-d75c-41c3-8d28-88080796f6a7)
### Source Data
1. **Jaffle Shop** (`raw.jaffle_shop`)
   - `customers`: Raw customer information
   - `orders`: Order transactions with status tracking

2. **Stripe** (`raw.stripe`)
   - `payment`: Payment processing data with status and amounts

### Data Models

#### Staging Layer (`models/staging/`)
- **Jaffle Shop**
  - `stg_jaffle_shop__customers`: Cleaned customer data
  - `stg_jaffle_shop__orders`: Standardized order information
- **Stripe**
  - `stg_stripe__payments`: Normalized payment data (amounts converted to dollars; see the sketch below)
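
As an illustration, a staging model along these lines would perform that conversion. This is a sketch, not the repo's actual file: the raw column names (`orderid`, `paymentmethod`, `created`) are assumptions about the `raw.stripe.payment` table.

```sql
-- models/staging/stg_stripe__payments.sql -- illustrative sketch;
-- raw column names are assumed, not confirmed from the repository
with source as (

    select * from {{ source('stripe', 'payment') }}

),

renamed as (

    select
        id as payment_id,
        orderid as order_id,
        paymentmethod as payment_method,
        status,
        -- Stripe stores amounts as integer cents; convert to dollars
        amount / 100.0 as amount,
        created as created_at
    from source

)

select * from renamed
```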

#### Marts Layer (`models/marts/`)
- **Finance** (`marts/finance/`)
  - `fct_orders`: Order facts with payment amounts (see the sketch below)
    - Combines order information with successful payments
    - Calculates total order amounts
- **Marketing** (`marts/marketing/`)
  - `dim_customers`: Customer dimension with analytics
    - First and most recent order dates
    - Number of orders
    - Customer lifetime value
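
A minimal sketch of the `fct_orders` pattern, assuming the staging columns above and a `'success'` payment status (both assumptions, since the repo's SQL isn't reproduced here):

```sql
-- models/marts/finance/fct_orders.sql -- illustrative sketch
with orders as (

    select * from {{ ref('stg_jaffle_shop__orders') }}

),

payments as (

    select * from {{ ref('stg_stripe__payments') }}

),

-- total successful payment amount per order
order_payments as (

    select
        order_id,
        sum(case when status = 'success' then amount else 0 end) as amount
    from payments
    group by order_id

)

select
    orders.order_id,
    orders.customer_id,
    orders.order_date,
    orders.status,
    coalesce(order_payments.amount, 0) as amount
from orders
left join order_payments
    on orders.order_id = order_payments.order_id
```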

## 🧪 Testing & Quality

### Data Tests
1. **Generic Tests**
   - Primary key uniqueness
   - Referential integrity
   - Not null constraints
   - Accepted values validation

2. **Custom Tests**
   - `assert_positive_total_for_payments`: Ensures no negative payment totals (see the sketch below)
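
In dbt, a custom (singular) test is simply a query that selects failing rows; the test passes only when the query returns nothing. A plausible sketch of this test, reusing the staging model above:

```sql
-- tests/assert_positive_total_for_payments.sql -- illustrative sketch
-- Selects any order whose summed payment amount is negative;
-- dbt marks the test failed if one or more rows come back.
select
    order_id,
    sum(amount) as total_amount
from {{ ref('stg_stripe__payments') }}
group by order_id
having sum(amount) < 0
```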

### Data Quality Checks
- **Freshness Checks**
  - Orders: Warn after 12h, error after 24h
  - Payments: Warn after 24h, error after 72h

### Documentation
- Comprehensive field-level documentation
- Value definitions for status fields
- Payment method explanations

## 🛠️ Installation

### Prerequisites
- Python 3.8+
- `dbt-core` and `dbt-snowflake` adapter
- Snowflake account
- Apache Airflow
- Docker (for containerized deployment)

### Steps
1. **Clone the repository**:
```bash
git clone https://github.com/MarkPhamm/Jaffle-Stripe-Transformation-Pipeline.git
cd Jaffle-Stripe-Transformation-Pipeline
```

2. **Install dependencies**:
```bash
pip install -r requirements.txt
```

3. **Set up Snowflake connection** in `profiles.yml`:
```yaml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: [your_account]
      user: [username]
      password: [password]
      role: [role]
      database: raw
      warehouse: [warehouse]
      schema: dbt_schema
```

4. **Configure Airflow Connection**:
   - Connection ID: `snowflake_conn`
   - Connection Type: `Snowflake`
   - Schema: `dbt_schema`
   - Login: Your Snowflake username
   - Password: Your Snowflake password
   - Extra: Configure account, warehouse, database, and role

5. **Run the project in Docker**:
   - Build the Docker image:
     ```bash
     docker build -t jaffle-stripe-pipeline .
     ```
   - Run the Docker container:
     ```bash
     docker run -d -p 8080:8080 jaffle-stripe-pipeline
     ```
   - Access the Airflow UI at `http://localhost:8080`.

---

## ⚙️ Configuration

1. **Configure Data Sources**:
   - **Jaffle Shop**: Sample data included as CSV files (loaded via `dbt seed`).
   - **Stripe**: Set up Stripe API credentials in `stripe.yml` or environment variables.

2. **Update `dbt_project.yml`**:
   - Adjust database/schema names and materialization settings.

---

## 🚀 Usage

### Local Development
1. **Initialize dbt**:
```bash
cd dbt-dags/dags/dbt/jaffle_stripe
dbt deps
dbt seed
```

2. **Run transformations**:
```bash
dbt run
dbt test
```

3. **Generate documentation**:
```bash
dbt docs generate
dbt docs serve
```

### Airflow Orchestration
The pipeline is orchestrated using Airflow with the following features:
- Daily schedule (`@daily`)
- Automatic dependency installation
- Snowflake connection management
- Virtual environment isolation

## 📊 Key Metrics

The transformed data enables analysis of:
1. Customer Lifetime Value (CLV)
2. Order Success Rates
3. Payment Method Distribution (see the example query below)
4. Order Status Tracking
5. Customer Purchase Patterns
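
As one example, payment method distribution can be computed straight from the staging layer. This query is illustrative and reuses the assumed column names and `'success'` status from the sketches above:

```sql
-- Illustrative ad-hoc query: successful payment volume by method
select
    payment_method,
    count(*) as payment_count,
    sum(amount) as total_amount
from {{ ref('stg_stripe__payments') }}
where status = 'success'
group by payment_method
order by total_amount desc
```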

## 🔍 Data Definitions

### Order Status
| Status | Definition |
|--------|------------|
| placed | Order placed, not yet shipped |
| shipped | Order has been shipped |
| completed | Order received by customer |
| return_pending | Return requested |
| returned | Item returned |

### Payment Methods
| Method | Definition |
|--------|------------|
| credit_card | Credit card payment |
| coupon | Discount/promo coupon |
| bank_transfer | Direct bank transfer |
| gift_card | Gift card payment |

## 📚 Resources
- [dbt Documentation](https://docs.getdbt.com/docs/introduction)
- [Airflow Documentation](https://airflow.apache.org/docs/)
- [Snowflake Documentation](https://docs.snowflake.com/)

---

## 📂 Project Structure
```
jaffle-stripe-transformation-pipeline/
├── README.md                     # Project documentation
├── requirements.txt              # Python dependencies
├── .gitignore                    # Git ignore rules
└── dbt-dags/                     # Main project directory
    ├── README.md                 # DBT-specific documentation
    ├── Dockerfile                # Container configuration
    ├── requirements.txt          # DBT project dependencies
    ├── airflow_settings.yaml     # Airflow configuration
    ├── dags/                     # Airflow DAGs directory
    │   ├── transformation_dag.py # Main transformation DAG
    │   └── dbt/                  # DBT project files
    │       └── jaffle_stripe/    # Main DBT project
    │           ├── dbt_project.yml # DBT project configuration
    │           ├── packages.yml    # External package dependencies
    │           ├── models/         # Data transformation models
    │           │   ├── staging/      # Raw data models
    │           │   ├── intermediate/ # Intermediate transformations
    │           │   └── marts/        # Final presentation layer
    │           ├── macros/         # Reusable SQL macros
    │           ├── tests/          # Custom data tests
    │           ├── seeds/          # Static data files
    │           ├── snapshots/      # Type 2 SCD tracking
    │           └── analyses/       # Ad-hoc analyses
    ├── tests/                    # Airflow tests
    ├── plugins/                  # Airflow plugins
    └── include/                  # Additional resources
```

### 📁 Directory Overview

- **`models/`**: Contains all dbt data models organized in layers:
  - `staging/`: Initial data models that clean and standardize raw data
  - `intermediate/`: Complex transformations and business logic
  - `marts/`: Final presentation layer for business users

- **`macros/`**: Reusable SQL snippets and utility functions
- **`tests/`**: Custom data quality tests and assertions
- **`seeds/`**: Static CSV files for reference data
- **`snapshots/`**: Type 2 Slowly Changing Dimension (SCD) tracking (see the sketch below)
- **`analyses/`**: One-off analytical queries and explorations
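
For reference, a dbt snapshot follows the pattern below. This is a generic sketch, not a file from this repo: the snapshot name, unique key, and `updated_at` column are all assumptions.

```sql
-- snapshots/customers_snapshot.sql -- illustrative sketch; the name,
-- key, and timestamp column are assumptions, not taken from the repo
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- On each run, dbt compares rows to the prior snapshot and versions any
-- changes, populating dbt_valid_from / dbt_valid_to for Type 2 SCD history.
select * from {{ source('jaffle_shop', 'customers') }}

{% endsnapshot %}
```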

### 🔄 Airflow Integration

The project uses Apache Airflow for orchestration:
- `dags/transformation_dag.py`: Defines the main ETL pipeline
- `airflow_settings.yaml`: Configures Airflow environment
- `Dockerfile`: Sets up containerized deployment

---