https://github.com/tuni56/ecommerce-data-warehouse-redshift
Designed and implemented a production-style ecommerce data warehouse on AWS using Redshift Serverless, with incremental pipelines, late-arriving data handling, star schema modeling, and explicit data quality controls. The system supports reprocessing, cost-aware analytics, and BI-ready consumption.
aws datawarehouse iac-terraform redshift serverless
- Host: GitHub
- URL: https://github.com/tuni56/ecommerce-data-warehouse-redshift
- Owner: tuni56
- Created: 2026-01-25T11:45:54.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-02-10T19:02:33.000Z (2 months ago)
- Last Synced: 2026-02-10T22:07:01.929Z (2 months ago)
- Topics: aws, datawarehouse, iac-terraform, redshift, serverless
- Language: HCL
- Homepage:
- Size: 37.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
# Ecommerce Data Warehouse on Amazon Redshift Serverless
A production-style data warehouse implementation for ecommerce analytics, built on AWS with Redshift Serverless, Glue, dbt, and Terraform. This project demonstrates end-to-end data pipeline design, from raw ingestion to analytics-ready dimensional models.
## Business Value
**For C-Level:**
- Single source of truth for ecommerce metrics (revenue, customer lifetime value, product performance)
- Serverless architecture reduces operational overhead and scales automatically with demand
- Cost-optimized design with usage-based pricing (Redshift Serverless bills for the RPU-hours consumed while queries run, not per query)
- Historical tracking enables trend analysis and forecasting
**For Technical Teams:**
- Medallion architecture (Bronze → Silver → Gold) ensures data quality and lineage
- Infrastructure as Code enables reproducible deployments across environments
- Incremental processing minimizes compute costs and latency
- Star schema design optimized for BI tool performance
## Architecture
The solution implements a modern lakehouse pattern with three distinct layers:
```mermaid
graph LR
    A[Source Systems<br/>OLTP Databases] -->|CDC/Batch Export| B[S3 Raw Zone<br/>Bronze Layer]
    B -->|AWS Glue ETL| C[S3 Staging Zone<br/>Silver Layer]
    C -->|COPY/Incremental Load| D[Redshift Serverless<br/>Gold Layer]
    D -->|SQL Queries| E[BI Tools<br/>QuickSight/Tableau]
    D -->|Ad-hoc Analysis| F[Data Analysts]
    style B fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#fff
    style C fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#fff
    style D fill:#8c4fff,stroke:#232f3e,stroke-width:2px,color:#fff
    style E fill:#232f3e,stroke:#ff9900,stroke-width:2px,color:#fff
```
**Key Components:**
- **Amazon S3**: Immutable raw data storage and staging layer
- **AWS Glue**: Serverless ETL for data cleaning and standardization
- **Amazon Redshift Serverless**: Analytics engine with star schema dimensional model
- **dbt**: SQL-based transformations with built-in testing and documentation
- **Terraform**: Infrastructure provisioning with modular, reusable components
[Detailed architecture diagrams →](docs/architecture-diagram.md) | [Architecture documentation →](docs/architecture.md)
## Data Model
Star schema optimized for analytical queries:
**Fact Table:**
- `fact_order_items` - Grain: one row per order item
**Dimensions:**
- `dim_customers` - SCD Type 2 for historical tracking
- `dim_products` - SCD Type 2 for price/attribute changes
- `dim_date` - Standard date dimension for time-series analysis
- `dim_payment_methods`
- `dim_shipment_status`
This design prevents double-counting, enables flexible slicing/dicing, and maintains historical accuracy for point-in-time reporting.
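As a minimal sketch of how an SCD Type 2 dimension supports point-in-time lookups, the example below uses an in-memory SQLite database as a local stand-in for Redshift. The `email`, `valid_from`, and `valid_to` columns and both helper functions are hypothetical illustrations, not taken from this repository's models.

```python
import sqlite3

def apply_scd2(conn, customer_id, email, change_date):
    """Close the current version if the tracked attribute changed, then add a new one."""
    row = conn.execute(
        "SELECT customer_key, email FROM dim_customers "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if row is not None and row[1] == email:
        return  # attribute unchanged: keep the current version
    if row is not None:
        # Expire the old version at the moment the change takes effect.
        conn.execute(
            "UPDATE dim_customers SET valid_to = ?, is_current = 0 "
            "WHERE customer_key = ?",
            (change_date, row[0]),
        )
    conn.execute(
        "INSERT INTO dim_customers "
        "(customer_id, email, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, email, change_date),
    )

def email_as_of(conn, customer_id, as_of):
    """Return the attribute value that was valid on the given ISO date."""
    row = conn.execute(
        "SELECT email FROM dim_customers "
        "WHERE customer_id = ? AND valid_from <= ? AND ? < valid_to",
        (customer_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customers (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, one per version
    customer_id  TEXT NOT NULL,                      -- natural key from the source
    email        TEXT NOT NULL,
    valid_from   TEXT NOT NULL,
    valid_to     TEXT NOT NULL,
    is_current   INTEGER NOT NULL
)""")
apply_scd2(conn, "C1", "old@example.com", "2025-01-01")
apply_scd2(conn, "C1", "new@example.com", "2025-06-01")
apply_scd2(conn, "C1", "new@example.com", "2025-07-01")  # no change, no new row
print(email_as_of(conn, "C1", "2025-03-15"))
```

Each change adds a row instead of overwriting one, which is the storage and join cost the design accepts in exchange for historical accuracy.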
[Data model details →](docs/data_model.md)
## Technology Stack
| Layer | Technology | Purpose |
|-------|-----------|---------|
| Storage | Amazon S3 | Raw and staging data lake |
| ETL | AWS Glue | Serverless data processing |
| Warehouse | Redshift Serverless | Analytics engine |
| Transformation | dbt Core | SQL-based modeling & testing |
| IaC | Terraform | Infrastructure provisioning |
| Orchestration | AWS Step Functions | Workflow coordination |
| Monitoring | CloudWatch | Logging and alerting |
## Project Structure
```
.
├── analytics/
│   ├── dbt/                  # dbt models, tests, and documentation
│   └── sql_examples/         # Sample analytical queries
├── infra/
│   ├── terraform/            # Infrastructure as Code
│   │   ├── modules/          # Reusable Terraform modules
│   │   └── environments/     # Environment-specific configs (dev/prod)
│   └── diagrams/             # Architecture diagrams
└── docs/                     # Technical documentation
```
## Quick Start
### Prerequisites
- AWS Account with appropriate IAM permissions
- Terraform >= 1.5.0
- AWS CLI configured with credentials
- dbt Core >= 1.6.0
- Python >= 3.9
### Deployment
**1. Clone the repository**
```bash
git clone https://github.com/tuni56/ecommerce-data-warehouse-redshift.git
cd ecommerce-data-warehouse-redshift
```
**2. Initialize Terraform**
```bash
cd infra/terraform/environments/dev
terraform init
```
**3. Review and apply infrastructure**
```bash
terraform plan
terraform apply
```
This provisions:
- S3 buckets (raw, staging, logs)
- Redshift Serverless namespace and workgroup
- Glue jobs and crawlers
- IAM roles and policies
- VPC and security groups
**4. Configure dbt**
```bash
cd ../../../../analytics/dbt  # back to the repo root from infra/terraform/environments/dev
cp profiles.yml.example profiles.yml
# Edit profiles.yml with your Redshift endpoint
```
**5. Run dbt transformations**
```bash
dbt deps
dbt run --target dev
dbt test
```
**6. Verify deployment**
```bash
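# Submit a row-count query to confirm data loaded.
# execute-statement is asynchronous and returns a statement Id;
# fetch the rows with: aws redshift-data get-statement-result --id <Id>
aws redshift-data execute-statement \
  --workgroup-name ecommerce-dwh-dev \
  --database dev \
  --sql "SELECT COUNT(*) FROM fact_order_items;"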
```
## Key Features
### Infrastructure as Code
- Modular Terraform design for reusability across environments
- Separate state management per environment (dev/staging/prod)
- Automated resource tagging for cost allocation
### Data Quality
- dbt tests for uniqueness, referential integrity, and not-null constraints
- Row count reconciliation between layers
- Freshness checks for SLA monitoring
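The checks listed above can be sketched as plain SQL assertions. The example below is an illustration under stated assumptions, not the project's dbt test suite: it runs against in-memory SQLite, and the `order_item_id` column, the `run_checks` helper, and the sample rows are all hypothetical.

```python
import sqlite3

def run_checks(conn, staged_row_count):
    """Return {check_name: violation_count}; 0 means the check passed."""
    def scalar(sql):
        return conn.execute(sql).fetchone()[0]

    return {
        # Uniqueness: no order item may appear more than once in the fact table.
        "unique_order_item_id": scalar(
            "SELECT COUNT(*) FROM (SELECT order_item_id FROM fact_order_items "
            "GROUP BY order_item_id HAVING COUNT(*) > 1)"
        ),
        # Not-null: every fact row needs an amount.
        "not_null_total_amount": scalar(
            "SELECT COUNT(*) FROM fact_order_items WHERE total_amount IS NULL"
        ),
        # Referential integrity: every fact row must match a dimension row.
        "fk_product_key": scalar(
            "SELECT COUNT(*) FROM fact_order_items f "
            "LEFT JOIN dim_products p ON f.product_key = p.product_key "
            "WHERE p.product_key IS NULL"
        ),
        # Reconciliation: loaded rows should equal the rows staged upstream.
        "row_count_diff": abs(
            scalar("SELECT COUNT(*) FROM fact_order_items") - staged_row_count
        ),
    }

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_products (product_key INTEGER PRIMARY KEY);
CREATE TABLE fact_order_items (
    order_item_id INTEGER, product_key INTEGER, total_amount REAL
);
INSERT INTO dim_products VALUES (1), (2);
-- Deliberately bad rows: a duplicate id, a NULL amount, and an orphaned key.
INSERT INTO fact_order_items VALUES (1, 1, 10.0), (2, 2, NULL), (2, 99, 5.0);
""")
results = run_checks(conn, staged_row_count=4)
failed = [name for name, violations in results.items() if violations > 0]
print(failed)
```

Returning violation counts rather than booleans makes it easy to track the same numbers over time as data quality KPIs.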
### Performance Optimization
- Distribution keys on fact tables for co-located joins
- Sort keys on date columns for time-series queries
- Incremental models to minimize full table scans
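- Workgroup-level query limits and RPU usage limits for workload control (manual WLM queues are a provisioned-cluster feature; Redshift Serverless manages workloads automatically)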
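One common way to keep incremental loads cheap while still picking up late-arriving rows is a high-water mark with a lookback window. The sketch below shows that idea on in-memory SQLite. The `stg_order_items` table, the `updated_at` column, the three-day window, and the `incremental_load` helper are assumptions for illustration and may differ from how this repository's models are written.

```python
import sqlite3
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=3)  # assumed late-arrival window; tune to your sources

def incremental_load(conn):
    """Reload only staged rows newer than (target high-water mark - LOOKBACK)."""
    high_water = conn.execute(
        "SELECT MAX(updated_at) FROM fact_order_items"
    ).fetchone()[0]
    if high_water is None:
        cutoff = ""  # empty target: every staged row qualifies
    else:
        cutoff = (datetime.fromisoformat(high_water) - LOOKBACK).isoformat()
    to_load = conn.execute(
        "SELECT COUNT(*) FROM stg_order_items WHERE updated_at > ?", (cutoff,)
    ).fetchone()[0]
    # Delete-then-insert keeps the load idempotent for rows inside the window.
    conn.execute(
        "DELETE FROM fact_order_items WHERE order_item_id IN "
        "(SELECT order_item_id FROM stg_order_items WHERE updated_at > ?)",
        (cutoff,),
    )
    conn.execute(
        "INSERT INTO fact_order_items "
        "SELECT order_item_id, total_amount, updated_at FROM stg_order_items "
        "WHERE updated_at > ?",
        (cutoff,),
    )
    return to_load

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_order_items (
    order_item_id INTEGER, total_amount REAL, updated_at TEXT
);
CREATE TABLE fact_order_items (
    order_item_id INTEGER PRIMARY KEY, total_amount REAL, updated_at TEXT
);
INSERT INTO fact_order_items VALUES
    (1, 10.0, '2025-03-01T00:00:00'),
    (2, 20.0, '2025-03-10T00:00:00');
INSERT INTO stg_order_items VALUES
    (1, 10.0, '2025-03-01T00:00:00'),  -- already loaded, outside the window
    (2, 25.0, '2025-03-11T06:00:00'),  -- updated after the last load
    (3, 30.0, '2025-03-08T12:00:00'),  -- late arrival, older than the high-water mark
    (4, 40.0, '2025-03-11T00:00:00');  -- new row
""")
loaded = incremental_load(conn)
print(loaded)
```

A strict `updated_at > MAX(updated_at)` filter would skip row 3 here. The lookback window trades a small amount of rescanning for correctness with late data.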
### Cost Management
- Redshift Serverless auto-scales based on workload
- S3 lifecycle policies for archival to Glacier
- Glue job bookmarks prevent reprocessing
- Development environment with reduced capacity
## Sample Analytics Queries
**Monthly Revenue Trend:**
```sql
SELECT
    d.year_month,
    SUM(f.total_amount) AS revenue
FROM fact_order_items f
JOIN dim_date d ON f.order_date_key = d.date_key
GROUP BY 1
ORDER BY 1;
```
**Top Products by Revenue:**
```sql
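SELECT
    p.product_name,
    SUM(f.quantity) AS units_sold,
    SUM(f.total_amount) AS revenue
FROM fact_order_items f
JOIN dim_products p ON f.product_key = p.product_key
-- Caveat: if product_key identifies one SCD Type 2 version, this filter keeps
-- only sales recorded against the current version of each product.
WHERE p.is_current = TRUE
GROUP BY 1
ORDER BY 3 DESC
LIMIT 10;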
```
[More examples →](analytics/sql_examples/)
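The `is_current` filter in the product query interacts with SCD Type 2 in a way worth checking. The sketch below assumes that `product_key` is a surrogate key with one value per product version and that a natural key exists (named `product_id` here, a hypothetical column). It uses in-memory SQLite to compare filtering on the current version with rolling all versions up to the current product name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_products (
    product_key  INTEGER PRIMARY KEY,  -- surrogate key, one per version
    product_id   TEXT,                 -- natural key (hypothetical column)
    product_name TEXT,
    is_current   INTEGER
);
CREATE TABLE fact_order_items (
    product_key INTEGER, quantity INTEGER, total_amount REAL
);
-- One product with a historical version (key 1) and a current version (key 2).
INSERT INTO dim_products VALUES (1, 'P1', 'Mug', 0), (2, 'P1', 'Mug', 1);
-- One sale recorded against each version.
INSERT INTO fact_order_items VALUES (1, 2, 20.0), (2, 1, 12.0);
""")

# Filtering on is_current drops facts that point at historical versions.
current_only = conn.execute("""
    SELECT p.product_name, SUM(f.total_amount)
    FROM fact_order_items f
    JOIN dim_products p ON f.product_key = p.product_key
    WHERE p.is_current = 1
    GROUP BY p.product_name
""").fetchall()

# Rolling up through the natural key attributes every sale to the current name.
all_versions = conn.execute("""
    SELECT cur.product_name, SUM(f.total_amount)
    FROM fact_order_items f
    JOIN dim_products p ON f.product_key = p.product_key
    JOIN dim_products cur
      ON cur.product_id = p.product_id AND cur.is_current = 1
    GROUP BY cur.product_name
""").fetchall()

print(current_only, all_versions)
```

Which form is right depends on the question: revenue under the current price or attributes, or lifetime revenue per product.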
## Design Decisions
Key architectural choices and tradeoffs:
- **Redshift Serverless vs Provisioned**: Chose serverless for automatic scaling and simplified operations. Suitable for variable workloads with unpredictable query patterns.
- **Star Schema vs Data Vault**: Star schema prioritizes query simplicity and BI tool compatibility over extreme flexibility.
- **SCD Type 2 for dimensions**: Enables point-in-time analysis at the cost of increased storage and join complexity.
- **Glue vs EMR**: Glue's serverless model reduces operational burden for moderate data volumes (under 10 TB).
- **S3 as staging layer**: Decouples ingestion from transformation, enables reprocessing, and reduces Redshift storage costs.
[Full decision log →](docs/decisions.md)
## Monitoring & Observability
- **CloudWatch Dashboards**: Query performance, RPU consumption, data freshness
- **Glue Job Metrics**: Success rate, duration, DPU utilization
- **dbt Test Results**: Data quality KPIs tracked over time
- **Cost Alerts**: Budget thresholds for Redshift and Glue
## Development Workflow
This project follows the GitFlow branching strategy:
- `main` - Production-ready code
- `develop` - Integration branch for features
- `feature/*` - Individual feature branches
- `hotfix/*` - Emergency production fixes
**Contributing:**
1. Create feature branch from `develop`
2. Implement changes with tests
3. Submit PR with description of changes
4. Merge to `develop` after review
5. Release to `main` when ready for production
## Roadmap
- [ ] CI/CD pipeline with GitHub Actions
- [ ] Incremental dbt models for large fact tables
- [ ] Real-time ingestion with Kinesis Data Firehose
- [ ] ML integration for customer churn prediction
- [ ] Cross-region disaster recovery
## Cost Estimation
**Development Environment** (monthly):
- Redshift Serverless: ~$50-100 (8 RPU-hours/day)
- S3 Storage: ~$5 (100GB)
- Glue Jobs: ~$20 (daily runs)
- **Total: ~$75-125/month**
**Production Environment** (monthly, estimated):
- Redshift Serverless: ~$500-1000 (depends on query load)
- S3 Storage: ~$50 (1TB)
- Glue Jobs: ~$100 (hourly incremental loads)
- **Total: ~$650-1150/month**
Use [AWS Pricing Calculator](https://calculator.aws) for detailed estimates based on your workload.
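The Redshift Serverless figures follow from RPU-hours multiplied by a per-RPU-hour rate. The sketch below reproduces the development estimate under one loud assumption: the rate of $0.375 per RPU-hour is illustrative and not taken from this README, and actual rates vary by region. The function name is hypothetical.

```python
def redshift_serverless_monthly_cost(rpu_hours_per_day, usd_per_rpu_hour, days=30):
    """Monthly cost = RPU-hours per day x days x price per RPU-hour."""
    return rpu_hours_per_day * days * usd_per_rpu_hour

# Assumption for illustration only; check the current rate for your region.
ASSUMED_USD_PER_RPU_HOUR = 0.375

dev_redshift = redshift_serverless_monthly_cost(8, ASSUMED_USD_PER_RPU_HOUR)
dev_total = dev_redshift + 5 + 20  # plus the S3 and Glue estimates for dev
print(f"Redshift: ${dev_redshift:.2f}, total: ${dev_total:.2f}")
# prints "Redshift: $90.00, total: $115.00"
```

At the assumed rate, 8 RPU-hours per day lands inside both the $50-100 Redshift range and the $75-125 total quoted for the development environment.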
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Author
**Rocio** - AWS Data Engineer
[GitHub](https://github.com/) | [LinkedIn](https://linkedin.com/in/)
---
*This project is a portfolio demonstration and not affiliated with any commercial entity.*