{"id":45965171,"url":"https://github.com/tuni56/ecommerce-data-warehouse-redshift","last_synced_at":"2026-02-28T14:10:56.886Z","repository":{"id":337683222,"uuid":"1141763487","full_name":"tuni56/ecommerce-data-warehouse-redshift","owner":"tuni56","description":"Designed and implemented a production-style ecommerce data warehouse on AWS using Redshift Serverless, with incremental pipelines, late-arriving data handling, star schema modeling, and explicit data quality controls. The system supports reprocessing, cost-aware analytics, and BI-ready consumption.","archived":false,"fork":false,"pushed_at":"2026-02-10T19:02:33.000Z","size":38,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-10T22:07:01.929Z","etag":null,"topics":["aws","datawarehouse","iac-terraform","redshift","serverless"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tuni56.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-25T11:45:54.000Z","updated_at":"2026-02-10T19:02:38.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tuni56/ecommerce-data-warehouse-redshift","commit_stats":null,"previous_names":["tuni56/ecommerce-data-warehouse-redshift"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/tuni56/ecommerce-data-warehouse-redshift","repository_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuni56%2Fecommerce-data-warehouse-redshift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuni56%2Fecommerce-data-warehouse-redshift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuni56%2Fecommerce-data-warehouse-redshift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuni56%2Fecommerce-data-warehouse-redshift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tuni56","download_url":"https://codeload.github.com/tuni56/ecommerce-data-warehouse-redshift/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuni56%2Fecommerce-data-warehouse-redshift/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29936853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-28T13:49:17.081Z","status":"ssl_error","status_checked_at":"2026-02-28T13:48:50.396Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","datawarehouse","iac-terraform","redshift","serverless"],"created_at":"2026-02-28T14:10:56.289Z","updated_at":"2026-02-28T14:10:56.877Z","avatar_url":"https://github.com/tuni56.png","language":"HCL","readme":"# Ecommerce Data Warehouse on Amazon Redshift 
Serverless\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n[![Terraform](https://img.shields.io/badge/IaC-Terraform-623CE4?logo=terraform)](infra/terraform)\n[![dbt](https://img.shields.io/badge/Transform-dbt-FF694B?logo=dbt)](analytics/dbt)\n\nA production-grade data warehouse implementation for ecommerce analytics, built on AWS using modern data engineering practices. This project demonstrates end-to-end data pipeline design, from raw ingestion to analytics-ready dimensional models.\n\n## Business Value\n\n**For C-Level:**\n- Single source of truth for ecommerce metrics (revenue, customer lifetime value, product performance)\n- Serverless architecture reduces operational overhead and scales automatically with demand\n- Cost-optimized design with usage-based pricing (compute billed per RPU-hour, not per provisioned node)\n- Historical tracking enables trend analysis and forecasting\n\n**For Technical Teams:**\n- Medallion architecture (Bronze → Silver → Gold) supports data quality and lineage\n- Infrastructure as Code enables reproducible deployments across environments\n- Incremental processing minimizes compute costs and latency\n- Star schema design optimized for BI tool performance\n\n## Architecture\n\nThe solution implements a modern lakehouse pattern with three distinct layers:\n\n```mermaid\ngraph LR\n    A[Source Systems\u003cbr/\u003eOLTP Databases] --\u003e|CDC/Batch Export| B[S3 Raw Zone\u003cbr/\u003eBronze Layer]\n    B --\u003e|AWS Glue ETL| C[S3 Staging Zone\u003cbr/\u003eSilver Layer]\n    C --\u003e|COPY/Incremental Load| D[Redshift Serverless\u003cbr/\u003eGold Layer]\n    D --\u003e|SQL Queries| E[BI Tools\u003cbr/\u003eQuickSight/Tableau]\n    D --\u003e|Ad-hoc Analysis| F[Data Analysts]\n    \n    style B fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#fff\n    style C fill:#ff9900,stroke:#232f3e,stroke-width:2px,color:#fff\n    style D fill:#8c4fff,stroke:#232f3e,stroke-width:2px,color:#fff\n    style E 
fill:#232f3e,stroke:#ff9900,stroke-width:2px,color:#fff\n```\n\n**Key Components:**\n- **Amazon S3**: Immutable raw data storage and staging layer\n- **AWS Glue**: Serverless ETL for data cleaning and standardization\n- **Amazon Redshift Serverless**: Analytics engine with star schema dimensional model\n- **dbt**: SQL-based transformations with built-in testing and documentation\n- **Terraform**: Infrastructure provisioning with modular, reusable components\n\n[Detailed architecture diagrams →](docs/architecture-diagram.md) | [Architecture documentation →](docs/architecture.md)\n\n## Data Model\n\nStar schema optimized for analytical queries:\n\n**Fact Table:**\n- `fact_order_items` - Grain: one row per order item\n\n**Dimensions:**\n- `dim_customers` - SCD Type 2 for historical tracking\n- `dim_products` - SCD Type 2 for price/attribute changes\n- `dim_date` - Standard date dimension for time-series analysis\n- `dim_payment_methods`\n- `dim_shipment_status`\n\nThis design prevents double-counting, enables flexible slicing/dicing, and maintains historical accuracy for point-in-time reporting.\n\n[Data model details →](docs/data_model.md)\n\n## Technology Stack\n\n| Layer | Technology | Purpose |\n|-------|-----------|---------|\n| Storage | Amazon S3 | Raw and staging data lake |\n| ETL | AWS Glue | Serverless data processing |\n| Warehouse | Redshift Serverless | Analytics engine |\n| Transformation | dbt Core | SQL-based modeling \u0026 testing |\n| IaC | Terraform | Infrastructure provisioning |\n| Orchestration | AWS Step Functions | Workflow coordination |\n| Monitoring | CloudWatch | Logging and alerting |\n\n## Project Structure\n\n```\n.\n├── analytics/\n│   ├── dbt/              # dbt models, tests, and documentation\n│   └── sql_examples/     # Sample analytical queries\n├── infra/\n│   ├── terraform/        # Infrastructure as Code\n│   │   ├── modules/      # Reusable Terraform modules\n│   │   └── environments/ # Environment-specific configs 
(dev/prod)\n│   └── diagrams/         # Architecture diagrams\n└── docs/                 # Technical documentation\n```\n\n## Quick Start\n\n### Prerequisites\n\n- AWS Account with appropriate IAM permissions\n- Terraform \u003e= 1.5.0\n- AWS CLI configured with credentials\n- dbt Core \u003e= 1.6.0\n- Python \u003e= 3.9\n\n### Deployment\n\n**1. Clone the repository**\n```bash\ngit clone https://github.com/\u003cyour-username\u003e/ecommerce-data-warehouse-redshift.git\ncd ecommerce-data-warehouse-redshift\n```\n\n**2. Initialize Terraform**\n```bash\ncd infra/terraform/environments/dev\nterraform init\n```\n\n**3. Review and apply infrastructure**\n```bash\nterraform plan\nterraform apply\n```\n\nThis provisions:\n- S3 buckets (raw, staging, logs)\n- Redshift Serverless namespace and workgroup\n- Glue jobs and crawlers\n- IAM roles and policies\n- VPC and security groups\n\n**4. Configure dbt**\n```bash\ncd analytics/dbt\ncp profiles.yml.example profiles.yml\n# Edit profiles.yml with your Redshift endpoint\n```\n\n**5. Run dbt transformations**\n```bash\ndbt deps\ndbt run --target dev\ndbt test\n```\n\n**6. 
Verify deployment**\n```bash\n# Submit a row-count query. The Data API call is asynchronous: it returns a statement Id,\n# and the count itself is read with `aws redshift-data get-statement-result --id \u003cId\u003e`.\naws redshift-data execute-statement \\\n  --workgroup-name ecommerce-dwh-dev \\\n  --database dev \\\n  --sql \"SELECT COUNT(*) FROM fact_order_items;\"\n```\n\n## Key Features\n\n### Infrastructure as Code\n- Modular Terraform design for reusability across environments\n- Separate state management per environment (dev/staging/prod)\n- Automated resource tagging for cost allocation\n\n### Data Quality\n- dbt tests for uniqueness, referential integrity, and not-null constraints\n- Row count reconciliation between layers\n- Freshness checks for SLA monitoring\n\n### Performance Optimization\n- Distribution keys on fact tables for co-located joins\n- Sort keys on date columns for time-series queries\n- Incremental models to minimize full table scans\n- Workload management (WLM) configuration for query prioritization\n\n### Cost Management\n- Redshift Serverless auto-scales based on workload\n- S3 lifecycle policies for archival to Glacier\n- Glue job bookmarks prevent reprocessing of already-ingested data\n- Development environment with reduced capacity\n\n## Sample Analytics Queries\n\n**Monthly Revenue Trend:**\n```sql\nSELECT \n    d.year_month,\n    SUM(f.total_amount) as revenue\nFROM fact_order_items f\nJOIN dim_date d ON f.order_date_key = d.date_key\nGROUP BY 1\nORDER BY 1;\n```\n\n**Top Products by Revenue:**\n```sql\nSELECT \n    p.product_name,\n    SUM(f.quantity) as units_sold,\n    SUM(f.total_amount) as revenue\nFROM fact_order_items f\nJOIN dim_products p ON f.product_key = p.product_key\nWHERE p.is_current = TRUE\nGROUP BY 1\nORDER BY 3 DESC\nLIMIT 10;\n```\n\n[More examples →](analytics/sql_examples/)\n\n## Design Decisions\n\nKey architectural choices and tradeoffs:\n\n- **Redshift Serverless vs Provisioned**: Chose serverless for automatic scaling and simplified operations. 
It suits variable workloads with unpredictable query patterns.\n- **Star Schema vs Data Vault**: Star schema prioritizes query simplicity and BI tool compatibility over modeling flexibility.\n- **SCD Type 2 for dimensions**: Enables point-in-time analysis at the cost of increased storage and join complexity.\n- **Glue vs EMR**: Glue's serverless model reduces operational burden for moderate data volumes (\u003c10TB).\n- **S3 as staging layer**: Decouples ingestion from transformation, enables reprocessing, and reduces Redshift storage costs.\n\n[Full decision log →](docs/decisions.md)\n\n## Monitoring \u0026 Observability\n\n- **CloudWatch Dashboards**: Query performance, RPU consumption, data freshness\n- **Glue Job Metrics**: Success rate, duration, DPU utilization\n- **dbt Test Results**: Data quality KPIs tracked over time\n- **Cost Alerts**: Budget thresholds for Redshift and Glue\n\n## Development Workflow\n\nThis project follows the GitFlow branching strategy:\n\n- `main` - Production-ready code\n- `develop` - Integration branch for features\n- `feature/*` - Individual feature branches\n- `hotfix/*` - Emergency production fixes\n\n**Contributing:**\n1. Create a feature branch from `develop`\n2. Implement changes with tests\n3. Submit a PR with a description of the changes\n4. Merge to `develop` after review\n5. 
Release to `main` when ready for production\n\n## Roadmap\n\n- [ ] CI/CD pipeline with GitHub Actions\n- [ ] Incremental dbt models for large fact tables\n- [ ] Real-time ingestion with Kinesis Data Firehose\n- [ ] ML integration for customer churn prediction\n- [ ] Cross-region disaster recovery\n\n## Cost Estimation\n\n**Development Environment** (monthly):\n- Redshift Serverless: ~$50-100 (8 RPU-hours/day)\n- S3 Storage: ~$5 (100GB)\n- Glue Jobs: ~$20 (daily runs)\n- **Total: ~$75-125/month**\n\n**Production Environment** (monthly, estimated):\n- Redshift Serverless: ~$500-1000 (depends on query load)\n- S3 Storage: ~$50 (1TB)\n- Glue Jobs: ~$100 (hourly incremental loads)\n- **Total: ~$650-1150/month**\n\nUse [AWS Pricing Calculator](https://calculator.aws) for detailed estimates based on your workload.\n\n## License\n\nMIT License - see [LICENSE](LICENSE) file for details.\n\n## Author\n\n**Rocio** - AWS Data Engineer  \n[GitHub](https://github.com/\u003cyour-username\u003e) | [LinkedIn](https://linkedin.com/in/\u003cyour-profile\u003e)\n\n---\n\n*This project is a portfolio demonstration and not affiliated with any commercial entity.*\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftuni56%2Fecommerce-data-warehouse-redshift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftuni56%2Fecommerce-data-warehouse-redshift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftuni56%2Fecommerce-data-warehouse-redshift/lists"}