https://github.com/royungar/datawarehouse_waste_management_project

PostgreSQL-based data warehouse project for a Brazilian solid waste management company. Uses a star schema, advanced SQL (GROUPING SETS, ROLLUP, CUBE), and materialized views for multidimensional reporting.
https://github.com/royungar/datawarehouse_waste_management_project

analytical-queries data-analytics data-engineering data-modeling data-warehouse etl ibm pgadmin postgresql sql star-schema

Last synced: about 2 months ago
JSON representation

Host: GitHub
URL: https://github.com/royungar/datawarehouse_waste_management_project
Owner: royungar
Created: 2025-07-23T17:34:48.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-28T00:53:36.000Z (12 months ago)
Last Synced: 2025-07-28T02:35:07.001Z (12 months ago)
Topics: analytical-queries, data-analytics, data-engineering, data-modeling, data-warehouse, etl, ibm, pgadmin, postgresql, sql, star-schema
Homepage: https://www.linkedin.com/in/royungar/
Size: 1.77 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Data Warehouse – Solid Waste Management Project – IBM Data Engineering Professional Certificate

## Overview

This project demonstrates the design and implementation of a data warehouse for a Brazilian solid waste management company.
This warehouse supports waste collection performance analysis across time, geography, and vehicle dimensions.
The project models operational waste collection data using a star schema in PostgreSQL, enabling analytical queries across time, location, and vehicle dimensions.
It was completed as the final project for **Course 9 – Data Warehouse Fundamentals** in the [IBM Data Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-engineer).

---

## Objectives

- Design a star schema optimized for analytical queries
- Create dimension and fact tables in PostgreSQL using pgAdmin
- Load operational data from CSV files into the database
- Write analytical SQL queries using `GROUPING SETS`, `ROLLUP`, and `CUBE`
- Create a materialized view to improve query performance

---

## Tools & Technologies

| Category | Tools/Technologies |
| ---------------- | ------------------- |
| **Database** | PostgreSQL, pgAdmin |
| **Languages** | SQL |
| **File Formats** | CSV |

---

## Data Warehouse Schema

The schema uses a star design with one fact table and four dimension tables:

### Dimension Tables

- **DimDate**: Stores date breakdowns (year, month, day, weekday, etc.)
- **DimTruck**: Stores truck identifiers and types
- **DimStation**: Stores collection station IDs and associated cities
- **DimWaste**: Stores waste category types

### Fact Table

- **FactTrips**: Captures waste collection trips with date, station, truck, and amount of waste collected

---

## Data Files

All CSVs were provided through IBM Skills Network:

- `DimDate.csv`
- `DimTruck.csv`
- `DimStation.csv`
- `FactTrips.csv`

---

## Implementation Steps

### 1. Database Setup

- Created a PostgreSQL database in pgAdmin
- Defined tables with `PRIMARY KEY` and `FOREIGN KEY` constraints to maintain referential integrity

### 2. Schema Creation

- Created `DimDate`, `DimTruck`, `DimStation`, and `DimWaste` tables using SQL `CREATE TABLE` statements
- Defined `FactTrips` with foreign keys referencing each dimension

### 3. Data Loading

- Imported all CSV files using pgAdmin's Import feature
- Validated successful data loads and relationships

### 4. Aggregation Queries

Used SQL aggregation clauses for multidimensional reporting:

#### GROUPING SETS

- Summarized total waste collected by combinations of station and truck type
- Enables flexible reporting across one or multiple dimensions

#### ROLLUP

- Produced hierarchical aggregates (year > city > station)
- Supports drill-down reports by time and geography

#### CUBE

- Generated average waste metrics across all combinations of year, city, and station
- Facilitates full cross-dimensional waste analysis

### 5. Materialized View

- Created `max_waste_stats` to precompute the maximum waste collected per city, station, and truck type
- The view enables faster reporting on high-volume collection combinations

---

## Results & Benchmarks

- **Warehouse query performance:** Reduced “maximum waste collected per city/station/truck” query time from ~73.8 ms to ~1 ms by using a PostgreSQL materialized view (**~98.6% faster**).
- **Warehouse scale (repo data):** 106,400 fact rows; 350 dates; 71 trucks; 19 stations.
- **Repro notes:** Times measured locally with `\timing on` in psql; materialized view defined as `max_waste_stats`.

---

## Sample Reports Enabled by Warehouse

- Total waste collected per truck type and station
- City-wide waste collection trends across time
- Highest volume truck/station combinations
- Average waste collected per station

---

## Repository Structure

```plaintext
DataWarehouse_Waste_Management_Project/
├── README.md # Project overview, schema, SQL tasks, and usage
├── data/
│ ├── DimDate.csv # Calendar and weekday data
│ ├── DimTruck.csv # Truck type identifiers
│ ├── DimStation.csv # Station and city mapping
│ └── FactTrips.csv # Waste collection fact data
├── images/ # Screenshots showing schema, queries, and output
│ ├── schema_erd.png # ERD showing all 5 tables and relationships
│ ├── grouping_sets_query.png # GROUPING SETS query output
│ ├── rollup_query.png # ROLLUP query output
│ ├── cube_query.png # CUBE query output
│ └── mview_creation.png # Materialized view creation confirmation
├── sql/
│ ├── schema/ # CREATE TABLE scripts for all dimension and fact tables
│ │ ├── DimDate.sql # Creates DimDate table
│ │ ├── DimStation.sql # Creates DimStation table
│ │ ├── DimTruck.sql # Creates DimTruck table
│ │ ├── DimWaste.sql # Creates DimWaste table
│ │ └── FactTrips.sql # Creates FactTrips table with foreign keys
│ ├── aggregation_queries/ # GROUPING SETS, ROLLUP, and CUBE queries
│ │ ├── grouping_sets.sql # Aggregates waste by station ID and truck type
│ │ ├── rollup.sql # Summarizes waste hierarchically by year, city, and station
│ │ └── cube.sql # Computes average waste across all dimensions
│ └── materialized_views/ # Materialized view creation script
│ └── max_waste_stats.sql # Computes max waste by city, station, and truck
```

---

## License

This project was completed as part of the IBM Data Engineering Professional Certificate and is intended for educational use.

## Links

- Course Page - [Data Warehouse Fundamentals](https://www.coursera.org/learn/data-warehouse-fundamentals)
- [GitHub Profile](https://github.com/royungar)
- [GitHub Repository](https://github.com/royungar/DataWarehouse_Waste_Management_Project)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/royungar/datawarehouse_waste_management_project

Awesome Lists containing this project

README