https://github.com/willie-conway/ibm-data-engineering-capstone-project
End-to-end Data Engineering Capstone Project using MySQL, MongoDB, PostgreSQL, Apache Airflow, Apache Spark, and BI dashboards
airflow capstone-project dashboards data-engineering db2-warehouse etl ibm ibm-cognos-analytics mongodb mysql piplines postgresql spark sql sqlite3
- Host: GitHub
- URL: https://github.com/willie-conway/ibm-data-engineering-capstone-project
- Owner: Willie-Conway
- License: mit
- Created: 2025-06-14T18:59:49.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-26T06:17:39.000Z (12 months ago)
- Last Synced: 2025-06-30T01:42:40.854Z (12 months ago)
- Topics: airflow, capstone-project, dashboards, data-engineering, db2-warehouse, etl, ibm, ibm-cognos-analytics, mongodb, mysql, piplines, postgresql, spark, sql, sqlite3
- Language: Jupyter Notebook
- Homepage: https://developers.google.com/profile/u/109845255803256255656
- Size: 25.4 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# IBM Data Engineering Capstone Project
This capstone project showcases the practical application of key data engineering skills by simulating a real-world scenario in which I served as a Junior Data Engineer. I designed and implemented a scalable data analytics platform by working across various technologies in the data engineering lifecycle.
---
## Project Overview
This capstone project simulates the role of a **Junior Data Engineer** tasked with designing and implementing an end-to-end **data analytics platform** using multiple data engineering tools and technologies.
It's the final course in the [IBM Data Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-engineer), combining all prior learning into one practical project.
---
## What I Learned
- Design and build data platforms using OLTP & OLAP architectures
- Implement data pipelines with ETL processes using Python and Apache Airflow
- Query structured and unstructured data using MySQL, PostgreSQL, and MongoDB
- Perform big data analytics and ML predictions using Apache Spark
- Visualize insights via dashboards in Google Looker Studio and IBM Cognos Analytics
---
## Skills & Tools Used
- Python & SQL
- PostgreSQL | MySQL | MongoDB
- Apache Airflow
- Apache Spark (MLlib)
- IBM Cognos Analytics | Google Looker Studio
- OLTP & Data Warehousing
- ETL & Data Pipelines
- Linux Shell Scripting
- JSON, CSV, .tar.gz, and data transformations
---
## Modules Breakdown
| Module | Description |
|--------|-------------|
| **1. Data Platform Architecture & OLTP** | Designed OLTP schemas & created MySQL databases |
| **2. NoSQL with MongoDB** | Queried JSON documents and used MongoDB indexes |
| **3. Data Warehouse** | Built dimensional models & populated warehouse tables |
| **4. Data Analytics & Reporting** | Wrote complex SQL queries with `ROLLUP`, `CUBE`, and aggregations |
| **5. ETL & Pipelines** | Built ETL flows with Python scripts and Apache Airflow DAGs |
| **6. Big Data Analytics with Spark** | Trained and deployed ML models using Spark MLlib |
| **7. Final Submission** | Delivered final reports, dashboards, and peer-reviewed projects |
---
## Dashboard Samples
| Tool | Preview |
|------|---------|
| Google Looker Studio | Screenshot (image not preserved) |
| IBM Cognos Analytics | Screenshot (image not preserved) |
---
## Project Assets
```
OLTP Database Design
NoSQL Queries & Exports
Data Warehouse Scripts & CSVs
Airflow DAGs & Python Scripts
SparkML Model & Predictions
Dashboards (Google Looker, Cognos)
```
## Key Skills Demonstrated
- Relational & NoSQL Database Design (MySQL, MongoDB)
- Data Warehouse Modeling and Querying (PostgreSQL, IBM Db2)
- ETL Pipeline Development (Python, Shell, Apache Airflow)
- Big Data Analytics with Apache Spark
- Data Visualization (Google Looker Studio, IBM Cognos Analytics)
- Linux Shell Scripting
- SQL queries using `ROLLUP`, `CUBE`, `GROUPING SETS`, and Materialized Query Tables (MQTs)
---
## Capstone Modules & Labs Overview
### Module 1: Data Platform Architecture & OLTP
- Designed an OLTP schema and created MySQL tables.
- Imported and exported data using SQL and shell scripts.
- Defined primary keys and indexes for optimized access.
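A minimal sketch of this kind of OLTP table, using Python's built-in `sqlite3` in place of MySQL so it runs without a server. The table name, column names, and sample rows are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical OLTP table; names and columns are assumptions for illustration.
SCHEMA = """
CREATE TABLE sales_data (
    sale_id     INTEGER PRIMARY KEY,
    product_id  INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    price       REAL    NOT NULL,
    quantity    INTEGER NOT NULL,
    sold_at     TEXT    NOT NULL
);
-- Secondary index to speed up lookups by time of sale.
CREATE INDEX idx_sales_sold_at ON sales_data (sold_at);
"""


def build_oltp_db() -> sqlite3.Connection:
    """Create an in-memory database, apply the schema, and load sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        (1, 101, 9001, 19.99, 2, "2020-09-05 10:15:00"),
        (2, 102, 9002, 5.50, 1, "2020-09-05 11:30:00"),
        (3, 101, 9003, 19.99, 1, "2020-09-06 09:00:00"),
    ]
    conn.executemany("INSERT INTO sales_data VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = build_oltp_db()
row_count = conn.execute("SELECT COUNT(*) FROM sales_data").fetchone()[0]
# PRAGMA index_list returns one row per index; column 1 is the index name.
index_names = [r[1] for r in conn.execute("PRAGMA index_list('sales_data')")]
print(row_count, index_names)
```

The same `CREATE TABLE` and `CREATE INDEX` statements carry over to MySQL with minor type changes (for example `DECIMAL` for prices and `TIMESTAMP` for the sale time).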
### Module 2: Querying Data in NoSQL (MongoDB)
- Loaded product catalog data into MongoDB.
- Performed filter queries and aggregation pipelines.
- Exported collections using `mongoexport`.
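The filter and aggregation steps can be sketched as follows. The filter and pipeline documents use MongoDB's real query syntax (as they would be passed to `collection.find(...)` and `collection.aggregate(...)` in PyMongo), but the catalog documents and field names are assumptions, and the two helper functions are plain-Python stand-ins so the example runs without a MongoDB server.

```python
from collections import defaultdict

# Hypothetical catalog documents; field names are assumptions for illustration.
catalog = [
    {"type": "laptop", "model": "A1", "screen_size": 14},
    {"type": "laptop", "model": "A2", "screen_size": 16},
    {"type": "smart phone", "model": "P1", "screen_size": 6},
    {"type": "smart phone", "model": "P2", "screen_size": 5},
]

# Filter document, as it would be passed to collection.find(...).
laptop_filter = {"type": "laptop"}

# Aggregation pipeline, as it would be passed to collection.aggregate(...).
avg_screen_pipeline = [
    {"$match": {"type": "smart phone"}},
    {"$group": {"_id": "$type", "avg_screen": {"$avg": "$screen_size"}}},
]


def find(docs, query):
    """Plain-Python stand-in for an equality-only find()."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]


def average_by(docs, match, group_field, value_field):
    """Plain-Python stand-in for the $match + $group/$avg pipeline above."""
    values = defaultdict(list)
    for doc in find(docs, match):
        values[doc[group_field]].append(doc[value_field])
    return {key: sum(vals) / len(vals) for key, vals in values.items()}


laptops = find(catalog, laptop_filter)
avg_screen = average_by(catalog, {"type": "smart phone"}, "type", "screen_size")
print(len(laptops), avg_screen)
```

Against a live database, the helpers would be replaced by the driver calls, and the results could then be exported with `mongoexport`.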
### Module 3: Building a Data Warehouse
- Created star schema with dimensions and fact tables in PostgreSQL.
- Imported e-commerce sales data.
- Performed OLAP queries with `CUBE`, `ROLLUP`, and `GROUPING SETS`.
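A small sketch of a star schema and a rollup-style query. It uses `sqlite3` so it runs locally; SQLite has no `ROLLUP`, so the query spells out the same result with `UNION ALL`. In PostgreSQL the three branches collapse into a single `GROUP BY ROLLUP (country, category)`. Table names, columns, and rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two dimension tables and one fact table (hypothetical names and data).
conn.executescript("""
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    country_id  INTEGER REFERENCES dim_country (country_id),
    category_id INTEGER REFERENCES dim_category (category_id),
    amount      REAL
);
INSERT INTO dim_country VALUES (1, 'Canada'), (2, 'Brazil');
INSERT INTO dim_category VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 10.0), (2, 1, 2, 20.0), (3, 2, 1, 30.0), (4, 2, 1, 40.0);
""")

# Detail rows, per-country subtotals, and a grand total:
# the same rows GROUP BY ROLLUP (country, category) would return.
ROLLUP_SQL = """
SELECT c.country, g.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_country c ON c.country_id = f.country_id
JOIN dim_category g ON g.category_id = f.category_id
GROUP BY c.country, g.category
UNION ALL
SELECT c.country, NULL, SUM(f.amount)
FROM fact_sales f
JOIN dim_country c ON c.country_id = f.country_id
GROUP BY c.country
UNION ALL
SELECT NULL, NULL, SUM(f.amount) FROM fact_sales f
"""

rows = conn.execute(ROLLUP_SQL).fetchall()
for row in rows:
    print(row)
```

Rows with `None` in the category column are per-country subtotals, and the row with `None` in both columns is the grand total.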
### Module 4: Data Analytics
- Wrote analytical SQL queries to uncover trends in sales data.
- Used Materialized Query Tables to improve performance.
### Module 5: ETL & Data Pipelines
- Wrote Python scripts for extract, transform, and load processes.
- Automated the pipeline using Apache Airflow DAGs.
- Processed and cleaned web logs into structured format.
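The extract, transform, and load steps for web logs can be sketched as three small functions. This assumes the logs are in Common Log Format and that the output is CSV; the sample lines, field names, and function names are hypothetical. In Airflow, each function could be wrapped in its own task (for example with `PythonOperator`) and chained in a DAG.

```python
import csv
import io
import re

# Common Log Format pattern (assumed input format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)
FIELDS = ["ip", "timestamp", "method", "path", "status", "size"]

SAMPLE_LOG = """\
198.51.100.7 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
not a valid log line
203.0.113.9 - - [10/Oct/2020:13:56:01 +0000] "POST /cart HTTP/1.1" 404 -
"""


def extract(raw_text):
    """Split raw log text into non-empty lines."""
    return [line for line in raw_text.splitlines() if line.strip()]


def transform(lines):
    """Parse each line into a typed record, dropping malformed lines."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected format
        record = match.groupdict()
        record["status"] = int(record["status"])
        record["size"] = 0 if record["size"] == "-" else int(record["size"])
        records.append(record)
    return records


def load(records, out_file):
    """Write the records as CSV with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO()
records = transform(extract(SAMPLE_LOG))
load(records, buffer)
print(buffer.getvalue())
```

Keeping the three stages as separate functions makes each one easy to test on its own and to schedule as an independent task.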
### Module 6: Big Data Analytics with Apache Spark
- Used Spark to load and transform product review data.
- Built a machine learning model using Spark MLlib.
- Saved and reloaded the trained model for prediction tasks.
### Module 7: Dashboards & Final Submission
- Built sales dashboards using:
  - **Google Looker Studio**: Interactive charts, filters, KPIs.
  - **IBM Cognos Analytics**: Custom visualizations and report generation.
- Submitted final project artifacts for peer review.
---
## Summary
This project helped solidify my knowledge of:
- Building data infrastructure from the ground up
- Managing both structured and semi-structured data
- Automating and scaling data workflows
- Communicating data insights through visual tools
---
## Outcome
- **Proficiency in end-to-end data engineering workflows**
- **Prepared for real-world junior-level data engineering roles**
---
## Reflections
This project was a culmination of weeks of learning and hands-on practice. I strengthened my data engineering foundations and became confident in building real-world data solutions end-to-end.
---
## Ideal For
- Hiring managers evaluating full-stack data engineers
- Recruiters seeking professionals skilled in data architecture, pipelines, and analytics
- Anyone interested in practical data engineering workflows
---
## Looker Dashboards
- [Loyalty & Sales Performance Dashboard](https://lookerstudio.google.com/s/igUfnRY4S6M)
- [E-Commerce_Sales_Dashboard_(2020)](https://lookerstudio.google.com/s/nBv_zBtawc4)
- [Simple_Dashboard](https://lookerstudio.google.com/s/lz8wjFuj5Nw)
- [Community_Property_Revenue_&_Loyalty_Sales_Dashboard](https://lookerstudio.google.com/s/ielt85KR3Sw)
- [Sales & Service Dashboard](https://lookerstudio.google.com/s/mMf7a_kwdAo)
---
## Let's Connect!
If you're interested in my other data projects or collaborations:
[My Portfolio](#) | [LinkedIn](#) | [GitHub Projects](#)