https://github.com/willie-conway/ibm-data-engineering-capstone-project

End-to-end Data Engineering Capstone Project using MySQL, 🍃MongoDB, 🐘PostgreSQL, 💨Apache Airflow, ⚡️Apache Spark, and BI dashboards 📊🚀

airflow capstone-project dashboards data-engineering db2-warehouse etl ibm ibm-cognos-analytics mongodb mysql piplines postgresql spark sql sqlite3


# 🏗️ IBM Data Engineering Capstone Project




This capstone project showcases the practical application of key data engineering skills by simulating a real-world scenario in which I served as a Junior Data Engineer. I designed and implemented a scalable data analytics platform by working across various technologies in the data engineering lifecycle.

---

## 🚀 Project Overview

This capstone project simulates the role of a **Junior Data Engineer** tasked with designing and implementing an end-to-end **data analytics platform** using multiple data engineering tools and technologies.
It's the final course in the [IBM Data Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-engineer), combining all prior learning into one practical project.













---

## 🧠 What I Learned

- ✅ Design and build data platforms using OLTP & OLAP architectures
- ✅ Implement data pipelines with ETL processes using Python and Apache Airflow
- ✅ Query structured and semi-structured data using MySQL, PostgreSQL, and MongoDB
- ✅ Perform big data analytics and ML predictions using Apache Spark
- ✅ Visualize insights via dashboards in Google Looker Studio and IBM Cognos Analytics

---

## 🧰 Skills & Tools Used

- 🐍 Python & SQL
- 🐘 PostgreSQL | 🐬 MySQL | 🍃 MongoDB
- 🛠️ Apache Airflow
- ⚡ Apache Spark (MLlib)
- 📊 IBM Cognos Analytics | Google Looker Studio
- 🗃️ OLTP & Data Warehousing
- 🧱 ETL & Data Pipelines
- 🐧 Linux Shell Scripting
- 📂 JSON, CSV, .tar.gz, and data transformations

---

## 📦 Modules Breakdown

| Module | Description |
|--------|-------------|
| 📐 **1. Data Platform Architecture & OLTP** | Designed OLTP schemas & created MySQL databases |
| 🍃 **2. NoSQL with MongoDB** | Queried JSON documents and used MongoDB indexes |
| 🗄️ **3. Data Warehouse** | Built dimensional models & populated warehouse tables |
| 📈 **4. Data Analytics & Reporting** | Wrote complex SQL queries with `ROLLUP`, `CUBE`, and aggregations |
| 🔁 **5. ETL & Pipelines** | Built ETL flows with Python scripts and Apache Airflow DAGs |
| ⚡ **6. Big Data Analytics with Spark** | Trained and deployed ML models using Spark MLlib |
| ✅ **7. Final Submission** | Delivered final reports, dashboards, and peer-reviewed projects |

---

## 📊 Dashboard Samples

| Tool | Preview |
|------|---------|
| Google Looker Studio | ![Looker Dashboard](https://github.com/Willie-Conway/IBM-Data-Engineering-Capstone-Project/blob/a62d4dabe48342884ce4b6d77f8a95c8326ae09a/Data%20Engineering%20Capstone%20Project/CheatSheet/Images/E-Commerce_Sales_Dashboard_(2020).jpg) |
| IBM Cognos Analytics | ![Cognos Dashboard](https://github.com/Willie-Conway/IBM-Data-Engineering-Capstone/blob/c8d782d38e24a2ba26c01faf93f743b635cf07a7/Data%20Engineering%20Capstone%20Project/Labs/Dashboard%20Creation%20using%20IBM%20Cognos%20Analytics/Screenshots/E-commerce%20Sales%20Dashboard.jpg) |

---

## 📂 Project Assets

```

📁 OLTP Database Design
📁 NoSQL Queries & Exports
📁 Data Warehouse Scripts & CSVs
📁 Airflow DAGs & Python Scripts
📁 SparkML Model & Predictions
📁 Dashboards (Google Looker, Cognos)

```

## 📌 Key Skills Demonstrated

- 🗃️ Relational & NoSQL Database Design (MySQL, MongoDB)
- 🏗️ Data Warehouse Modeling and Querying (PostgreSQL, IBM Db2)
- 🔄 ETL Pipeline Development (Python, Shell, Apache Airflow)
- 🔥 Big Data Analytics with Apache Spark
- 📊 Data Visualization (Google Looker Studio, IBM Cognos Analytics)
- 🐧 Linux Shell Scripting
- 🧪 SQL queries using `ROLLUP`, `CUBE`, `GROUPING SETS`, and Materialized Query Tables (MQTs)

---

## 🧪 Capstone Modules & Labs Overview

### 📐 Module 1: Data Platform Architecture & OLTP
- Designed an OLTP schema and created MySQL tables.
- Imported and exported data using SQL and shell scripts.
- Defined primary keys and indexes for optimized access.
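
The steps above can be sketched in Python. This is a minimal illustration, not the project's actual schema: it uses the standard-library `sqlite3` module as a stand-in for MySQL, and the `sales_data` table, its columns, and the `idx_sales_sold_at` index name are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical OLTP table; SQLite stands in for MySQL here.
SCHEMA = """
CREATE TABLE sales_data (
    sale_id     INTEGER PRIMARY KEY,
    product_id  INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    price       REAL    NOT NULL,
    quantity    INTEGER NOT NULL,
    sold_at     TEXT    NOT NULL
);
CREATE INDEX idx_sales_sold_at ON sales_data (sold_at);
"""


def build_database(rows):
    """Create the table and index in memory, then bulk-insert the rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO sales_data VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def export_csv(conn):
    """Export the whole table as CSV text, header row first."""
    cursor = conn.execute("SELECT * FROM sales_data ORDER BY sale_id")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Made-up sample rows, only to exercise the import and export paths.
sample_rows = [
    (1, 101, 9001, 19.99, 2, "2020-09-05 16:20:03"),
    (2, 102, 9002, 5.50, 1, "2020-09-05 16:21:47"),
]
conn = build_database(sample_rows)
csv_text = export_csv(conn)
print(csv_text)
```

The index on `sold_at` is there to show the pattern of indexing a column that queries filter on; which columns deserve an index depends on the real workload.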

### 🍃 Module 2: Querying Data in NoSQL (MongoDB)
- Loaded product catalog data into MongoDB.
- Performed filter queries and aggregation pipelines.
- Exported collections using `mongoexport`.
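
A filter query and an aggregation pipeline of the kind listed above might look like the following in Python. This is only a sketch: the field names `type` and `screen_size` are assumptions about the catalog documents, and the helper `summarize_catalog` is hypothetical. It expects a `pymongo` collection object and relies on the `count_documents` and `aggregate` methods of that class.

```python
# Field names ("type", "screen_size") are hypothetical; adjust to the real catalog.

# Filter query: laptops with a screen of at least 15 inches.
LAPTOP_FILTER = {"type": "laptop", "screen_size": {"$gte": 15}}

# Aggregation pipeline: average screen size per product type, largest first.
AVG_SCREEN_PIPELINE = [
    {
        "$group": {
            "_id": "$type",
            "avg_screen": {"$avg": "$screen_size"},
            "count": {"$sum": 1},
        }
    },
    {"$sort": {"avg_screen": -1}},
]


def summarize_catalog(collection):
    """Run both queries against a pymongo-style collection object."""
    return {
        "large_laptops": collection.count_documents(LAPTOP_FILTER),
        "by_type": list(collection.aggregate(AVG_SCREEN_PIPELINE)),
    }
```

Keeping the filter and pipeline as plain dictionaries makes them easy to inspect and to reuse in the MongoDB shell, since the same documents are valid there.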

### 🏗️ Module 3: Building a Data Warehouse
- Created star schema with dimensions and fact tables in PostgreSQL.
- Imported e-commerce sales data.
- Performed OLAP queries with `CUBE`, `ROLLUP`, and `GROUPING SETS`.
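
To illustrate the star schema and a `ROLLUP`-style query, here is a small Python sketch. It makes several assumptions: SQLite (via `sqlite3`) stands in for PostgreSQL, the table and column names are hypothetical, and the sample rows are made up. SQLite has no `ROLLUP`, so the sketch emulates it with `UNION ALL` over each grouping level; the PostgreSQL form appears in a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical star schema: two dimensions and one fact table.
CREATE TABLE dim_country  (country_id  INTEGER PRIMARY KEY, country  TEXT);
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    country_id  INTEGER REFERENCES dim_country (country_id),
    category_id INTEGER REFERENCES dim_category (category_id),
    amount      INTEGER
);
INSERT INTO dim_country  VALUES (1, 'Canada'), (2, 'Mexico');
INSERT INTO dim_category VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 100), (2, 1, 2, 50), (3, 2, 1, 70), (4, 2, 2, 30);
""")

# In PostgreSQL the three SELECTs below collapse into a single query ending in
#   GROUP BY ROLLUP (country, category)
# SQLite has no ROLLUP, so each grouping level is computed and stacked by hand.
ROLLUP_EMULATION = """
WITH joined AS (
    SELECT c.country, g.category, f.amount
    FROM fact_sales f
    JOIN dim_country  c ON c.country_id  = f.country_id
    JOIN dim_category g ON g.category_id = f.category_id
)
SELECT country, category, SUM(amount) FROM joined GROUP BY country, category
UNION ALL
SELECT country, NULL, SUM(amount) FROM joined GROUP BY country
UNION ALL
SELECT NULL, NULL, SUM(amount) FROM joined
"""

rollup_rows = conn.execute(ROLLUP_EMULATION).fetchall()
for row in rollup_rows:
    print(row)
```

Rows with `None` in the `category` position are per-country subtotals, and the row with `None` in both positions is the grand total, which mirrors how `ROLLUP` marks its summary rows with `NULL`.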

### 📈 Module 4: Data Analytics
- Wrote analytical SQL queries to uncover trends in sales data.
- Used Materialized Query Tables to improve performance.
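
The idea behind a Materialized Query Table can be sketched as a precomputed summary table that is rebuilt on demand. The example below is an emulation under stated assumptions, not Db2 syntax: it uses SQLite through Python's `sqlite3` module, and the `sales` and `sales_summary` tables and the `refresh_sales_summary` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (country TEXT, year INTEGER, amount INTEGER);
INSERT INTO sales VALUES
    ('Canada', 2020, 100), ('Canada', 2020, 50), ('Mexico', 2020, 70);
""")


def refresh_sales_summary(conn):
    """Rebuild the precomputed summary, similar in spirit to refreshing an MQT."""
    conn.executescript("""
    DROP TABLE IF EXISTS sales_summary;
    CREATE TABLE sales_summary AS
        SELECT country, year, SUM(amount) AS total_sales
        FROM sales
        GROUP BY country, year;
    """)


refresh_sales_summary(conn)

# Reports read the small summary table instead of re-aggregating the base rows.
summary = conn.execute(
    "SELECT country, total_sales FROM sales_summary ORDER BY country"
).fetchall()
print(summary)
```

The trade-off is the usual one for precomputed results: reads get cheaper, but the summary is only as fresh as the last refresh.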

### 🔁 Module 5: ETL & Data Pipelines
- Wrote Python scripts for extract, transform, and load processes.
- Automated the pipeline using Apache Airflow DAGs.
- Processed and cleaned web logs into structured format.
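
The extract, transform, and load steps for the web-log task can be sketched in plain Python, without the Airflow scheduling layer. The log format (an Apache-style access log), the chosen output fields, the rule of keeping only 2xx responses, and all function names are assumptions for illustration.

```python
import csv
import io
import re

# Assumed Apache-style access log line; the project's real format may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)
FIELDS = ["ip", "path", "status", "size"]


def extract(lines):
    """Parse raw lines into dicts, skipping lines that do not match."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


def transform(records):
    """Cast numeric fields and keep only successful (2xx) requests."""
    for record in records:
        status = int(record["status"])
        if 200 <= status < 300:
            size = 0 if record["size"] == "-" else int(record["size"])
            yield {"ip": record["ip"], "path": record["path"],
                   "status": status, "size": size}


def load(records, out_file):
    """Write the cleaned records as CSV and return the row count."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count


# Made-up sample input: one good request, one 404, one malformed line.
raw_lines = [
    '198.51.100.7 - - [05/Sep/2020:16:20:03 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '198.51.100.8 - - [05/Sep/2020:16:21:47 +0000] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]
output = io.StringIO()
rows_loaded = load(transform(extract(raw_lines)), output)
print(rows_loaded)
print(output.getvalue())
```

Writing each stage as a generator keeps memory use flat on large log files, and keeping the stages as separate functions makes each one easy to test on its own.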

### ⚡ Module 6: Big Data Analytics with Apache Spark
- Used Spark to load and transform product review data.
- Built a machine learning model using Spark MLlib.
- Saved and reloaded the trained model for prediction tasks.

### 📊 Module 7: Dashboards & Final Submission
- Built sales dashboards using:
  - **Google Looker Studio**: Interactive charts, filters, KPIs.
  - **IBM Cognos Analytics**: Custom visualizations and report generation.
- Submitted final project artifacts for peer review.

---

## 🧠 Summary

This project helped solidify my knowledge of:
- Building data infrastructure from the ground up
- Managing both structured and semi-structured data
- Automating and scaling data workflows
- Communicating data insights through visual tools

---

## 🏁 Outcome

- ✅ **Proficiency in end-to-end data engineering workflows**
- ✅ **Prepared for real-world junior-level data engineering roles**

---

## 🧠 Reflections

This project was a culmination of weeks of learning and hands-on practice. I strengthened my data engineering foundations and became confident in building real-world data solutions end-to-end. 🧩💡

---

## 💼 Ideal For

- Hiring managers evaluating full-stack data engineers
- Recruiters seeking professionals skilled in data architecture, pipelines, and analytics
- Anyone interested in practical data engineering workflows

---

## 🔗 Looker Dashboards

- [Loyalty & Sales Performance Dashboard](https://lookerstudio.google.com/s/igUfnRY4S6M)
- [E-Commerce_Sales_Dashboard_(2020)](https://lookerstudio.google.com/s/nBv_zBtawc4)
- [Simple_Dashboard](https://lookerstudio.google.com/s/lz8wjFuj5Nw)
- [Community_Property_Revenue_&_Loyalty_Sales_Dashboard](https://lookerstudio.google.com/s/ielt85KR3Sw)
- [Sales & Service Dashboard](https://lookerstudio.google.com/s/mMf7a_kwdAo)

---

## 🏁 Let's Connect!

If you're interested in my other data projects or collaborations:
🌐 [My Portfolio](#) | 💼 [LinkedIn](#) | 📂 [GitHub Projects](#)