https://github.com/stevehoober254/dataengineer-portfolio
📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing
https://github.com/stevehoober254/dataengineer-portfolio
airflow analytics big-data dagster data-engineering data-lake data-pipelines etl python spark
Last synced: 6 months ago
JSON representation
📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing
- Host: GitHub
- URL: https://github.com/stevehoober254/dataengineer-portfolio
- Owner: stevehoober254
- Created: 2025-04-10T13:48:31.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-04-10T14:37:49.000Z (6 months ago)
- Last Synced: 2025-04-10T15:51:51.377Z (6 months ago)
- Topics: airflow, analytics, big-data, dagster, data-engineering, data-lake, data-pipelines, etl, python, spark
- Homepage:
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# 📊 Data Engineer Portfolio
A practical portfolio of data engineering pipelines, orchestrated DAGs, and analytics notebooks. These projects demonstrate end-to-end ETL processes, real-time ingestion, data lake design, and Python-based transformations.
## 📌 Highlights
- Apache Airflow DAG orchestration
- Batch and streaming ETL pipelines
- Python & Pandas-based data wrangling
- Data validation and unit testing
- Jupyter notebooks with visual insights## Project List
## 1. Smart Grid IoT Data Pipeline
### Problem
Power companies in emerging markets struggle to track real-time grid performance.### Solution
Build an end-to-end pipeline that:
- Ingests data from simulated smart meters via **AWS Kinesis**
- Transforms with **AWS Glue** + **Apache Hudi**
- Loads into **Redshift**
- Visualized in **Amazon QuickSight**### Goals
- Stream real-time energy usage
- Aggregate usage by time, region, household
- Detect anomalies and outages---
## 2. Kenya Open Data Explorer
### Problem
Government data is available but not easily analyzable for citizens or journalists.### Solution
Create a public analytics dashboard:
- ETL pipelines in **Apache Airflow**
- Cleaned datasets in **BigQuery**
- Visualizations in **Metabase**
- Public search and filter frontend using **Next.js**### Goals
- Process and publish monthly updated datasets
- Make visual data stories (health, education, environment)
- Enable CSV downloads and API access---
## 3. Political Sentiment Analysis Pipeline
### Problem
Election stakeholders need real-time sentiment insights from social media.### Solution
Stream political tweets and comments:
- **Kafka** or **Kinesis Firehose** for ingestion
- **Spark Structured Streaming** for processing
- **S3** + **PrestoDB** for storage and querying
- Dashboard built with **Apache Superset**### Goals
- Classify sentiments: positive, neutral, negative
- Track by politician, region, or hashtag
- Show trending concerns or hate speech