https://github.com/stevehoober254/dataengineer-portfolio
📊 End-to-end ETL pipelines, Airflow DAGs, notebook-driven analytics & data warehousing
airflow analytics big-data dagster data-engineering data-lake data-pipelines etl python spark
- Host: GitHub
- URL: https://github.com/stevehoober254/dataengineer-portfolio
- Owner: stevehoober254
- Created: 2025-04-10T13:48:31.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-10T19:55:02.000Z (about 1 year ago)
- Last Synced: 2025-10-09T02:32:44.882Z (8 months ago)
- Topics: airflow, analytics, big-data, dagster, data-engineering, data-lake, data-pipelines, etl, python, spark
- Homepage:
- Size: 6.84 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
README
# 📊 Data Engineer Portfolio
A practical portfolio of data engineering pipelines, orchestrated DAGs, and analytics notebooks. These projects demonstrate end-to-end ETL processes, real-time ingestion, data lake design, and Python-based transformations.
## 📌 Highlights
- Apache Airflow DAG orchestration
- Batch and streaming ETL pipelines
- Python & Pandas-based data wrangling
- Data validation and unit testing
- Jupyter notebooks with visual insights
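
As an illustration of the data validation and unit testing highlight, here is a minimal sketch of a row-level validator. The schema (`id`, `ts`, `value`) and the names `validate_rows` and `ValidationResult` are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ValidationResult:
    valid: list
    rejected: list  # (row, reason) pairs


def _check(row):
    # Hypothetical schema: id (non-empty), ts (ISO 8601), value (non-negative number).
    if not row.get("id"):
        return "missing id"
    try:
        datetime.fromisoformat(row.get("ts", ""))
    except (TypeError, ValueError):
        return "bad timestamp"
    value = row.get("value")
    if not isinstance(value, (int, float)) or value < 0:
        return "bad value"
    return None


def validate_rows(rows):
    """Split rows into valid records and rejected (row, reason) pairs."""
    result = ValidationResult(valid=[], rejected=[])
    for row in rows:
        reason = _check(row)
        if reason is None:
            result.valid.append(row)
        else:
            result.rejected.append((row, reason))
    return result
```

Keeping rejected rows together with a reason, rather than dropping them silently, makes the check easy to assert on in a unit test and easy to audit later.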
## Project List
## 1. Smart Grid IoT Data Pipeline
### Problem
Power companies in emerging markets struggle to track real-time grid performance.
### Solution
Build an end-to-end pipeline that:
- Ingests data from simulated smart meters via **Amazon Kinesis**
- Transforms it with **AWS Glue** and **Apache Hudi**
- Loads it into **Amazon Redshift**
- Visualizes it in **Amazon QuickSight**
### Goals
- Stream real-time energy usage
- Aggregate usage by time, region, household
- Detect anomalies and outages
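
The aggregation and anomaly-detection goals above can be sketched in plain Python. This is an illustration under assumptions, not the repository's implementation: the reading fields (`ts`, `region`, `household`, `kwh`), the function names, and the z-score rule with a threshold of 3 are all hypothetical, and a production version would run in the streaming or warehouse layer rather than in memory.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def aggregate_hourly(readings):
    """Sum kWh per (region, household, hour) from an iterable of reading dicts."""
    totals = defaultdict(float)
    for r in readings:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(r["ts"]).replace(minute=0, second=0, microsecond=0)
        totals[(r["region"], r["household"], hour)] += r["kwh"]
    return dict(totals)


def flag_anomalies(values, threshold=3.0):
    """Return indices whose z-score exceeds the threshold (assumed rule)."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All values identical: nothing stands out.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
```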
---
## 2. Kenya Open Data Explorer
### Problem
Government data is available but not easily analyzable for citizens or journalists.
### Solution
Create a public analytics dashboard:
- ETL pipelines in **Apache Airflow**
- Cleaned datasets in **BigQuery**
- Visualizations in **Metabase**
- Public search and filter frontend using **Next.js**
### Goals
- Process and publish datasets that are updated monthly
- Publish visual data stories on health, education, and environment
- Enable CSV downloads and API access
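
A minimal sketch of the cleaning and CSV-download steps, using only the standard library. The column names in the usage comment and the helpers `clean_record` and `to_csv` are hypothetical; the actual pipeline is described as running in Apache Airflow with cleaned data stored in BigQuery, which this sketch does not attempt to reproduce.

```python
import csv
import io


def clean_record(raw):
    """Normalize header names and strip whitespace; empty strings become None."""
    cleaned = {}
    for key, value in raw.items():
        name = key.strip().lower().replace(" ", "_")
        text = value.strip() if isinstance(value, str) else value
        cleaned[name] = text if text != "" else None
    return cleaned


def to_csv(records, fieldnames):
    """Serialize cleaned records to CSV text, e.g. for a download endpoint."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # None values are written as empty cells
    return buffer.getvalue()


# Hypothetical usage:
# to_csv([clean_record({" Region ": " A ", "Year": "2024"})], ["region", "year"])
```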
---
## 3. Political Sentiment Analysis Pipeline
### Problem
Election stakeholders need real-time sentiment insights from social media.
### Solution
Stream political tweets and comments:
- **Kafka** or **Kinesis Firehose** for ingestion
- **Spark Structured Streaming** for processing
- **S3** + **PrestoDB** for storage and querying
- Dashboard built with **Apache Superset**
### Goals
- Classify sentiment as positive, neutral, or negative
- Track sentiment by politician, region, or hashtag
- Surface trending concerns and flag hate speech
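
The classification and per-hashtag tracking goals can be illustrated with a toy lexicon-based classifier. This is a deliberately simple stand-in: the word lists and function names are hypothetical, and a real pipeline would more likely apply a trained model inside the Spark Structured Streaming job.

```python
import re
from collections import Counter, defaultdict

# Toy lexicon for illustration only; not a usable sentiment model.
POSITIVE = {"good", "great", "support", "hope"}
NEGATIVE = {"bad", "corrupt", "fail", "angry"}


def classify(text):
    """Label text positive, neutral, or negative by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_by_hashtag(posts):
    """Count sentiment labels per hashtag across an iterable of post strings."""
    counts = defaultdict(Counter)
    for post in posts:
        label = classify(post)
        # Use a set so a hashtag repeated within one post is counted once.
        for tag in set(re.findall(r"#(\w+)", post.lower())):
            counts[tag][label] += 1
    return {tag: dict(c) for tag, c in counts.items()}
```

The same grouping approach extends to politician or region by swapping the hashtag extraction for a lookup on the relevant field.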