https://github.com/pizofreude/data-engineer-notes
Personal notes and lab solutions for the Data Engineer Handbook Bootcamp
- Host: GitHub
- URL: https://github.com/pizofreude/data-engineer-notes
- Owner: pizofreude
- License: apache-2.0
- Created: 2025-07-11T03:10:59.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-07-11T05:22:13.000Z (11 months ago)
- Last Synced: 2025-07-11T08:24:34.486Z (11 months ago)
- Topics: apache-flink, apache-spark, communication, data-engineering, data-quality, database, dimensional-data-modeling, fact-data-modeling, impact, kafka
- Language: Makefile
- Homepage:
- Size: 22.5 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Data Engineer Bootcamp Notes
Welcome to my personal notes for the **Data Engineer Handbook Bootcamp**.
This repository is my learning journal, containing summaries, key concepts, and lab solutions for the 6-week bootcamp. It complements my forked [Data Engineer Handbook repo](https://github.com/pizofreude/data-engineer-handbook).
---
## Repository Structure
```bash
data-engineer-notes/
├── README.md
├── resources.md
├── assets/
├── images/
├── week00/
│   ├── summary.md
│   ├── key-concepts.md
│   ├── lab-notes.md
│   └── lab00/
│       ├── solution.ipynb
│       └── ...               # Artifacts from bootcamp materials
├── week01/
│   └── (similar structure)
├── ...
└── week06/
    └── (similar structure)
```
### Week Notes
Each week contains:
- **Summary:** Key takeaways from the week
- **Key Concepts:** Detailed explanations and examples of core ideas
- **Lab Notes:** Observations, detailed notes, and troubleshooting during labs
- **Labs:** Solutions for each lab
---
## Links
- **Original Handbook:** [DataExpert-io/data-engineer-handbook](https://github.com/DataExpert-io/data-engineer-handbook)
- **My Fork:** [pizofreude/data-engineer-handbook](https://github.com/pizofreude/data-engineer-handbook)
---
## Learning Progress
- [X] Module 1: Bootcamp Orientation - Database setup and Boot Camp Kickoff [Week 0]
  - [ ] Bootcamp Kickoff | 20 min
  - [ ] Boot Camp Database Setup | 20 min
- [ ] Module 2: Dimensional Data Modeling [Week 1]
  - [ ] Dimensional Data Modeling Complex Data Type and Cumulation Day 1 Lecture | 43 min
  - [ ] Dimensional Data Modeling Complex Data Type and Cumulation Day 1 Lab | 41 min
  - [ ] Dimensional Data Modeling: Building Slowly Changing Dimensions Day 2 Lecture | 40 min
  - [ ] Dimensional Data Modeling: Building Slowly Changing Dimensions Day 2 Lab | 45 min
  - [ ] Dimensional Data Modeling: Graph Data Modeling Day 3 Lecture | 34 min
  - [ ] Dimensional Data Modeling: Graph Data Modeling Day 3 Lab | 46 min
  - [ ] Dimensional Data Modeling - Week 1 Assignment
- [ ] Module 3: Fact Data Modeling [Week 2]
  - [ ] Fact Data Modeling: Core Concepts, Deduplication Day 1 Lecture | 52 min
  - [ ] Fact Data Modeling: Practical Insights into Data Modeling Day 1 Lab | 40 min
  - [ ] Fact Data Modeling: Core Elements in Data Modeling Day 2 Lecture | 31 min
  - [ ] Fact Data Modeling: Compact Tables for Efficient Data Representation Day 2 Lab | 45 min
  - [ ] Fact Data Modeling: Minimizing Shuffle and Reducing Facts Day 3 Lecture | 32 min
  - [ ] Fact Data Modeling: Practical Guide to Formatting and Aggregating Data Day 3 Lab | 30 min
  - [ ] Fact Data Modeling - Week 2 Assignment
- [ ] Module 4: Apache Spark Fundamentals [Week 3]
  - [ ] Apache Spark: Architecture, Optimization, and Best Practices Day 1 Lecture | 48 min
  - [ ] Apache Spark: Hands-On for Broadcast and Hash Joins Day 1 Lab | 26 min
  - [ ] Apache Spark: Managing Spark Jobs and Notebooks Day 2 Lecture | 34 min
  - [ ] Apache Spark: User-Defined Functions and Broadcast Join Day 2 Lab | 36 min
  - [ ] Unit Testing Spark Jobs: Importance, Challenges, and Leadership Perspectives Lecture | 41 min
  - [ ] Unit Testing Spark Jobs: Mastering Spark and PySpark Testing Lab | 27 min
  - [ ] Spark Fundamentals - Week 3 Assignment
- [ ] Module 5: Applying Analytical Patterns [Week 4]
  - [ ] Applying Analytical Patterns: Exploring SQL, Scaling Projects and Aggregation Analysis Day 1 Lecture | 52 min
  - [ ] Applying Analytical Patterns: Mastering Growth Accounting and Retention Analysis Day 1 Lab | 34 min
  - [ ] Applying Analytical Patterns: Recursive CTEs and Window Functions Day 2 Lecture | 44 min
  - [ ] Applying Analytical Patterns: Aggregations and Cardinality Reduction Day 2 Lab | 33 min
  - [ ] Applying Analytical Patterns - Week 4 Assignment
- [ ] Module 6: Real-time pipelines with Flink and Kafka [Week 5]
  - [ ] Flink Lab Setup | 7 min
  - [ ] Streaming Pipelines: Mastering Streaming and Real-time Pipelines Day 1 Lecture | 50 min
  - [ ] Streaming Pipelines: Setting up Streaming Pipelines Day 1 Lab | 40 min
  - [ ] Streaming Pipelines: Exploring Data Collection and Processing Day 2 Lecture | 31 min
  - [ ] Streaming Pipelines: Kafka, Postgres, Spark Integrations and Parallelism Day 2 Lab | 39 min
  - [ ] Flink - Week 5 Assignment
- [ ] Module 7: Data Visualization and Impact [Week 6 Part 1]
  - [ ] Data Visualization and Impact: Mastering Data Engineering Day 1 Lecture | 39 min
  - [ ] Data Visualization and Impact: Hands-On with the CSV files Day 1 Lab | 8 min
  - [ ] Data Visualization and Impact: Insights and Best Practices Day 2 Lecture | 23 min
  - [ ] Data Visualization and Impact: Exploring Data Visualization and Aggregation Techniques Day 2 Lab | 37 min
  - [ ] Data Visualization - Week 6 1st Assignment
- [ ] Module 8: Data Pipeline Maintenance [Week 6 Part 2]
  - [ ] Data Pipeline Maintenance: Navigating the Complexities of Data Engineering Day 1 Lecture | 67 min
  - [ ] Data Pipeline Maintenance: Strategies for Maintenance and Dock Building Day 2 Lecture | 77 min
  - [ ] Data Pipeline Maintenance - Week 6 2nd Assignment
- [ ] Module 9: KPIs and Experimentation [Week 6 Part 3]
  - [ ] KPIs and Experimentation: Decoding Business Success: Metrics, Growth Strategies and Collaborative Approaches Day 1 Lecture | 55 min
  - [ ] KPIs and Experimentation: Setting up and Analysing Experiments Day 1 Lab | 36 min
  - [ ] KPIs and Experimentation: Leading and Lagging Metrics Day 2 Lecture | 65 min
  - [ ] KPIs and Experimentation - Week 6 3rd Assignment
- [ ] Module 10: Data Quality Patterns [Week 7]
  - [ ] Data Quality Patterns: MIDAS Process from Airbnb Day 1 Lecture | 45 min
  - [ ] Data Quality Patterns: Spec-Building Document Day 1 Lab | 33 min
  - [ ] Data Quality Patterns: WAP Patterns Day 2 Lecture | 27 min
---
## 💻 Daily Practice System
This repository now includes a comprehensive practice tracking system to organize daily coding practice across multiple platforms:
- **[practice/](practice/)** - Platform-organized coding problems (LeetCode, StrataScratch, HackerRank, NeetCode, Codewars, etc.)
- **[concepts/](concepts/)** - Reference notes on data structures, algorithms, SQL patterns, and system design
- **[interview-prep/](interview-prep/)** - Interview-specific preparation materials (behavioral, technical, system design)
- **[logs/](logs/)** - Daily practice logs and progress tracking with statistics dashboard
### Quick Start
```bash
# Create today's log entry
./scripts/new-day.sh
# Start a new problem
# ./scripts/create-problem.sh <platform> <difficulty> "problem-name"
./scripts/create-problem.sh leetcode medium "problem-name"
# Create a concept note
# ./scripts/link-concept.sh "concept-name"
./scripts/link-concept.sh "Window Functions" sql-patterns
# Generate weekly stats
python scripts/generate-stats.py
```
### ✅ Manual Update Checklist: For Every Problem You Solve
Here's your **streamlined checklist** for logging each problem.
#### 📋 **The 6-Step Workflow**
##### **Step 1: Start Your Day** ⏱️ *30 seconds*
**Run once per day (first thing in the morning):**
```bash
./scripts/new-day.sh
```
**✅ Done!** No manual edits needed for this step.
---
##### **Step 2: Scaffold the Problem** ⏱️ *30 seconds*
**For each new problem you're about to solve:**
```bash
./scripts/create-problem.sh <platform> <difficulty> "<problem-name>"
```
**Examples:**
```bash
./scripts/create-problem.sh codewars easy "absolute-value-log-base"
./scripts/create-problem.sh leetcode medium "rank-scores"
./scripts/create-problem.sh stratascratch hard "revenue-analysis"
```
**✅ Done!** Folder created, template copied, ready to code.
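For readers curious what the scaffolding step amounts to, here is a minimal Python sketch of the same idea. It is not the contents of `create-problem.sh`: the function name `scaffold_problem`, the `NOTES_TEMPLATE` text, and the choice to always create `solution.sql` are assumptions made for illustration. Only the `practice/<platform>/<difficulty>/<problem-name>/` layout and the pre-filled solve date come from this guide.

```python
from datetime import date
from pathlib import Path
from typing import Optional

# Hypothetical, heavily shortened template; the repository's real template differs.
NOTES_TEMPLATE = "# [Problem Name]\n\n## 📋 Metadata\n- **Date Solved:** {solved}\n"


def scaffold_problem(root: Path, platform: str, difficulty: str, name: str,
                     today: Optional[date] = None) -> Path:
    """Create practice/<platform>/<difficulty>/<name>/ with starter files."""
    folder = root / "practice" / platform / difficulty / name
    folder.mkdir(parents=True, exist_ok=True)

    solved = (today or date.today()).isoformat()
    notes = folder / "notes.md"
    if not notes.exists():  # never overwrite notes that were already filled in
        notes.write_text(NOTES_TEMPLATE.format(solved=solved), encoding="utf-8")

    # Assumption: an empty SQL file; a Python problem would use solution.py instead.
    (folder / "solution.sql").touch()
    return folder
```

Calling `scaffold_problem(Path("."), "codewars", "easy", "absolute-value-log-base")` would create the same folder as the first example command above, assuming the script follows this layout.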
---
##### **Step 3: Write Your Solution** ⏱️ *10-30 minutes (solving time)*
Navigate to the problem folder:
```bash
cd practice/<platform>/<difficulty>/<problem-name>/
```
**Open and write your solution:**
```bash
code solution.sql   # For SQL problems
# OR
code solution.py # For Python/algorithm problems
```
**What to do:**
- ✏️ **Paste your working solution code**
- ✏️ **Add comments explaining key logic** (optional but recommended)
- 💾 **Save the file**
**Example:**
```sql
-- Calculate absolute value and logarithm base 64
SELECT
ABS(number1) AS abs,
LOG(64, number2) AS log
FROM decimals;
```
---
##### **Step 4: Document Your Solution** ⏱️ *10-15 minutes*
**Open the notes file:**
```bash
code notes.md
```
**You need to manually update these sections:**
##### **A. Metadata (Top of file)**
```markdown
# [Problem Name] ← CHANGE THIS
## 📋 Metadata
- **Platform:** [Platform name] ← CHANGE THIS
- **Difficulty:** [Easy/Medium/Hard] (Optional: add platform rating like "7 kyu") ← CHANGE THIS
- **Date Solved:** 2026-01-03 ← ✅ ALREADY FILLED BY SCRIPT
- **Time Spent:** XX minutes ← CHANGE THIS
- **Status:** [✅ Solved | 🔄 Revisit | ❌ Stuck] ← CHANGE THIS
```
**Example:**
```markdown
# Absolute Value and Log to Base
## 📋 Metadata
- **Platform:** Codewars
- **Difficulty:** Easy (7 kyu)
- **Date Solved:** 2026-01-03 ← Script filled this
- **Time Spent:** 15 minutes
- **Status:** ✅ Solved
```
---
##### **B. Links**
```markdown
## 🔗 Links
- [Problem URL] ← PASTE THE ACTUAL URL HERE
```
**Example:**
```markdown
## 🔗 Links
- https://www.codewars.com/kata/594a8f2f7ca3c692a4000041/train/sql
```
---
##### **C. Topics & Tags (Check the boxes)**
```markdown
## 📚 Topics & Tags
- [ ] SQL
- [ ] Window Functions
- [ ] Joins
- [ ] CTEs
- [ ] Python
- [ ] Dynamic Programming
```
**Check the relevant ones:**
```markdown
## 📚 Topics & Tags
- [x] SQL ← Put 'x' inside
- [x] Mathematical Functions
- [ ] Window Functions
- [ ] Joins
```
---
##### **D. Problem Statement**
````markdown
## 📝 Problem Statement
[Paste the problem description here]
### Example Input/Output
```
Input:
Output:
```
````
**What to do:**
- ✏️ Copy-paste the problem description from the platform
- ✏️ Add example input/output (if provided)
---
##### **E. Approach**
```markdown
## 💡 Approach
### Initial Thoughts
[What was your first idea? What patterns did you recognize?]
### Solution Strategy
1. Step 1
2. Step 2
3. Step 3
```
**What to do:**
- ✏️ Write your thought process (2-3 sentences)
- ✏️ List the steps you took (numbered list)
**Example:**
```markdown
## 💡 Approach
### Initial Thoughts
Straightforward application of SQL math functions: ABS for absolute value, LOG for logarithm with custom base.
### Solution Strategy
1. Use `ABS(number1)` to get absolute values
2. Use `LOG(64, number2)` for logarithm base 64
3. Alias columns as required (`abs`, `log`)
```
---
##### **F. Solution**
````markdown
## 🖥️ Solution
### Attempt 1 (Initial)
```sql
-- Your first solution here
```
**Result:** [Passed/Failed/Timeout]
````
**What to do:**
- ✏️ Paste your solution code (can be same as `solution.sql`)
- ✏️ Note if it passed or failed
**If you optimized it, add:**
````markdown
### Attempt 2 (Optimized) ⭐
```sql
-- Improved solution
```
**Result:** ✅ Passed with better performance
````
---
##### **G. Complexity Analysis**
```markdown
## ⚡ Complexity Analysis
- **Time Complexity:** O(?)
- **Space Complexity:** O(?)
```
**What to do:**
- ✏️ Fill in the Big O notation
- ✏️ If you don't know, write: "Time: O(n) - single pass through table"
**Example:**
```markdown
## ⚡ Complexity Analysis
- **Time Complexity:** O(n) - single pass through table
- **Space Complexity:** O(n) - result set same size as input
```
---
##### **H. Key Learnings**
```markdown
## 🎓 Key Learnings
1.
2.
3.
```
**What to do:**
- ✏️ Write 2-4 things you learned (this is THE MOST IMPORTANT SECTION!)
**Example:**
```markdown
## 🎓 Key Learnings
1. **ABS()** - Returns absolute value (distance from zero)
2. **LOG(base, value)** - PostgreSQL syntax for custom base logarithm
3. PostgreSQL and MySQL both accept `LOG(base, value)`; SQL Server reverses the arguments as `LOG(value, base)`. The portable fallback is `LN(value) / LN(base)`
4. Base-64 logarithm: `LOG(64, 4096) = 2` because 64² = 4096
```
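The base-64 example above can be sanity-checked outside the database with the change-of-base identity, log_b(x) = ln(x) / ln(b). The sketch below uses only Python's standard `math` module; the helper name `log_base` is made up for illustration, and it takes its arguments in the same base-first order as the SQL example.

```python
import math


def log_base(base: float, value: float) -> float:
    """Logarithm of `value` in `base`, computed via change of base."""
    if base <= 0 or base == 1 or value <= 0:
        raise ValueError("base must be positive and not 1; value must be positive")
    return math.log(value) / math.log(base)


# 64 squared is 4096, so the base-64 logarithm of 4096 should be 2.
print(math.isclose(log_base(64, 4096), 2.0))
# The square root of 64 is 8, so the base-64 logarithm of 8 should be 0.5.
print(math.isclose(log_base(64, 8), 0.5))
```

The comparison uses `math.isclose` rather than `==` because the quotient of two floating-point logarithms is not guaranteed to be exact.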
---
##### **I. Related Concepts (Optional)**
```markdown
## 🏷️ Related Concepts
See: `concepts/sql-patterns/[concept-file].md`
```
**What to do:**
- ✏️ If you created a concept note, link it here
- ⏭️ Skip if you haven't created a concept yet
**Example:**
```markdown
## 🏷️ Related Concepts
See: `concepts/sql-patterns/sql-mathematical-functions.md`
```
---
### **Step 5: Update Daily Log** ⏱️ *5 minutes*
**Open today's log:**
```bash
cd ../../../../ # Return to repo root
code logs/2026/01-january.md
```
**Find today's date section** and fill in:
#### **A. Time Spent**
```markdown
### Saturday, January 03, 2026
⏱️ Time: X hours ← CHANGE THIS
```
**Example:**
```markdown
⏱️ Time: 1.5 hours
```
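The monthly log path (`logs/2026/01-january.md`) and the dated heading both follow from the calendar date, so they can be derived rather than typed by hand. The sketch below shows one way to do that with the standard library; the function names are hypothetical and this is not taken from `new-day.sh`. It assumes the default English locale for month and weekday names.

```python
from datetime import date
from pathlib import Path


def monthly_log_path(day: date) -> Path:
    """Return logs/<YYYY>/<MM>-<month name in lowercase>.md for the given day."""
    file_name = f"{day.strftime('%m')}-{day.strftime('%B').lower()}.md"
    return Path("logs") / day.strftime("%Y") / file_name


def day_heading(day: date) -> str:
    """Return the level-3 heading used for one day's section in the log."""
    return "### " + day.strftime("%A, %B %d, %Y")


example = date(2026, 1, 3)
print(monthly_log_path(example).as_posix())  # logs/2026/01-january.md
print(day_heading(example))
```

Deriving the weekday with `strftime("%A")` avoids mismatches between the date and the day name in the heading.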
---
#### **B. Problems Completed**
```markdown
#### ✅ Completed
1.
```
**Add each problem with:**
- Problem name and difficulty
- Key topics
- Link to your solution
- One-line key learning
**Example:**
```markdown
#### ✅ Completed
1. **Codewars - Absolute Value and Log to Base** (Easy/7kyu)
   - Topics: ABS(), LOG(), Mathematical functions
   - [Solution](../../practice/codewars/easy/absolute-value-log-base/)
   - Key learning: PostgreSQL `LOG(base, value)` takes the base first; SQL Server's `LOG(value, base)` takes it second
2. **LeetCode 178 - Rank Scores** (Medium)
   - Topics: Window functions, DENSE_RANK
   - [Solution](../../practice/leetcode/medium/178-rank-scores/)
   - Key learning: DENSE_RANK vs RANK vs ROW_NUMBER differences
```
---
#### **C. Learnings**
```markdown
#### 💡 Learnings
-
```
**Write 2-4 broader learnings from today:**
**Example:**
```markdown
#### 💡 Learnings
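- Mathematical functions in SQL are database-specific (for example, `LOG` takes the base first in PostgreSQL but second in SQL Server)
- Guard against NULL, zero, and negative inputs before calling `LOG()`; PostgreSQL raises an error for non-positive values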
- ABS() is useful for calculating distances and differences
- Created concept note: `concepts/sql-patterns/sql-mathematical-functions.md`
```
---
#### **D. Tomorrow's Plan**
```markdown
#### 🎯 Tomorrow
- [ ]
```
**Plan 2-3 things for tomorrow:**
**Example:**
```markdown
#### 🎯 Tomorrow
- [ ] LeetCode 180 - Consecutive Numbers (Window functions practice)
- [ ] StrataScratch - Revenue analysis problem
- [ ] Review: Self-joins pattern
```
---
### **Step 6: Commit & Push** ⏱️ *1 minute*
```bash
git status
# Add your changes
git add practice/<platform>/<difficulty>/<problem-name>/
git add logs/2026/01-january.md
# If you created a concept note, add it too
git add concepts/
# Commit with descriptive message
git commit -m "✅ [Platform]: [Problem Name] - [Key Topic]"
# Push to GitHub
git push
```
**Example commit messages:**
```bash
git commit -m "✅ Codewars: Absolute Value and Log to Base - SQL math functions"
git commit -m "✅ LeetCode 178: Rank Scores - Window functions"
git commit -m "✅ StrataScratch: Revenue Analysis - CTEs and aggregations"
```
---
## 📊 **Weekly: Update Stats Dashboard** ⏱️ *5 minutes*
**Run every Sunday (or end of week):**
```bash
python scripts/generate-stats.py
```
**Copy the output:**
```
## 📊 All-Time Stats
| Platform | Easy | Medium | Hard | Total |
|---------------|------|--------|------|-------|
| Codewars | 5 | 2 | 0 | 7 |
| Leetcode | 12 | 8 | 1 | 21 |
| **Total** | **17** | **10** | **1** | **28** |
📅 Last Updated: 2026-01-05 20:30
```
**Paste it into:**
```bash
code logs/README.md
```
**Replace the old stats section** with the new output.
**Also update:**
```markdown
## 🔥 Current Streaks
- **Daily Practice:** X days ← UPDATE THIS MANUALLY
```
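Since the streak is updated by hand, a small helper can at least compute the number to write down. This is a sketch under assumptions, not part of the repository's scripts: the name `current_streak` is made up, and the practice days are passed in directly rather than parsed from the logs. A day without an entry yet does not break the streak until it is over, so counting starts from yesterday when today is still empty.

```python
from datetime import date, timedelta
from typing import Iterable


def current_streak(practice_days: Iterable[date], today: date) -> int:
    """Count consecutive practice days ending today, or yesterday if today is empty."""
    days = set(practice_days)
    day = today if today in days else today - timedelta(days=1)
    streak = 0
    while day in days:
        streak += 1
        day -= timedelta(days=1)
    return streak


practiced = [date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 3), date(2026, 1, 5)]
print(current_streak(practiced, today=date(2026, 1, 5)))  # 1, because Jan 4 was missed
print(current_streak(practiced, today=date(2026, 1, 4)))  # 3, counting Jan 1 to Jan 3
```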
**Commit:**
```bash
git add logs/README.md
git commit -m "📊 Update weekly practice stats"
git push
```
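To make the stats table less of a black box, the sketch below shows one way such a table could be produced by walking the `practice/<platform>/<difficulty>/<problem-name>/` layout. It is an illustration only and not the code in `scripts/generate-stats.py`: the function names, the fixed difficulty list, and the capitalized platform names are assumptions chosen to match the sample output above.

```python
from collections import Counter
from pathlib import Path
from typing import Dict

DIFFICULTIES = ("easy", "medium", "hard")


def count_problems(practice_root: Path) -> Dict[str, Counter]:
    """Count problem folders per platform and difficulty under practice/."""
    counts: Dict[str, Counter] = {}
    platform_dirs = sorted(p for p in practice_root.iterdir() if p.is_dir())
    for platform_dir in platform_dirs:  # any folder counts as a platform
        per_level: Counter = Counter()
        for level in DIFFICULTIES:
            level_dir = platform_dir / level
            if level_dir.is_dir():
                per_level[level] = sum(1 for p in level_dir.iterdir() if p.is_dir())
        counts[platform_dir.name] = per_level
    return counts


def render_table(counts: Dict[str, Counter]) -> str:
    """Render the counts as a Markdown table with a bold totals row."""
    lines = [
        "| Platform | Easy | Medium | Hard | Total |",
        "|----------|------|--------|------|-------|",
    ]
    totals: Counter = Counter()
    for platform, per_level in counts.items():
        totals.update(per_level)
        row = [per_level[level] for level in DIFFICULTIES]  # missing levels count as 0
        lines.append(
            f"| {platform.capitalize()} | {row[0]} | {row[1]} | {row[2]} | {sum(row)} |"
        )
    total_row = [totals[level] for level in DIFFICULTIES]
    lines.append(
        f"| **Total** | **{total_row[0]}** | **{total_row[1]}** "
        f"| **{total_row[2]}** | **{sum(total_row)}** |"
    )
    return "\n".join(lines)
```

Running `print(render_table(count_problems(Path("practice"))))` from the repository root would print a table in the same shape as the sample. Because platforms are discovered from folder names, a new platform shows up as soon as its first problem folder exists.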
---
## ✅ **Quick Reference Checklist**
Print this and keep it next to you:
```
□ Step 1: ./scripts/new-day.sh (once per day)
For each problem:
□ Step 2: ./scripts/create-problem.sh <platform> <difficulty> "<problem-name>"
□ Step 3: Write solution in solution.sql or solution.py
□ Step 4: Fill in notes.md:
    □ Change title
    □ Update metadata (platform, difficulty, time, status)
    □ Paste problem URL
    □ Check topic tags
    □ Paste problem statement
    □ Write approach & strategy
    □ Paste solution code
    □ Add complexity analysis
    □ Write key learnings (MOST IMPORTANT!)
    □ Link concept note (if created)
□ Step 5: Update logs/2026/01-january.md:
    □ Time spent today
    □ Add problem to "Completed" list
    □ Write today's learnings
    □ Plan tomorrow's focus
□ Step 6: git add → commit → push
Weekly:
□ Sunday: Run generate-stats.py
□ Update logs/README.md & the monthly log + practice/README.md with new stats
```
---
## 💡 **Time-Saving Tips**
### **Minimal Version (10 min per problem)**
If you're short on time, focus on:
1. ✅ Solution code (`solution.sql`)
2. ✅ Key learnings in `notes.md`
3. ✅ Daily log entry
Skip the rest for now and come back later to fill it in.
---
### **Batch Update**
If you solve multiple problems:
1. Scaffold all problems first
2. Solve all problems
3. Update all `notes.md` files
4. Update daily log once (list all problems)
5. Single commit at the end
---
### **Use Snippets/Shortcuts**
Create editor snippets for repetitive sections like complexity analysis, common tags, etc.
---
## 🎯 **Summary: What You MUST Do Manually**
| File | What to Update |
|------|----------------|
| **solution.sql** | Your code |
| **notes.md** | Title, metadata, URL, approach, learnings |
| **logs/YYYY/MM-month.md** | Time, problems list, learnings, tomorrow's plan |
| **logs/README.md** | Weekly stats (copy from script output) |
**Everything else is automated!** 🎉
### Features
- **Automation Scripts:** Quickly scaffold new problems and logs with templates
- **Platform-Agnostic:** Automatically discovers and tracks any coding platform
- **Comprehensive Templates:** Detailed templates for problems, concepts, and daily logs
- **Progress Tracking:** Statistics generation and progress dashboards
- **Knowledge Base:** Structured concept notes linked to practice problems
See [practice/README.md](practice/README.md) for detailed usage instructions and workflow.
---