https://github.com/pizofreude/data-engineer-notes

Personal notes and lab solutions for the Data Engineer Handbook Bootcamp

apache-flink apache-spark communication data-engineering data-quality database dimensional-data-modeling fact-data-modeling impact kafka


# Data Engineer Bootcamp Notes

Welcome to my personal notes for the **Data Engineer Handbook Bootcamp**.

This repository is my learning journal, containing summaries, key concepts, and lab solutions for the 6-week bootcamp. It complements my forked [Data Engineer Handbook repo](https://github.com/pizofreude/data-engineer-handbook).

---

## Repository Structure

```bash
data-engineer-notes/
├── README.md
├── resources.md
├── assets/
├── images/
├── week00/
│   ├── summary.md
│   ├── key-concepts.md
│   ├── lab-notes.md
│   └── lab00/
│       ├── solution.ipynb
│       └── ...  # Artifacts from bootcamp materials
├── week01/
│   └── (similar structure)
├── ...
└── week06/
    └── (similar structure)
```

### Week Notes
Each week contains:
- **Summary:** Key takeaways from the week
- **Key Concepts:** Detailed explanations and examples of core ideas
- **Lab Notes:** Observations, detailed notes, and troubleshooting during labs
- **Labs:** Solutions for each lab

---

## Links

- **Original Handbook:** [DataExpert-io/data-engineer-handbook](https://github.com/DataExpert-io/data-engineer-handbook)
- **My Fork:** [pizofreude/data-engineer-handbook](https://github.com/pizofreude/data-engineer-handbook)

---

## Learning Progress

- [X] Module 1: Bootcamp Orientation - Database setup and Boot Camp Kickoff [Week 0]
  - [ ] Bootcamp Kickoff | 20 min
  - [ ] Boot Camp Database Setup | 20 min
- [ ] Module 2: Dimensional Data Modeling [Week 1]
  - [ ] Dimensional Data Modeling Complex Data Type and Cumulation Day 1 Lecture | 43 min
  - [ ] Dimensional Data Modeling Complex Data Type and Cumulation Day 1 Lab | 41 min
  - [ ] Dimensional Data Modeling: Building Slowly Changing Dimensions Day 2 Lecture | 40 min
  - [ ] Dimensional Data Modeling: Building Slowly Changing Dimensions Day 2 Lab | 45 min
  - [ ] Dimensional Data Modeling: Graph Data Modeling Day 3 Lecture | 34 min
  - [ ] Dimensional Data Modeling: Graph Data Modeling Day 3 Lab | 46 min
  - [ ] Dimensional Data Modeling - Week 1 Assignment
- [ ] Module 3: Fact Data Modeling [Week 2]
  - [ ] Fact Data Modeling: Core Concepts, Deduplication Day 1 Lecture | 52 min
  - [ ] Fact Data Modeling: Practical Insights into Data Modeling Day 1 Lab | 40 min
  - [ ] Fact Data Modeling: Core Elements in Data Modeling Day 2 Lecture | 31 min
  - [ ] Fact Data Modeling: Compact Tables for Efficient Data Representation Day 2 Lab | 45 min
  - [ ] Fact Data Modeling: Minimizing Shuffle and Reducing Facts Day 3 Lecture | 32 min
  - [ ] Fact Data Modeling: Practical Guide to Formatting and Aggregating Data Day 3 Lab | 30 min
  - [ ] Fact Data Modeling - Week 2 Assignment
- [ ] Module 4: Apache Spark Fundamentals [Week 3]
  - [ ] Apache Spark: Architecture, Optimization, and Best Practices Day 1 Lecture | 48 min
  - [ ] Apache Spark: Hands-On for Broadcast and Hash Joins Day 1 Lab | 26 min
  - [ ] Apache Spark: Managing Spark Jobs and Notebooks Day 2 Lecture | 34 min
  - [ ] Apache Spark: User-Defined Functions and Broadcast Join Day 2 Lab | 36 min
  - [ ] Unit Testing Spark Jobs: Importance, Challenges, and Leadership Perspectives Lecture | 41 min
  - [ ] Unit Testing Spark Jobs: Mastering Spark and PySpark Testing Lab | 27 min
  - [ ] Spark Fundamentals - Week 3 Assignment
- [ ] Module 5: Applying Analytical Patterns [Week 4]
  - [ ] Applying Analytical Patterns: Exploring SQL, Scaling Projects and Aggregation Analysis Day 1 Lecture | 52 min
  - [ ] Applying Analytical Patterns: Mastering Growth Accounting and Retention Analysis Day 1 Lab | 34 min
  - [ ] Applying Analytical Patterns: Recursive CTEs and Window Functions Day 2 Lecture | 44 min
  - [ ] Applying Analytical Patterns: Aggregations and Cardinality Reduction Day 2 Lab | 33 min
  - [ ] Applying Analytical Patterns - Week 4 Assignment
- [ ] Module 6: Real-time pipelines with Flink and Kafka [Week 5]
  - [ ] Flink Lab Setup | 7 min
  - [ ] Streaming Pipelines: Mastering Streaming and Real-time Pipelines Day 1 Lecture | 50 min
  - [ ] Streaming Pipelines: Setting up Streaming Pipelines Day 1 Lab | 40 min
  - [ ] Streaming Pipelines: Exploring Data Collection and Processing Day 2 Lecture | 31 min
  - [ ] Streaming Pipelines: Kafka, Postgres, Spark Integrations and Parallelism Day 2 Lab | 39 min
  - [ ] Flink - Week 5 Assignment
- [ ] Module 7: Data Visualization and Impact [Week 6 Part 1]
  - [ ] Data Visualization and Impact: Mastering Data Engineering Day 1 Lecture | 39 min
  - [ ] Data Visualization and Impact: Hands-On with the CSV files Day 1 Lab | 8 min
  - [ ] Data Visualization and Impact: Insights and Best Practices Day 2 Lecture | 23 min
  - [ ] Data Visualization and Impact: Exploring Data Visualization and Aggregation Techniques Day 2 Lab | 37 min
  - [ ] Data Visualization - Week 6 1st Assignment
- [ ] Module 8: Data Pipeline Maintenance [Week 6 Part 2]
  - [ ] Data Pipeline Maintenance: Navigating the Complexities of Data Engineering Day 1 Lecture | 67 min
  - [ ] Data Pipeline Maintenance: Strategies for Maintenance and Dock Building Day 2 Lecture | 77 min
  - [ ] Data Pipeline Maintenance - Week 6 2nd Assignment
- [ ] Module 9: KPIs and Experimentation [Week 6 Part 3]
  - [ ] KPIs and Experimentation: Decoding Business Success: Metrics, Growth Strategies and Collaborative Approaches Day 1 Lecture | 55 min
  - [ ] KPIs and Experimentation: Setting up and Analysing Experiments Day 1 Lab | 36 min
  - [ ] KPIs and Experimentation: Leading and Lagging Metrics Day 2 Lecture | 65 min
  - [ ] KPIs and Experimentation - Week 6 3rd Assignment
- [ ] Module 10: Data Quality Patterns [Week 7]
  - [ ] Data Quality Patterns: MIDAS Process from Airbnb Day 1 Lecture | 45 min
  - [ ] Data Quality Patterns: Spec-Building Document Day 1 Lab | 33 min
  - [ ] Data Quality Patterns: WAP Patterns Day 2 Lecture | 27 min

---

## 💻 Daily Practice System

This repository now includes a comprehensive practice tracking system to organize daily coding practice across multiple platforms:

- **[practice/](practice/)** - Platform-organized coding problems (LeetCode, StrataScratch, HackerRank, NeetCode, Codewars, etc.)
- **[concepts/](concepts/)** - Reference notes on data structures, algorithms, SQL patterns, and system design
- **[interview-prep/](interview-prep/)** - Interview-specific preparation materials (behavioral, technical, system design)
- **[logs/](logs/)** - Daily practice logs and progress tracking with statistics dashboard

### Quick Start

```bash
# Create today's log entry
./scripts/new-day.sh

# Start a new problem
# ./scripts/create-problem.sh "problem-name"
./scripts/create-problem.sh leetcode medium "problem-name"

# Create a concept note
# ./scripts/link-concept.sh "concept-name"
./scripts/link-concept.sh "Window Functions" sql-patterns

# Generate weekly stats
python scripts/generate-stats.py
```

### ✅ Manual Update Checklist: For Every Problem You Solve

Here's your **streamlined checklist** for logging each problem.

#### 📋 **The 6-Step Workflow**

##### **Step 1: Start Your Day** ⏱️ *30 seconds*

**Run once per day (first thing in the morning):**

```bash
./scripts/new-day.sh
```

**✅ Done!** No manual edits needed for this step.

---

##### **Step 2: Scaffold the Problem** ⏱️ *30 seconds*

**For each new problem you're about to solve:**

```bash
./scripts/create-problem.sh <platform> <difficulty> "<problem-name>"
```

**Examples:**
```bash
./scripts/create-problem.sh codewars easy "absolute-value-log-base"
./scripts/create-problem.sh leetcode medium "rank-scores"
./scripts/create-problem.sh stratascratch hard "revenue-analysis"
```

**✅ Done!** Folder created, template copied, ready to code.
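
To make the scaffolding step concrete, here is a minimal Python sketch of what such a script might do. It is not the repository's actual `create-problem.sh`: the function name `scaffold_problem`, the trimmed-down template, and the choice of `solution.sql` as the starter file are all assumptions for illustration. Only the `practice/<platform>/<difficulty>/<problem-name>/` layout and the pre-filled solve date come from this README.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional

# Hypothetical, trimmed-down template; the repository's real template is richer.
NOTES_TEMPLATE = """# [Problem Name]

## Metadata
- **Platform:** [Platform name]
- **Difficulty:** [Easy/Medium/Hard]
- **Date Solved:** {date_solved}
- **Time Spent:** XX minutes
"""


def scaffold_problem(root: Path, platform: str, difficulty: str, name: str,
                     today: Optional[date] = None) -> Path:
    """Create practice/<platform>/<difficulty>/<name>/ with starter files."""
    today = today or date.today()
    problem_dir = root / "practice" / platform / difficulty / name
    # Fail loudly instead of overwriting an existing problem folder.
    problem_dir.mkdir(parents=True, exist_ok=False)
    notes = NOTES_TEMPLATE.format(date_solved=today.isoformat())
    (problem_dir / "notes.md").write_text(notes, encoding="utf-8")
    (problem_dir / "solution.sql").write_text("-- Your solution here\n", encoding="utf-8")
    return problem_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_problem(Path(tmp), "codewars", "easy", "absolute-value-log-base")
        print(sorted(p.name for p in created.iterdir()))  # ['notes.md', 'solution.sql']
```

Refusing to overwrite an existing folder (`exist_ok=False`) is a deliberate choice in this sketch, so that re-running the command cannot clobber notes you have already written.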

---

##### **Step 3: Write Your Solution** ⏱️ *10-30 minutes (solving time)*

Navigate to the problem folder:

```bash
cd practice/<platform>/<difficulty>/<problem-name>/
```

**Open and write your solution:**

```bash
code solution.sql # For SQL problems
# OR
code solution.py # For Python/algorithm problems
```

**What to do:**
- ✏️ **Paste your working solution code**
- ✏️ **Add comments explaining key logic** (optional but recommended)
- 💾 **Save the file**

**Example:**
```sql
-- Calculate absolute value and logarithm base 64
SELECT
    ABS(number1) AS abs,
    LOG(64, number2) AS log
FROM decimals;
```

---

##### **Step 4: Document Your Solution** ⏱️ *10-15 minutes*

**Open the notes file:**

```bash
code notes.md
```

**You need to manually update these sections:**

##### **A. Metadata (Top of file)**
```markdown
# [Problem Name] ← CHANGE THIS

## 📋 Metadata
- **Platform:** [Platform name] ← CHANGE THIS
- **Difficulty:** [Easy/Medium/Hard] (Optional: add platform rating like "7 kyu") ← CHANGE THIS
- **Date Solved:** 2026-01-03 ← ✅ ALREADY FILLED BY SCRIPT
- **Time Spent:** XX minutes ← CHANGE THIS
- **Status:** [✅ Solved | 🔄 Revisit | ❌ Stuck] ← CHANGE THIS
```

**Example:**
```markdown
# Absolute Value and Log to Base

## 📋 Metadata
- **Platform:** Codewars
- **Difficulty:** Easy (7 kyu)
- **Date Solved:** 2026-01-03 ← Script filled this
- **Time Spent:** 15 minutes
- **Status:** ✅ Solved
```

---

##### **B. Links**
```markdown
## 🔗 Links
- [Problem URL] ← PASTE THE ACTUAL URL HERE
```

**Example:**
```markdown
## 🔗 Links
- https://www.codewars.com/kata/594a8f2f7ca3c692a4000041/train/sql
```

---

##### **C. Topics & Tags (Check the boxes)**
```markdown
## 📚 Topics & Tags
- [ ] SQL
- [ ] Window Functions
- [ ] Joins
- [ ] CTEs
- [ ] Python
- [ ] Dynamic Programming
```

**Check the relevant ones:**
```markdown
## 📚 Topics & Tags
- [x] SQL ← Put 'x' inside
- [x] Mathematical Functions
- [ ] Window Functions
- [ ] Joins
```

---

##### **D. Problem Statement**
````markdown
## 📝 Problem Statement
[Paste the problem description here]

### Example Input/Output
```
Input:
Output:
```
````

**What to do:**
- ✏️ Copy-paste the problem description from the platform
- ✏️ Add example input/output (if provided)

---

##### **E. Approach**
```markdown
## 💡 Approach

### Initial Thoughts
[What was your first idea? What patterns did you recognize?]

### Solution Strategy
1. Step 1
2. Step 2
3. Step 3
```

**What to do:**
- ✏️ Write your thought process (2-3 sentences)
- ✏️ List the steps you took (bullet points)

**Example:**
```markdown
## 💡 Approach

### Initial Thoughts
Straightforward application of SQL math functions: ABS for absolute value, LOG for logarithm with custom base.

### Solution Strategy
1. Use `ABS(number1)` to get absolute values
2. Use `LOG(64, number2)` for logarithm base 64
3. Alias columns as required (`abs`, `log`)
```

---

##### **F. Solution**
````markdown
## 🖥️ Solution

### Attempt 1 (Initial)
```sql
-- Your first solution here
```

**Result:** [Passed/Failed/Timeout]
````

**What to do:**
- ✏️ Paste your solution code (can be same as `solution.sql`)
- ✏️ Note if it passed or failed

**If you optimized it, add:**
````markdown
### Attempt 2 (Optimized) ⭐
```sql
-- Improved solution
```

**Result:** ✅ Passed with better performance
````

---

##### **G. Complexity Analysis**
```markdown
## ⚡ Complexity Analysis
- **Time Complexity:** O(?)
- **Space Complexity:** O(?)
```

**What to do:**
- ✏️ Fill in the Big O notation
- ✏️ If you don't know, write: "Time: O(n) - single pass through table"

**Example:**
```markdown
## ⚡ Complexity Analysis
- **Time Complexity:** O(n) - single pass through table
- **Space Complexity:** O(n) - result set same size as input
```

---

##### **H. Key Learnings**
```markdown
## 🎓 Key Learnings
1.
2.
3.
```

**What to do:**
- ✏️ Write 2-4 things you learned (this is THE MOST IMPORTANT SECTION!)

**Example:**
```markdown
## 🎓 Key Learnings
1. **ABS()** - Returns absolute value (distance from zero)
2. **LOG(base, value)** - PostgreSQL syntax for custom base logarithm
3. PostgreSQL and MySQL both accept `LOG(base, value)`; in MySQL it is equivalent to `LOG(value) / LOG(base)`
4. Base-64 logarithm: `LOG(64, 4096) = 2` because 64² = 4096
```
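
The base-64 claim in the example above can be checked outside the database with the change-of-base formula. The sketch below is illustrative only: `log_base` is a hypothetical helper whose argument order mirrors SQL's `LOG(base, value)`. Python's own `math.log(x, base)` takes the value first.

```python
import math


def log_base(base: float, value: float) -> float:
    """Logarithm of `value` in `base`, using the SQL-style (base, value) argument order."""
    if value <= 0 or base <= 0 or base == 1:
        raise ValueError("log needs value > 0 and base > 0 with base != 1")
    # Change of base: log_b(x) = ln(x) / ln(b)
    return math.log(value) / math.log(base)


if __name__ == "__main__":
    # 64 squared is 4096, so the base-64 logarithm of 4096 should be 2.
    print(math.isclose(log_base(64, 4096), 2.0))  # True
```

Comparing with `math.isclose` rather than `==` avoids depending on exact floating-point rounding.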

---

##### **I. Related Concepts (Optional)**
```markdown
## 🏷️ Related Concepts
See: `concepts/sql-patterns/[concept-file].md`
```

**What to do:**
- ✏️ If you created a concept note, link it here
- ⏭️ Skip if you haven't created a concept yet

**Example:**
```markdown
## 🏷️ Related Concepts
See: `concepts/sql-patterns/sql-mathematical-functions.md`
```

---

##### **Step 5: Update Daily Log** ⏱️ *5 minutes*

**Open today's log:**

```bash
cd ../../../../ # Return to repo root
code logs/2026/01-january.md
```

**Find today's date section** and fill in:

##### **A. Time Spent**
```markdown
### Saturday, January 03, 2026
⏱️ Time: X hours ← CHANGE THIS
```

**Example:**
```markdown
⏱️ Time: 1.5 hours
```
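
Deriving the log file path and the date heading from the date itself keeps the two in sync and removes one manual step. The following is a small sketch under stated assumptions: `log_path` and `log_heading` are hypothetical helpers, the `logs/YYYY/MM-month.md` pattern is taken from the paths shown in this README, and English month and weekday names rely on Python's default locale.

```python
from datetime import date


def log_path(day: date) -> str:
    """Monthly log file for a date, e.g. logs/2026/01-january.md."""
    # %B yields the full month name; lower() matches the lowercase file names.
    return f"logs/{day:%Y}/{day:%m}-{day:%B}.md".lower()


def log_heading(day: date) -> str:
    """Markdown heading for a day's section, weekday included."""
    return f"### {day:%A, %B %d, %Y}"


if __name__ == "__main__":
    solved = date(2026, 1, 3)
    print(log_path(solved))     # logs/2026/01-january.md
    print(log_heading(solved))  # ### Saturday, January 03, 2026
```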

---

##### **B. Problems Completed**
```markdown
#### ✅ Completed
1.
```

**Add each problem with:**
- Problem name and difficulty
- Key topics
- Link to your solution
- One-line key learning

**Example:**
```markdown
#### ✅ Completed
1. **Codewars - Absolute Value and Log to Base** (Easy/7kyu)
   - Topics: ABS(), LOG(), Mathematical functions
   - [Solution](../../practice/codewars/easy/absolute-value-log-base/)
   - Key learning: `LOG(base, value)` takes the base first in PostgreSQL and MySQL; single-argument `LOG(x)` is base 10 in PostgreSQL but the natural log in MySQL

2. **LeetCode 178 - Rank Scores** (Medium)
   - Topics: Window functions, DENSE_RANK
   - [Solution](../../practice/leetcode/medium/178-rank-scores/)
   - Key learning: DENSE_RANK vs RANK vs ROW_NUMBER differences
```

---

##### **C. Learnings**
```markdown
#### 💡 Learnings
-
```

**Write 2-4 broader learnings from today:**

**Example:**
```markdown
#### 💡 Learnings
- Mathematical functions in SQL are database-specific (PostgreSQL vs MySQL syntax)
- Always check for NULL values when using LOG() with user input
- ABS() is useful for calculating distances and differences
- Created concept note: `concepts/sql-patterns/sql-mathematical-functions.md`
```

---

##### **D. Tomorrow's Plan**
```markdown
#### 🎯 Tomorrow
- [ ]
```

**Plan 2-3 things for tomorrow:**

**Example:**
```markdown
#### 🎯 Tomorrow
- [ ] LeetCode 180 - Consecutive Numbers (Window functions practice)
- [ ] StrataScratch - Revenue analysis problem
- [ ] Review: Self-joins pattern
```

---

##### **Step 6: Commit & Push** ⏱️ *1 minute*

```bash
git status

# Add your changes
git add practice/<platform>/<difficulty>/<problem-name>/
git add logs/2026/01-january.md

# If you created a concept note, add it too
git add concepts/

# Commit with descriptive message
git commit -m "✅ [Platform]: [Problem Name] - [Key Topic]"

# Push to GitHub
git push
```

**Example commit messages:**
```bash
git commit -m "✅ Codewars: Absolute Value and Log to Base - SQL math functions"
git commit -m "✅ LeetCode 178: Rank Scores - Window functions"
git commit -m "✅ StrataScratch: Revenue Analysis - CTEs and aggregations"
```

---

## 📊 **Weekly: Update Stats Dashboard** ⏱️ *5 minutes*

**Run every Sunday (or end of week):**

```bash
python scripts/generate-stats.py
```

**Copy the output:**
```
## 📊 All-Time Stats

| Platform | Easy | Medium | Hard | Total |
|---------------|------|--------|------|-------|
| Codewars | 5 | 2 | 0 | 7 |
| Leetcode | 12 | 8 | 1 | 21 |
| **Total** | **17** | **10** | **1** | **28** |

📅 Last Updated: 2026-01-05 20:30
```

**Paste it into:**
```bash
code logs/README.md
```

**Replace the old stats section** with the new output.
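
For intuition about where the numbers in that table come from, here is a rough Python sketch that counts problem folders per platform and difficulty and renders a Markdown table. It is not the repository's `generate-stats.py`. The helper names are hypothetical, and the sketch assumes every solved problem is a directory under `practice/<platform>/<difficulty>/`.

```python
import tempfile
from collections import Counter
from pathlib import Path
from typing import Dict

DIFFICULTIES = ("easy", "medium", "hard")


def count_problems(practice_root: Path) -> Dict[str, Counter]:
    """Count problem folders for each platform and difficulty level."""
    counts: Dict[str, Counter] = {}
    for platform_dir in sorted(p for p in practice_root.iterdir() if p.is_dir()):
        per_level: Counter = Counter()
        for level in DIFFICULTIES:
            level_dir = platform_dir / level
            if level_dir.is_dir():
                per_level[level] = sum(1 for p in level_dir.iterdir() if p.is_dir())
        counts[platform_dir.name] = per_level
    return counts


def render_table(counts: Dict[str, Counter]) -> str:
    """Render the counts as a Markdown table with a bold totals row."""
    lines = [
        "| Platform | Easy | Medium | Hard | Total |",
        "|----------|------|--------|------|-------|",
    ]
    totals: Counter = Counter()
    for platform, per_level in counts.items():
        cells = [per_level[level] for level in DIFFICULTIES]
        totals.update(per_level)
        row = " | ".join(str(c) for c in cells)
        lines.append(f"| {platform.capitalize()} | {row} | {sum(cells)} |")
    total_cells = [totals[level] for level in DIFFICULTIES]
    bold = " | ".join(f"**{c}**" for c in total_cells)
    lines.append(f"| **Total** | {bold} | **{sum(total_cells)}** |")
    return "\n".join(lines)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "practice"
        (root / "codewars" / "easy" / "absolute-value-log-base").mkdir(parents=True)
        (root / "leetcode" / "medium" / "rank-scores").mkdir(parents=True)
        print(render_table(count_problems(root)))
```

Because the platforms are discovered by listing directories, adding a new platform folder needs no change to the counting code.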

**Also update:**
```markdown
## 🔥 Current Streaks
- **Daily Practice:** X days ← UPDATE THIS MANUALLY
```
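
If counting the streak by hand becomes tedious, it can be computed from the dates you practised. The sketch below is an illustration, not part of the repository's scripts: `current_streak` is a hypothetical helper, and it assumes a streak means consecutive calendar days ending today, so a missed day today resets it to zero.

```python
from datetime import date, timedelta
from typing import Iterable


def current_streak(practice_days: Iterable[date], today: date) -> int:
    """Number of consecutive practice days ending on `today`."""
    days = set(practice_days)
    streak = 0
    cursor = today
    # Walk backwards one day at a time until a gap appears.
    while cursor in days:
        streak += 1
        cursor -= timedelta(days=1)
    return streak


if __name__ == "__main__":
    logged = [date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 3), date(2026, 1, 5)]
    print(current_streak(logged, today=date(2026, 1, 3)))  # 3
    print(current_streak(logged, today=date(2026, 1, 5)))  # 1
```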

**Commit:**
```bash
git add logs/README.md
git commit -m "📊 Update weekly practice stats"
git push
```

---

## ✅ **Quick Reference Checklist**

Print this and keep it next to you:

```
□ Step 1: ./scripts/new-day.sh (once per day)

For each problem:
□ Step 2: ./scripts/create-problem.sh <platform> <difficulty> "<problem-name>"
□ Step 3: Write solution in solution.sql or solution.py
□ Step 4: Fill in notes.md:
    □ Change title
    □ Update metadata (platform, difficulty, time, status)
    □ Paste problem URL
    □ Check topic tags
    □ Paste problem statement
    □ Write approach & strategy
    □ Paste solution code
    □ Add complexity analysis
    □ Write key learnings (MOST IMPORTANT!)
    □ Link concept note (if created)

□ Step 5: Update logs/2026/01-january.md:
    □ Time spent today
    □ Add problem to "Completed" list
    □ Write today's learnings
    □ Plan tomorrow's focus

□ Step 6: git add → commit → push

Weekly:
□ Sunday: Run generate-stats.py
□ Update logs/README.md, the monthly log, and practice/README.md with the new stats
```
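
Part of this checklist can be verified mechanically by scanning `notes.md` for template placeholders that were never replaced. The following is a minimal sketch, assuming the placeholder strings shown in the templates earlier in this README. `unfilled_placeholders` is a hypothetical helper, not one of the repository's scripts.

```python
from typing import List

# Placeholder strings copied from the notes.md template sections in this README.
PLACEHOLDERS = (
    "[Problem Name]",
    "[Platform name]",
    "XX minutes",
    "[Problem URL]",
    "[Paste the problem description here]",
    "O(?)",
)


def unfilled_placeholders(notes_text: str) -> List[str]:
    """Return the template placeholders still present in the notes text."""
    return [p for p in PLACEHOLDERS if p in notes_text]


if __name__ == "__main__":
    draft = (
        "# Absolute Value and Log to Base\n"
        "- **Time Spent:** XX minutes\n"
        "- **Time Complexity:** O(?)\n"
    )
    print(unfilled_placeholders(draft))  # ['XX minutes', 'O(?)']
```

An empty list means none of the known placeholders remain. It does not prove the notes are complete, since sections such as key learnings have no fixed placeholder text to detect.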

---

## 💡 **Time-Saving Tips**

### **Minimal Version (10 min per problem)**
If you're short on time, focus on:
1. ✅ Solution code (`solution.sql`)
2. ✅ Key learnings in `notes.md`
3. ✅ Daily log entry

Skip the rest for now and come back later to fill it in.

---

### **Batch Update**
If you solve multiple problems:
1. Scaffold all problems first
2. Solve all problems
3. Update all `notes.md` files
4. Update daily log once (list all problems)
5. Single commit at the end

---

### **Use Snippets/Shortcuts**
Create editor snippets for repetitive sections like complexity analysis, common tags, etc.

---

## 🎯 **Summary: What You MUST Do Manually**

| File | What to Update |
|------|----------------|
| **solution.sql** | Your code |
| **notes.md** | Title, metadata, URL, approach, learnings |
| **logs/YYYY/MM-month.md** | Time, problems list, learnings, tomorrow's plan |
| **logs/README.md** | Weekly stats (copy from script output) |

**Everything else is automated!** 🎉

### Features

- **Automation Scripts:** Quickly scaffold new problems and logs with templates
- **Platform-Agnostic:** Automatically discovers and tracks any coding platform
- **Comprehensive Templates:** Detailed templates for problems, concepts, and daily logs
- **Progress Tracking:** Statistics generation and progress dashboards
- **Knowledge Base:** Structured concept notes linked to practice problems

See [practice/README.md](practice/README.md) for detailed usage instructions and workflow.

---