https://github.com/shiningflash/data-engineering-practice-problems

collection of real-world data engineering scenarios — short, practical exercises

data-engineering practice-problems problem-solving python

Last synced: 19 days ago

Host: GitHub
URL: https://github.com/shiningflash/data-engineering-practice-problems
Owner: shiningflash
Created: 2025-10-11T12:29:37.000Z (3 months ago)
Default Branch: main
Last Pushed: 2025-10-23T18:56:03.000Z (3 months ago)
Last Synced: 2025-10-23T20:34:46.550Z (3 months ago)
Topics: data-engineering, practice-problems, problem-solving, python
Language: Python
Homepage:
Size: 19.5 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# 🧩 Data Engineering Practice Problems

> After solving 1,500+ problems on LeetCode and Codeforces, I realized —
> **none of them prepared me for broken CSVs, delayed Kafka messages, or JSONs that lie.**

This repo is for engineers who’ve had enough of toy problems. It’s a collection of **real-world data engineering scenarios** — short, practical exercises inspired by what actually breaks in production.

---

## Why I Built This

Most practice problems test logic.
Production tests resilience.

In production, problems don’t come with test cases — they come with missing data, bad assumptions, and time pressure.

So I started collecting real scenarios I’ve seen:

* Kafka topics that send data hours late
* CSVs with 2 million rows and 6 different date formats
* JSON events with new fields added mid-release
* ETL jobs that “succeed” but quietly skip records
* Dashboards that stop updating without errors
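
Two of the scenarios above, mixed date formats and jobs that quietly skip records, can be sketched together. This is a minimal illustration, not one of the repo's reference solutions: the format list and the `parse_date` and `normalize_dates` helpers are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical set of formats; a real file needs its own list, and the order
# decides how ambiguous values such as "03/04/2025" are interpreted.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%Y/%m/%d", "%d %b %Y", "%Y%m%d")


def parse_date(raw):
    """Return an ISO date string, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_dates(values):
    """Parse every value and return (parsed, rejected) so nothing is dropped silently."""
    parsed, rejected = [], []
    for value in values:
        iso = parse_date(value)
        if iso is None:
            rejected.append(value)
        else:
            parsed.append(iso)
    return parsed, rejected


parsed, rejected = normalize_dates(
    ["2025-10-11", "23/10/2025", "11 Oct 2025", "not a date"]
)
print(parsed)    # rows that matched one of the known formats
print(rejected)  # rows to report instead of skipping quietly
```

Returning the rejected values alongside the parsed ones keeps the row count checkable: the two lists together should always add up to the input size.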

---

## What’s Inside

And many more coming ...

> Each problem is small enough to solve in hours, but real enough to prepare you for production.

---

## Getting Started

```bash
# 1. Set up your environment (requires Python 3.10+)
python -m venv venv && source venv/bin/activate

# 2. Pick a problem
# Each folder has a question.md and a reference solution.py
```

Inputs live in `data/`, and outputs are generated beside them for easy inspection. Data files are intentionally excluded to keep the repo lightweight.
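
As a rough sketch of that layout, a solution could derive its output path from the input path so that both end up in the same directory. The file name `data/orders.csv`, the `_cleaned` suffix, and the `output_path_for` helper are hypothetical; the repo's reference solutions may name their outputs differently.

```python
from pathlib import Path


def output_path_for(input_path, suffix="_cleaned"):
    """Build an output path in the same directory as the input file."""
    path = Path(input_path)
    # Keep the directory and extension; only the file stem gets the suffix.
    return path.with_name(f"{path.stem}{suffix}{path.suffix}")


result = output_path_for("data/orders.csv")
print(result.as_posix())
```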

---

## How to Contribute

If you’ve debugged a broken pipeline,
caught a silent bug before it spread,
built a clever patch that saved a release,
or found a way to clean a 5 GB CSV in one pass,
your story belongs here.

Add a new scenario, or improve an existing one.
See the [Contribution Guide](CONTRIBUTION.md) for details.

---

> **The goal isn’t to practice coding.**
> It’s to practice *judgment* — the kind that keeps systems running when logic alone isn’t enough.

⭐ Star the repo if you’ve ever learned more from production than from tutorials.
