{"id":34851444,"url":"https://github.com/shiningflash/data-engineering-practice-problems","last_synced_at":"2026-05-25T15:01:43.053Z","repository":{"id":320434341,"uuid":"1074247955","full_name":"shiningflash/data-engineering-practice-problems","owner":"shiningflash","description":"collection of real-world data engineering scenarios — short, practical exercises","archived":false,"fork":false,"pushed_at":"2025-10-23T18:56:03.000Z","size":20,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-23T20:34:46.550Z","etag":null,"topics":["data-engineering","practice-problems","problem-solving","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shiningflash.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-11T12:29:37.000Z","updated_at":"2025-10-23T18:56:06.000Z","dependencies_parsed_at":"2025-10-23T20:36:44.111Z","dependency_job_id":"991a580c-2d0c-4e0a-99e9-9324b762fa0c","html_url":"https://github.com/shiningflash/data-engineering-practice-problems","commit_stats":null,"previous_names":["shiningflash/data-engineering-practice-problems"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/shiningflash/data-engineering-practice-problems","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fdata-engineering-practice-problems","tags_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fdata-engineering-practice-problems/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fdata-engineering-practice-problems/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fdata-engineering-practice-problems/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shiningflash","download_url":"https://codeload.github.com/shiningflash/data-engineering-practice-problems/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shiningflash%2Fdata-engineering-practice-problems/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28035466,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-25T02:00:05.988Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","practice-problems","problem-solving","python"],"created_at":"2025-12-25T19:19:58.282Z","updated_at":"2026-05-25T15:01:42.936Z","avatar_url":"https://github.com/shiningflash.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🧩 Data Engineering Practice Problems\n\n\u003e After solving 1,500+ problems on LeetCode and Codeforces, I realized,\n\u003e **none of them prepared 
me for broken CSVs, delayed Kafka messages, or JSONs that lie.**\n\nThis repo is for engineers who’ve had enough of toy problems. It’s a collection of **real-world data engineering scenarios**: short, practical exercises inspired by what actually breaks in production.\n\n---\n\n## Why I Built This\n\nMost practice problems test logic.\nProduction tests resilience.\n\nIn production, problems don’t come with test cases; they come with missing data, bad assumptions, and time pressure.\n\nSo I started collecting real scenarios I’ve seen:\n\n* Kafka topics that send data hours late,\n* CSVs with 2 million rows and 6 different date formats,\n* JSON events with new fields added mid-release,\n* ETL jobs that “succeed” but quietly skip records,\n* Dashboards that stop updating without errors, etc.\n\n---\n\n## What’s Inside\n\n| Category | Scenario | What You’ll Practice |\n|:----------------------- |:----------------------------------------------- |:------------------------------------------- |\n| **Late Data** | 10 GB of IoT logs arriving out of order | Handle streaming delays without duplication |\n| **Schema Drift** | JSON events adding new fields mid-release | Validate and evolve safely |\n| **ETL Reliability** | Long-running jobs silently skipping records | Detect silent corruptions before they spread |\n| **Data Hygiene** | Partner CSVs with missing headers and fake nulls | Clean data in one pass and log every fix |\n| **Rolling Analytics** | Continuous sensor feeds with infinite rows | Keep rolling metrics in memory without dying |\n\nAnd many more coming...\n\n\u003e Each problem is small enough to solve in hours, but real enough to prepare you for production.\n\n---\n\n## Getting Started\n\n```bash\n# 1. Set up your environment\npython -m venv venv \u0026\u0026 source venv/bin/activate\n\n# 2. Use Python 3.10+\n# 3. 
Browse PROBLEMS.md for the full index, or open any folder in problems/.\n#    Each problem lives in problems/NNN-slug/ and contains:\n#      - question.md   (problem statement, with YAML frontmatter)\n#      - solution.md   (written walkthrough)  OR  solution.py (runnable code)\n```\n\nInputs live in `data/`; outputs are generated beside them for easy inspection. Data files are excluded intentionally to keep the repo lightweight.\n\n### Repo layout\n\n```\nproblems/\n  001-log-file-error-analysis/\n    question.md\n    solution.py\n  ...\ndata/                # sample input files (gitignored where large)\nscripts/\n  build_index.py     # regenerates PROBLEMS.md from question.md frontmatter\nPROBLEMS.md          # generated index — do not edit by hand\n```\n\n---\n\n## How to Contribute\n\nIf you’ve debugged a broken pipeline,\ncaught a silent bug before it spread,\nbuilt a clever patch that saved a release,\nor found a way to clean a 5 GB CSV in one pass,\nyour story belongs here.\n\nAdd a new scenario, or improve an existing one.\nSee the [Contribution Guide](CONTRIBUTION.md) for details.\n\n---\n\n\u003e **The goal isn’t to practice coding.**\n\u003e It’s to practice *judgment*, the kind that keeps systems running when logic alone isn’t enough.\n\n⭐ Star the repo if you’ve ever learned more from production than from tutorials.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshiningflash%2Fdata-engineering-practice-problems","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshiningflash%2Fdata-engineering-practice-problems","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshiningflash%2Fdata-engineering-practice-problems/lists"}