{"id":17657462,"url":"https://github.com/danielbeach/data-engineering-practice","last_synced_at":"2025-05-14T07:08:15.707Z","repository":{"id":37730380,"uuid":"460464085","full_name":"danielbeach/data-engineering-practice","owner":"danielbeach","description":"Data Engineering Practice Problems","archived":false,"fork":false,"pushed_at":"2025-01-08T21:38:03.000Z","size":47433,"stargazers_count":2031,"open_issues_count":1,"forks_count":570,"subscribers_count":40,"default_branch":"main","last_synced_at":"2025-04-10T09:55:55.199Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danielbeach.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-17T14:13:36.000Z","updated_at":"2025-04-10T09:25:11.000Z","dependencies_parsed_at":"2025-01-22T18:00:42.619Z","dependency_job_id":"a9c629a4-0a5d-4cf6-8727-2badc95c2a1b","html_url":"https://github.com/danielbeach/data-engineering-practice","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielbeach%2Fdata-engineering-practice","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielbeach%2Fdata-engineering-practice/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielbeach%2Fdata-engineering-practice/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielbeach%2Fdata-engineering-practice/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/danielbeach","download_url":"https://codeload.github.com/danielbeach/data-engineering-practice/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254092657,"owners_count":22013290,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-23T14:40:49.618Z","updated_at":"2025-05-14T07:08:15.650Z","avatar_url":"https://github.com/danielbeach.png","language":"Python","readme":"## Data Engineering Practice Problems\n\nOne of the main obstacles in Data Engineering is the large\nand varied set of technical skills that can be required on a\nday-to-day basis.\n\n*** Note - If you email a link to your GitHub repo with all the completed\nexercises, I will send you back a free copy of my ebook Introduction to Data Engineering. ***\n\nThe aim of this repository is to help you develop and\nlearn those skills. Generally, here are the high-level\ntopics that these practice problems will cover.\n\n- Python data processing.\n- csv, flat-file, parquet, json, etc.\n- SQL database table design.\n- Python + Postgres, data ingestion and retrieval.\n- PySpark\n- Data cleansing / dirty data.\n\n### How to work on the problems.\nYou will need two things to work effectively on almost all\nof these problems. 
\n- `Docker`\n- `docker-compose`\n\nAll the tools and technologies you need will be packaged\ninto the `Dockerfile` for each exercise.\n\nFor each exercise you will need to `cd` into that folder and\nrun the `docker build` command; that command is listed in\nthe `README` for each exercise, so follow those instructions.\n\n### Beginner Exercises\n\n#### Exercise 1 - Downloading files.\nThe [first exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-1) tests your ability to download a number of files\nfrom an `HTTP` source and unzip them, storing them locally with `Python`.\n`cd Exercises/Exercise-1` and see the `README` in that location for instructions.\n\n#### Exercise 2 - Web Scraping + Downloading + Pandas\nThe [second exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-2)\ntests your ability to perform web scraping, build URIs, download files, and use Pandas to\ndo some simple cumulative actions.\n`cd Exercises/Exercise-2` and see the `README` in that location for instructions.\n\n#### Exercise 3 - Boto3 AWS + s3 + Python.\nThe [third exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-3) tests a few skills.\nThis time we will be using a popular `aws` package called `boto3` to perform multi-step\nactions to download some open-source `s3` data files.\n`cd Exercises/Exercise-3` and see the `README` in that location for instructions.\n\n#### Exercise 4 - Convert JSON to CSV + Ragged Directories.\nThe [fourth exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-4)\nfocuses on more file types, `json` and `csv`, and working with them in `Python`.\nYou will have to traverse a ragged directory structure, finding any `json` files\nand converting them to `csv`.\n\n#### Exercise 5 - Data Modeling for Postgres + Python.\nThe [fifth 
exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-5)\nis going to be a little different from the rest. In this problem you will be given a number of\n`csv` files. You must create a data model / schema to hold these data sets, including indexes,\nthen create all the tables inside `Postgres` by connecting to the database with `Python`.\n\n\n### Intermediate Exercises\n\n#### Exercise 6 - Ingestion and Aggregation with PySpark.\nThe [sixth exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-6)\nsteps it up a little and moves on to more popular tools. In this exercise you are going\nto load some files using `PySpark` and then be asked to do some basic aggregation.\nBest of luck!\n\n#### Exercise 7 - Using Various PySpark Functions\nThe [seventh exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-7)\ntakes a page out of the previous one: it focuses on using a few of the\nmore common built-in PySpark functions from `pyspark.sql.functions` and applying them\nto real-life problems.\n\nMany times, to solve simple problems, we have to find and use multiple functions available\nfrom libraries. This will test your ability to do that.\n\n#### Exercise 8 - Using DuckDB for Analytics and Transforms.\nThe [eighth exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-8)\ncovers `DuckDB`. Using new tools is imperative to growing as a Data Engineer, and DuckDB is one of those new tools. In this\nexercise you will have to complete a number of analytical and transformation tasks using DuckDB. 
This\nwill require an understanding of the functions and documentation of DuckDB.\n\n#### Exercise 9 - Using Polars lazy computation.\nThe [ninth exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-9)\nuses `Polars`, a new Rust-based tool with a wonderful Python package that has taken Data Engineering by\nstorm. It's better than Pandas because it has both a SQL context and support for lazy evaluation\non larger-than-memory data sets! Show your lazy skills!\n\n\n### Advanced Exercises\n\n#### Exercise 10 - Data Quality with Great Expectations\nThe [tenth exercise](https://github.com/danielbeach/data-engineering-practice/tree/main/Exercises/Exercise-10)\nis here to help you learn about Data Quality, specifically a tool called Great Expectations. You will\nbe given an existing dataset in CSV format, as well as an existing pipeline. There is a data quality issue,\nand you will be asked to implement some Data Quality checks to catch some of these issues.","funding_links":[],"categories":["Python","Uncategorized"],"sub_categories":["Uncategorized"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielbeach%2Fdata-engineering-practice","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanielbeach%2Fdata-engineering-practice","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielbeach%2Fdata-engineering-practice/lists"}