{"id":27443874,"url":"https://github.com/josephmachado/de_project","last_synced_at":"2025-04-15T02:58:01.198Z","repository":{"id":257619935,"uuid":"858828036","full_name":"josephmachado/de_project","owner":"josephmachado","description":"Step by step instructions to create a production-ready data pipeline","archived":false,"fork":false,"pushed_at":"2024-12-23T07:03:34.000Z","size":4521,"stargazers_count":44,"open_issues_count":0,"forks_count":12,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-15T02:57:56.938Z","etag":null,"topics":["dataengineering","datapipeline","python"],"latest_commit_sha":null,"homepage":"https://www.startdataengineering.com/post/de-proj-step-by-step/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/josephmachado.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-17T15:50:22.000Z","updated_at":"2025-04-04T23:35:43.000Z","dependencies_parsed_at":"2024-09-17T19:55:34.704Z","dependency_job_id":"498be154-d796-42e9-b891-94da665d439f","html_url":"https://github.com/josephmachado/de_project","commit_stats":null,"previous_names":["josephmachado/de_project"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fde_project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fde_project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fde_project/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fde_project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/josephmachado","download_url":"https://codeload.github.com/josephmachado/de_project/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248997095,"owners_count":21195797,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataengineering","datapipeline","python"],"created_at":"2025-04-15T02:58:00.700Z","updated_at":"2025-04-15T02:58:01.192Z","avatar_url":"https://github.com/josephmachado.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"* [Build a data engineering project, with step-by-step instructions](#build-a-data-engineering-project-with-step-by-step-instructions)\n     * [Data used](#data-used)\n     * [Architecture](#architecture)\n     * [Setup](#setup)\n          * [Option 1: Github codespaces (Recommended)](#option-1-github-codespaces-recommended)\n          * [Option 2: Run locally](#option-2-run-locally)\n\n# Build a data engineering project, with step-by-step instructions\n\n* Code for the blog: **[Build data engineering projects with step-by-step instruction](https://www.startdataengineering.com/post/de-proj-step-by-step/)**\n* **Live workshop link**\n\n  [![Live workshop](https://img.youtube.com/vi/bfiOLwp1aWM/0.jpg)](https://www.youtube.com/live/bfiOLwp1aWM)\n\n\n## Data used \n\nLet's assume we are working with a car part seller database (tpch). The data is available in a duckdb database. 
See the data model below:\n\n![TPCH data model](./assets/images/tpch_erd.png)\n\nWe can create fake input data using the [create_input_data.py](https://github.com/josephmachado/de_project/blob/main/setup/create_input_data.py) script.\n\n## Architecture\n\nMost data teams have their version of the 3-hop architecture. For example, dbt has its own version (stage, intermediate, mart), and Spark has medallion (bronze, silver, gold) architecture.\n\n![Data Flow](./assets/images/dep-arch.png)\n\n**Tools used:**\n\n1. [\u003cimg src=\"https://raw.githubusercontent.com/pola-rs/polars-static/master/banner/polars_github_banner.svg\" height=\"50\" alt=\"Polars logo\" /\u003e](https://pola.rs/)\n2. [\u003cimg src=\"./assets/images/docker.png\" height=\"50\" alt=\"Docker logo\" /\u003e](https://www.docker.com/)\n3. [\u003cimg src=\"./assets/images/airflow.png\" height=\"50\" alt=\"Apache Airflow logo\" /\u003e](https://airflow.apache.org/)\n4. [\u003cimg src=\"./assets/images/pytest.png\" height=\"50\" alt=\"Pytest logo\" /\u003e](https://docs.pytest.org/en/stable/)\n5. [\u003cimg src=\"./assets/images/duckdb.png\" height=\"50\" alt=\"DuckDB logo\" /\u003e](https://duckdb.org/)\n\n## Setup\n\nYou have two options to run the exercises in this repo:\n\n### Option 1: Github codespaces (Recommended)\n\nSteps:\n\n1. Create [Github codespaces with this link](https://github.com/codespaces/new?skip_quickstart=true\u0026machine=basicLinux32gb\u0026repo=858828036\u0026ref=main\u0026devcontainer_path=.devcontainer%2Fdevcontainer.json\u0026geo=UsWest).\n2. Wait for Github to install the [requirements.txt](./requirements.txt). This step can take about 5 minutes.\n        ![installation](./assets/images/inst.png)\n3. Open `setup-data-project.ipynb`; it will open in a Jupyter notebook interface. When asked for your kernel choice, choose `Python Environments` and then `python3.12.00 Global`.\n        ![Jupyter notebook in VScode](./assets/images/vsjupy.png)\n4. 
The **[setup-data-project](./setup-data-project.ipynb)** notebook goes over how to create a data pipeline.\n5. In the terminal, run the following commands to set up the input data, run the ETL pipeline, and run the tests.\n\n```bash\n# set up input data\npython ./setup/create_input_data.py\n# run pipeline\npython dags/run_pipeline.py\n# run tests\npython -m pytest dags/tests/unit/test_dim_customer.py\n```\n\n### Option 2: Run locally\n\nSteps:\n\n1. Clone this repo and `cd` into the cloned repo.\n2. Start a virtual env and install requirements.\n3. Start JupyterLab and run the `setup-data-project.ipynb` notebook, which goes over how to create a data pipeline.\n```bash\ngit clone https://github.com/josephmachado/de_project.git\ncd de_project \nrm -rf env\npython -m venv ./env # create a virtual env\nsource env/bin/activate # use virtual environment\npip install -r requirements.txt\njupyter lab\n```\n4. In the terminal, run the following commands to set up the input data, run the ETL pipeline, and run the tests.\n\n```bash\n# set up input data\npython ./setup/create_input_data.py\n# run pipeline\npython dags/run_pipeline.py\n# run tests\npython -m pytest dags/tests/unit/test_dim_customer.py\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fde_project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjosephmachado%2Fde_project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fde_project/lists"}