{"id":30151665,"url":"https://github.com/josephmachado/data_engineering_for_beginners_code","last_synced_at":"2025-08-11T11:09:05.857Z","repository":{"id":304824297,"uuid":"1020128220","full_name":"josephmachado/data_engineering_for_beginners_code","owner":"josephmachado","description":"Code for DE101 book at https://de101.startdataengineering.com/","archived":false,"fork":false,"pushed_at":"2025-08-07T15:34:41.000Z","size":30170,"stargazers_count":36,"open_issues_count":0,"forks_count":63,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-07T17:40:18.660Z","etag":null,"topics":["airflow","dbt","python","spark","sql"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/josephmachado.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-15T11:35:28.000Z","updated_at":"2025-08-07T15:34:44.000Z","dependencies_parsed_at":"2025-07-16T04:21:50.945Z","dependency_job_id":"7773cc89-25f9-4ee1-8eae-5b123cac3e17","html_url":"https://github.com/josephmachado/data_engineering_for_beginners_code","commit_stats":null,"previous_names":["josephmachado/data_engineering_for_beginners_code"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/josephmachado/data_engineering_for_beginners_code","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdata_engineering_for_beginners_code","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdata_engineering_for_beginners_code/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdata_engineering_for_beginners_code/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdata_engineering_for_beginners_code/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/josephmachado","download_url":"https://codeload.github.com/josephmachado/data_engineering_for_beginners_code/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdata_engineering_for_beginners_code/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269873139,"owners_count":24488993,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-11T02:00:10.019Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","dbt","python","spark","sql"],"created_at":"2025-08-11T11:02:23.051Z","updated_at":"2025-08-11T11:09:05.844Z","avatar_url":"https://github.com/josephmachado.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"* [Data Engineering for Beginners](#data-engineering-for-beginners)\n    * [Setup](#setup)\n        * [Prerequisites](#prerequisites)\n        * [Starting and stopping containers](#starting-and-stopping-containers)\n        * [Running code via Jupyter Notebooks](#running-code-via-jupyter-notebooks)\n  
      * [Airflow \u0026 dbt](#airflow--dbt)\n\n# Data Engineering for Beginners\n\nCode for the [Data Engineering for Beginners e-book](https://de101.startdataengineering.com/).\n\n## Setup\n\nThe code for the SQL, Python, and data model sections is written using Spark SQL. To run the code, you will need the prerequisites listed below.\n\n### Prerequisites\n\n1. [git version \u003e= 2.37.1](https://github.com/git-guides/install-git)\n2. [Docker version \u003e= 20.10.17](https://docs.docker.com/engine/install/) and [Docker Compose v2 version \u003e= v2.10.2](https://docs.docker.com/compose/#compose-v2-and-the-new-docker-compose-command).\n\n**Windows users**: please set up WSL and a local Ubuntu virtual machine following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. Install the above prerequisites from your Ubuntu terminal; if you have trouble installing Docker, follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)** (only Step 1 is necessary). Please install the **make** command with `sudo apt install make -y` (if it's not already present). \n\n### Starting and stopping containers\n\nFork this repository **[data_engineering_for_beginners_code](https://github.com/josephmachado/data_engineering_for_beginners_code/tree/main?tab=readme-ov-file#setup)**.  
\n![GitHub Fork](./images/fork.png)\nAfter forking, clone the repo to your local machine and start the containers as shown below:\n\n```bash\ngit clone https://github.com/your-user-name/data_engineering_for_beginners_code.git\ncd data_engineering_for_beginners_code\ndocker compose up -d # start the Docker containers in the background\nsleep 30 # wait about 30 seconds for the containers to finish starting\n```\n\n### Running code via Jupyter Notebooks\n\nOpen the Starter Jupyter Notebook at [http://localhost:8888/lab/tree/notebooks/starter-notebook.ipynb](http://localhost:8888/lab/tree/notebooks/starter-notebook.ipynb) and try out the commands in the [Data Engineering for Beginners e-book](https://www.startdataengineering.com/) as shown below.\n\n![Notebook Template](./images/nb_template.png)\n\nIf you are creating a new notebook, make sure to select the `Python 3 (ipykernel)` Notebook.\n\nWhen you are done, stop the Docker containers with the command below:\n\n```bash\ndocker compose down \n```\n\n### Airflow \u0026 dbt\n\nFor the Airflow, dbt \u0026 capstone sections, go into the `airflow` directory and run the make commands as shown below.\n\n**Note:** All the code in the dbt, Airflow and capstone chapters is to be run from the terminal in the `data_engineering_for_beginners_code/airflow` directory.\n\n```bash\ndocker compose down # Make sure to stop the Spark/Jupyter Notebook containers before starting the Airflow containers\ncd airflow\nmake restart # This will ask for your password to create some folders\n```\n\nYou can open the Airflow UI at [http://localhost:8080](http://localhost:8080) and log in with `airflow` as both the username and the password. 
In the Airflow UI, you can run the DAG.\n\nAfter the DAG has run, run `make dbt-docs` in the terminal to have dbt serve the docs, which are viewable at [http://localhost:8081](http://localhost:8081).\n\nYou can stop the containers \u0026 return to the parent directory as shown below:\n\n```bash\nmake down\ncd ..\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fdata_engineering_for_beginners_code","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjosephmachado%2Fdata_engineering_for_beginners_code","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fdata_engineering_for_beginners_code/lists"}