{"id":27443907,"url":"https://github.com/josephmachado/docker_for_data_engineers","last_synced_at":"2026-03-12T14:39:26.205Z","repository":{"id":235210182,"uuid":"790299704","full_name":"josephmachado/docker_for_data_engineers","owner":"josephmachado","description":"Code for blog at: https://www.startdataengineering.com/post/docker-for-de/","archived":false,"fork":false,"pushed_at":"2024-04-29T18:19:14.000Z","size":574,"stargazers_count":36,"open_issues_count":0,"forks_count":15,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-15T02:58:14.195Z","etag":null,"topics":["apachespark","docker","docker-compose","pyspark","pyspark-notebook"],"latest_commit_sha":null,"homepage":"https://www.startdataengineering.com/post/docker-for-de/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/josephmachado.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-22T16:19:19.000Z","updated_at":"2025-04-14T16:55:13.000Z","dependencies_parsed_at":"2024-04-22T17:45:55.828Z","dependency_job_id":"8a7d9a3a-0969-4093-87d2-c4dac05337f7","html_url":"https://github.com/josephmachado/docker_for_data_engineers","commit_stats":null,"previous_names":["josephmachado/docker_for_data_engineers"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/josephmachado/docker_for_data_engineers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdocker_for_data_engineers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdocker_for_data_engine
ers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdocker_for_data_engineers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdocker_for_data_engineers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/josephmachado","download_url":"https://codeload.github.com/josephmachado/docker_for_data_engineers/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fdocker_for_data_engineers/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279005910,"owners_count":26083982,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apachespark","docker","docker-compose","pyspark","pyspark-notebook"],"created_at":"2025-04-15T02:58:10.760Z","updated_at":"2025-10-11T01:05:47.971Z","avatar_url":"https://github.com/josephmachado.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Docker for Data Engineers\n\nCode for blog at: https://www.startdataengineering.com/post/docker-for-de/\n\nIn order to run the code in this post you'll need to install the following:\n \n1. [git version \u003e= 2.37.1](https://github.com/git-guides/install-git)\n2. 
[Docker version \u003e= 20.10.17](https://docs.docker.com/engine/install/) and [Docker Compose v2 version \u003e= v2.10.2](https://docs.docker.com/compose/#compose-v2-and-the-new-docker-compose-command).\n\n**Windows users**: please set up WSL and a local Ubuntu virtual machine following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)** (only Step 1 is necessary). Please install the **make** command with `sudo apt install make -y` (if it's not already present). \n\nAll the commands shown below are to be run via the terminal (use the Ubuntu terminal for WSL users).\n\n```bash\ngit clone https://github.com/josephmachado/docker_for_data_engineers.git\ncd docker_for_data_engineers\n# Build our custom image based off of our local Dockerfile\ndocker compose build spark-master\n# start containers\ndocker compose up --build -d --scale spark-worker=2\ndocker ps # see list of running docker containers and their settings\n# stop containers\ndocker compose down\n```\n\nUsing the exec command, you can submit commands to be run in a specific container. For example, we can use the following to open a bash terminal in our `spark-master` container:\n\n```bash\ndocker exec -ti spark-master bash\n# You will be in the master container bash shell\nexit # exit the container\n```\n\nNote that the `-ti` flags run the command in interactive mode with a terminal attached. 
As shown below, we can run a command without interactive mode and get an output.\n\n```bash\ndocker exec spark-master echo hello\n# prints hello\n```\n\n## Running a Jupyter notebook\n\nUse the following command to start a Jupyter server:\n\n```bash\ndocker exec spark-master bash -c \"jupyter notebook --ip=0.0.0.0 --port=3000 --allow-root\"\n```\n\nYou will see a link of the form `http://127.0.0.1:3000/?token=your-token`; click it to open the Jupyter notebook in your browser. You can use the [sample Jupyter notebook](./sample_jupyter_spark_nb.ipynb) to get started.\n\nYou can stop the Jupyter server with ctrl + c.\n\n## Running on GitHub Codespaces\n\n**Important**❗ Make sure you shut down your codespace instance when you are done; codespaces can cost money (see: [pricing ref](https://github.com/features/codespaces)).\n\nYou can run our data infra in a GitHub Codespace container as shown below.\n\n1. Clone this repo, and click on `Code` -\u003e `Codespaces` -\u003e `Create codespace on main` on the GitHub repo page.\n2. In the codespace, start the Docker containers with `docker compose up --build -d`. Note that we skip the `--scale spark-worker=2` option, since we don't want to tax the codespace VM.\n3. 
Run commands as you would in your terminal.\n\n![Start codespace](./assets/cs-1.png)\n![Run ETL on codespace](./assets/cs-2.png)\n\n**Note**: If you want to use the Jupyter notebook in a codespace, forward port 3000 by following the steps [here](https://docs.github.com/en/codespaces/developing-in-a-codespace/forwarding-ports-in-your-codespace#forwarding-a-port).\n   \n# Testing PySpark Applications\n\nCode for blog at: https://www.startdataengineering.com/post/test-pyspark/\n\n## Create fake upstream data\n\nIn our upstream (Postgres database), we can create fake data with the [datagen.py](./capstone/upstream_datagen/datagen.py) script, as shown:\n\n```bash\ndocker exec spark-master bash -c \"python3 /opt/spark/work-dir/capstone/upstream_datagen/datagen.py\"\n```\n\n## Run simple ETL\n\n```bash\ndocker exec spark-master spark-submit --master spark://spark-master:7077 --deploy-mode client /opt/spark/work-dir/etl/simple_etl.py\n```\n\n## Run tests\n\n```bash\ndocker exec spark-master bash -c 'python3 -m pytest --log-cli-level info -p no:warnings -v /opt/spark/work-dir/etl/tests'\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fdocker_for_data_engineers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjosephmachado%2Fdocker_for_data_engineers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fdocker_for_data_engineers/lists"}