{"id":27443898,"url":"https://github.com/josephmachado/efficient_data_processing_spark","last_synced_at":"2025-04-15T02:58:07.707Z","repository":{"id":236156551,"uuid":"723117906","full_name":"josephmachado/efficient_data_processing_spark","owner":"josephmachado","description":"Code for \"Efficient Data Processing in Spark\" Course","archived":false,"fork":false,"pushed_at":"2024-10-01T00:38:06.000Z","size":25018,"stargazers_count":292,"open_issues_count":2,"forks_count":62,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-15T02:58:02.124Z","etag":null,"topics":["apache-spark","data-engineering","data-pipeline","minio","pyspark","pyspark-notebook"],"latest_commit_sha":null,"homepage":"https://josephmachado.podia.com/efficient-data-processing-in-spark","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/josephmachado.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-24T18:24:40.000Z","updated_at":"2025-04-14T06:24:15.000Z","dependencies_parsed_at":"2024-05-05T06:32:17.536Z","dependency_job_id":"3ffb3e6d-6199-421b-be8e-dfbaa3b44a34","html_url":"https://github.com/josephmachado/efficient_data_processing_spark","commit_stats":null,"previous_names":["josephmachado/efficient_data_processing_spark"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fefficient_data_processing_spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fefficient_data_processing_spark/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fefficient_data_processing_spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fefficient_data_processing_spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/josephmachado","download_url":"https://codeload.github.com/josephmachado/efficient_data_processing_spark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248997095,"owners_count":21195797,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","data-engineering","data-pipeline","minio","pyspark","pyspark-notebook"],"created_at":"2025-04-15T02:58:07.151Z","updated_at":"2025-04-15T02:58:07.699Z","avatar_url":"https://github.com/josephmachado.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Code for my [Efficient Data Processing in Spark](https://josephmachado.podia.com/efficient-data-processing-in-spark?coupon=SUBSPECIAL524) course.\n\n# Efficient Data Processing in Spark \n- [Efficient Data Processing in Spark](#efficient-data-processing-in-spark)\n  - [Setup](#setup)\n    - [Create aliases for long commands with a Makefile](#create-aliases-for-long-commands-with-a-makefile)\n    - [Run a Jupyter notebook](#run-a-jupyter-notebook)\n  - [Infrastructure](#infrastructure)\n\n\nRepository for examples and exercises from the \"Efficient Data Processing in Spark\" course (under [data-processing-spark](./data-processing-spark/)). 
The capstone project is also present in this repository (under [capstone/rainforest](./capstone/rainforest/)).\n\n## Setup\n\nTo run the project, you'll need to install the following:\n \n1. [git version \u003e= 2.37.1](https://github.com/git-guides/install-git)\n2. [Docker version \u003e= 20.10.17](https://docs.docker.com/engine/install/) and [Docker Compose v2 version \u003e= v2.10.2](https://docs.docker.com/compose/#compose-v2-and-the-new-docker-compose-command).\n\n**Windows users**: please set up WSL and a local Ubuntu virtual machine following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)** (only Step 1 is necessary). Please install the **make** command with `sudo apt install make -y` (if it's not already present). \n\nAll the commands shown below are to be run via the terminal (use the Ubuntu terminal for WSL users). The `make` commands in this book should be run in the `efficient_data_processing_spark` folder. We will use Docker to set up our containers. 
Clone and move into the lab repository, as shown below.\n\n**Note**: If you are using a Mac with an M1 or later chip, please replace the \"FROM deltaio/delta-docker:latest\" in [data-processing-spark/1-lab-setup/containers/spark/Dockerfile](./data-processing-spark/1-lab-setup/containers/spark/Dockerfile) with \"FROM deltaio/delta-docker:latest_arm64\".\n\n\n```bash\ngit clone https://github.com/josephmachado/efficient_data_processing_spark.git\ncd efficient_data_processing_spark\n# Start Docker containers and create data for exercises and capstone project\n# If you are using a Mac with an M1 or later chip, please replace the \"FROM deltaio/delta-docker:latest\"\n# in data-processing-spark/1-lab-setup/containers/spark/Dockerfile\n# with \"FROM deltaio/delta-docker:latest_arm64\"\nmake restart \u0026\u0026 make setup\n```\n\n### Create aliases for long commands with a Makefile\n\nA **Makefile** lets you define shortcuts for commands that you might want to run. For example, in our \u003cu\u003e[Makefile](https://github.com/josephmachado/efficient_data_processing_spark/blob/main/Makefile)\u003c/u\u003e, we set the alias `spark-sql` for the command that opens a Spark SQL session.\n\nWe have some helpful **make** commands for working with our systems. Shown below are the make commands and their definitions:\n\n1. `make restart`: Stops running Docker containers (if any) and starts new containers for our data infra.\n2. `make setup`: Generates data, [loads it into tables](https://github.com/josephmachado/efficient_data_processing_spark/blob/main/containers/spark/setup.sql), and starts the Spark history server, where we can see logs/Spark UI for already completed jobs.\n3. `make spark-sql`: Opens a Spark SQL session; use `exit` to quit the CLI. **This is where you will type your SQL queries**.\n4. `make cr`: Runs our PySpark code; paste the relative path of an exercise/example file under the [data-processing-spark](./data-processing-spark/) folder. See the example image shown below.\n5. 
`make rainforest`: Runs our rainforest capstone project; the entry point for this code is [here](./capstone/run_code.py).\n\nThis is how you run PySpark exercise files:\n![make cr example](./assets/make_cr.gif)\n\nYou can see the commands in \u003cu\u003e[this Makefile](https://github.com/josephmachado/efficient_data_processing_spark/blob/main/Makefile)\u003c/u\u003e. If your terminal does not support **make** commands, please use the commands in \u003cu\u003e[the Makefile](https://github.com/josephmachado/efficient_data_processing_spark/blob/main/Makefile)\u003c/u\u003e directly. All the commands in this book assume that you have the Docker containers running.\n\nYou can test and run the capstone project as follows:\n\n```bash\nmake pytest # to run all test cases\nmake ci # to run linting, formatting, and type checks\nmake rainforest # to run our ETL and create the final reports\n```\n\n### Run a Jupyter notebook\n\nUse the following command to start a Jupyter server:\n\n```bash\nmake notebook\n```\n\nYou will see a link displayed in the format http://127.0.0.1:3000/?token=your-token. Click it to open the Jupyter notebook in your browser. You can use the [local Jupyter notebook sample](./assets/sample_jupyter_notebook.ipynb) to get started.\n\nYou can stop the Jupyter server with Ctrl + C.\n\n## Infrastructure\n\nWe have three major services that run together:\n\n1. **Postgres database**: We use a Postgres database to simulate an upstream application database for our rainforest capstone project.\n2. **Spark cluster**: We create a Spark cluster with a master and 2 workers, which is where the data is processed. The Spark cluster also includes a history server, which displays the logs and resource utilization (Spark UI) for completed/failed Spark applications.\n3. **Minio**: Minio is open source software with an API that is fully compatible with the AWS S3 cloud storage system. 
We use Minio to replicate S3 locally.\n\n![Infra](./assets/infra.png)\n\nAll our Spark images are built from the official Spark Delta image and have the necessary modules installed. You can find the Dockerfiles defined [here](./data-processing-spark/1-lab-setup/containers/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fefficient_data_processing_spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjosephmachado%2Fefficient_data_processing_spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fefficient_data_processing_spark/lists"}