{"id":15208671,"url":"https://github.com/turnipdo/docker-spark-setup","last_synced_at":"2026-02-07T23:01:18.533Z","repository":{"id":245406341,"uuid":"818124004","full_name":"Turnipdo/Docker-Spark-Setup","owner":"Turnipdo","description":"Setting up a Spark cluster in a Docker environment for improved repeatability and reliability. This project includes a simple transformation on a dataset containing approximately 31 million rows.","archived":false,"fork":false,"pushed_at":"2024-06-21T08:18:34.000Z","size":8,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-25T00:13:18.842Z","etag":null,"topics":["big-data-processing","docker-container","setup","spark"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Turnipdo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-21T06:40:50.000Z","updated_at":"2024-08-11T07:12:52.000Z","dependencies_parsed_at":"2024-06-22T01:39:31.403Z","dependency_job_id":"d96d8fc0-e645-4f78-8b13-8eb677893c51","html_url":"https://github.com/Turnipdo/Docker-Spark-Setup","commit_stats":null,"previous_names":["turnipdo/docker-spark-setup"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Turnipdo/Docker-Spark-Setup","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Turnipdo%2FDocker-Spark-Setup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Turnipdo%2FDocker-Spark-Setup/tags","releases_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/Turnipdo%2FDocker-Spark-Setup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Turnipdo%2FDocker-Spark-Setup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Turnipdo","download_url":"https://codeload.github.com/Turnipdo/Docker-Spark-Setup/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Turnipdo%2FDocker-Spark-Setup/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29211553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-07T22:58:45.823Z","status":"ssl_error","status_checked_at":"2026-02-07T22:58:45.272Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data-processing","docker-container","setup","spark"],"created_at":"2024-09-28T07:01:33.999Z","updated_at":"2026-02-07T23:01:18.516Z","avatar_url":"https://github.com/Turnipdo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Docker-Spark-Setup :atom:\nSetting up a Spark cluster in a Docker environment for improved repeatability and reliability. 
This project includes a simple transformation on a dataset containing approximately 31 million rows.\n\n## Requirements :basecamp:\n* `Docker-Desktop`\n* `Pyspark`\n* `Python`\n* `VScode`\n\n## Instructions :page_with_curl:\n### File creations + All the scripts needed in your directory\n\n* Start by creating a project directory anywhere you'd like. Mine is at the location below, and I named it :file_folder:Spark-Cluster-Setup:\n`C:\\Users\\Username\\Projects\\Spark-Cluster-Setup`\u003cbr\u003e\u003cbr\u003e\n* You will also need to create a subfolder inside this directory called :file_folder:scripts, which will hold the Python scripts that run the data transformations: `C:\\Users\\Username\\Projects\\Spark-Cluster-Setup\\scripts`\u003cbr\u003e\u003cbr\u003e\n* I've also downloaded a dataset (roughly 31 million rows) from [Kaggle](https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml); the file I used was `LI-Medium_Trans.csv`. I created another subfolder inside my scripts folder, named it :file_folder:LI-Medium_Trans_Folder and dumped the CSV file in there. 
The hierarchy looks like this: `C:\\Users\\Username\\Projects\\Spark-Cluster-Setup\\scripts\\LI-Medium_Trans_Folder\\LI-Medium_Trans.csv`\u003cbr\u003e\u003cbr\u003e \n* Once all of this is created, navigate back to the :file_folder:Spark-Cluster-Setup folder. If you're already navigating in a terminal window, great; if not, right-click, click on `Open Terminal`, and type the following command to open VScode:\u003cbr\u003e\n\n```bash\nPS C:\\Users\\Username\\Projects\\Spark-Cluster-Setup\u003e code .\n```\n* Once VScode is open, you first need to create a `Dockerfile` inside the same directory. We will use the pre-built Spark image offered by bitnami, which makes this setup super easy and understandable.\n  * The following code essentially just pulls the latest image provided by bitnami.\n  * It sets the `SPARK_MODE` environment variable, specifying that the Spark instance will run as the master node.\n  * It runs `/opt/bitnami/spark/bin/spark-class` with the `org.apache.spark.deploy.master.Master` class to start the Spark master.\n```Dockerfile\nFROM bitnami/spark:latest\n\nENV SPARK_MODE=master\n\nCMD [\"/opt/bitnami/spark/bin/spark-class\",\"org.apache.spark.deploy.master.Master\"]\n```\n* We have to create a `docker-compose.yaml` file to build a container for the master and all of its worker nodes, along with their configurations.\n  * For the spark master you will see that we've omitted `image: bitnami/spark:latest`. This is because we have `build: .`, which tells Compose to build the image from the previously mentioned `Dockerfile` in the current directory.\n  * For each and every node, there must be an associated container, and all the worker nodes depend on the spark-master.\n  * A lot of the configs in the `environment` section are basically from the [documentation](https://spark.apache.org/docs/latest/spark-standalone.html) provided by Spark.\n  * I've basically configured the master and worker nodes taking into consideration the 
limitations of my computer's performance (6 cores and 16 GB of RAM)\n  * You can also find the default `SPARK_MASTER_URL` and more information on bitnami's spark image [here](https://hub.docker.com/r/bitnami/spark).\n  * I added a volumes section in the `docker-compose.yml` file for both the Spark master and every worker node. This is crucial because it mounts the :file_folder:scripts folder where all your `.py` scripts are located. This ensures that both the master and worker nodes know where to find and execute the Python scripts.\n```yaml\nversion: '3'\nservices:\n  spark-master:\n    build: .\n    container_name: spark-master\n    hostname: spark-master\n    environment:\n      - SPARK_MODE=master\n      - SPARK_MASTER_PORT=7077\n      - SPARK_MASTER_WEBUI_PORT=8080\n      - SPARK_DAEMON_MEMORY=3g\n    volumes:\n      - ./scripts:/scripts\n    ports:\n      - \"8080:8080\"\n      - \"7077:7077\"\n\n  spark-worker-1:\n    image: bitnami/spark:latest\n    container_name: spark-worker-1\n    hostname: spark-worker-1\n    environment:\n      - SPARK_MODE=worker\n      - SPARK_MASTER_URL=spark://spark-master:7077\n      - SPARK_WORKER_WEBUI_PORT=8081\n      - SPARK_WORKER_CORES=1\n      - SPARK_WORKER_MEMORY=3G\n    volumes:\n      - ./scripts:/scripts\n    depends_on:\n      - spark-master\n    ports:\n      - \"8081:8081\"\n\n  spark-worker-2:\n    image: bitnami/spark:latest\n    container_name: spark-worker-2\n    hostname: spark-worker-2\n    environment:\n      - SPARK_MODE=worker\n      - SPARK_MASTER_URL=spark://spark-master:7077\n      - SPARK_WORKER_WEBUI_PORT=8082\n      - SPARK_WORKER_CORES=1\n      - SPARK_WORKER_MEMORY=3G\n    volumes:\n      - ./scripts:/scripts\n    depends_on:\n      - spark-master\n    ports:\n      - \"8082:8082\"\n\n  spark-worker-3:\n    image: bitnami/spark:latest\n    container_name: spark-worker-3\n    hostname: spark-worker-3\n    environment:\n      - SPARK_MODE=worker\n      - SPARK_MASTER_URL=spark://spark-master:7077\n   
   - SPARK_WORKER_WEBUI_PORT=8083\n      - SPARK_WORKER_CORES=1\n      - SPARK_WORKER_MEMORY=3G\n    volumes:\n      - ./scripts:/scripts\n    depends_on:\n      - spark-master\n    ports:\n      - \"8083:8083\"\n```\n* You can also use the following python script (I called it `test_script.py`) that will start the spark session, read the csv file and do a simple average calculation based on the Payment Formats in the csv file.\n  * I used the time module because I wanted to see how much time it takes to process 31 million rows of data (first time doing this and i'm also a noob still so I was curious)\n  * You can honestly omit the entire schema part because it worked well even if I didn't use the schema, I just wanted to try it out and practice so, apologies in advance.\n\n```python\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import avg\nfrom pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType\nimport time\n\nspark = SparkSession.builder \\\n    .appName(\"DataRead\u0026Process\") \\\n    .getOrCreate()\n\nschema = StructType([\n    StructField(\"Timestamp\", StringType(), True),\n    StructField(\"From Bank\", IntegerType(), True),\n    StructField(\"From Account\", StringType(), True),\n    StructField(\"To Bank\", IntegerType(), True),\n    StructField(\"To Account\", StringType(), True),\n    StructField(\"Amount Received\", DoubleType(), True),\n    StructField(\"Receiving Currency\", StringType(), True),\n    StructField(\"Amount Paid\", DoubleType(), True),\n    StructField(\"Payment Currency\", StringType(), True),\n    StructField(\"Payment Format\", StringType(), True)\n])\n\n\nstart_time_load = time.time()\ndf = spark.read.csv(\"/scripts/LI-Medium_Trans_Folder/LI-Medium_Trans.csv\", header=True, schema=schema)\nselected_df = df.select(\"Amount Paid\", \"Payment Format\")\nend_time_load = time.time()\n\nstart_time_transform = time.time()\navg_amt_paid = selected_df.groupBy(\"Payment 
Format\").agg(avg(\"Amount Paid\").alias(\"Avg Amount Paid\"))\navg_amt_paid.show(50, truncate=False)\nend_time_transform = time.time()\n\nload_time = end_time_load - start_time_load\ntransform_time = end_time_transform - start_time_transform\n\nprint(f\"time taken to load data: {load_time}\")\nprint(f\"time taken to transform data: {transform_time}\")\n```\n### Building the Docker image and running the Python script!\n* Once all those files are set up, make sure everything is saved, and we will now build the Docker image.\u003cbr\u003e\u003cbr\u003e\n* Open up Docker-Desktop, ensure the Docker engine is running, open a new terminal window in VScode, and use the following command:\n```bash\n$ docker-compose up --build\n[+] Building 0.1s (5/5) FINISHED                                                                                          docker:default\n =\u003e [spark-master internal] load build definition from Dockerfile                                                                   0.0s\n =\u003e =\u003e transferring dockerfile: 180B                                                                                                0.0s\n =\u003e [spark-master internal] load metadata for docker.io/bitnami/spark:latest                                                        0.0s\n =\u003e [spark-master internal] load .dockerignore                                                                                      0.0s\n =\u003e =\u003e transferring context: 2B                                                                                                     0.0s\n =\u003e CACHED [spark-master 1/1] FROM docker.io/bitnami/spark:latest                                                                   0.0s\n =\u003e [spark-master] exporting to image                                                                                               0.0s\n =\u003e =\u003e exporting layers                                                                                    
                         0.0s\n =\u003e =\u003e writing image sha256:173aa9b7301ad4cf28f237ef5b606aeab444fd9ff60248a84cb44306fd456c12                                        0.0s\n =\u003e =\u003e naming to docker.io/library/spark-cluster-setup-spark-master                                                                 0.0s\n[+] Running 5/5\n ✔ Network spark-cluster-setup_default  Created                                                                                     0.0s \n ✔ Container spark-master               Created                                                                                     0.1s \n ✔ Container spark-worker-3             Created                                                                                     0.1s \n ✔ Container spark-worker-1             Created                                                                                     0.1s \n ✔ Container spark-worker-2             Created        \n```\n* You've successfully set up your Spark cluster using Docker. Navigate to `http://localhost:8080/home` to verify that the master is up and running; to check each worker node, change 8080 to 8081, 8082, or 8083.\u003cbr\u003e\u003cbr\u003e\n* Open a new terminal in VScode and use the following commands. Executing the Python script prints a long log, so I've shortened it here to give you an idea of how it looks:\n```bash\n$ docker exec -it spark-master bash\nI have no name!@spark-master:/opt/bitnami/spark$\n\n$ cd /scripts\nI have no name!@spark-master:/scripts$\n\n$ spark-submit --master spark://spark-master:7077 /scripts/test_script.py\n24/06/21 08:04:47 INFO SparkContext: Running Spark version 3.5.1\n24/06/21 08:04:47 INFO SparkContext: OS info Linux, 5.15.153.1-microsoft-standard-WSL2, amd64\n24/06/21 08:04:47 INFO SparkContext: Java version 17.0.11\n24/06/21 08:04:47 INFO ResourceUtils: ==============================================================\n24/06/21 08:04:47 INFO ResourceUtils: No custom 
resources configured for spark.driver.\n24/06/21 08:04:47 INFO ResourceUtils: ==============================================================\n...\n24/06/21 08:05:02 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (172.18.0.5, executor 0, partition 3, PROCESS_LOCAL, 8229 bytes)\n24/06/21 08:05:02 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6492 ms on 172.18.0.5 (executor 0) (1/23)\n24/06/21 08:05:02 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (172.18.0.3, executor 1, partition 4, PROCESS_LOCAL, 8229 bytes)\n24/06/21 08:05:02 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6755 ms on 172.18.0.3 (executor 1) (2/23)\n24/06/21 08:05:02 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5) (172.18.0.4, executor 2, partition 5, PROCESS_LOCAL, 8229 bytes)\n...\n24/06/21 08:05:23 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 675 ms on 172.18.0.4 (executor 2) (22/23)\n24/06/21 08:05:24 INFO TaskSetManager: Finished task 20.0 in stage 0.0 (TID 20) in 3575 ms on 172.18.0.3 (executor 1) (23/23)\n24/06/21 08:05:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool\n...\n+--------------+--------------------+\n|Payment Format|Avg Amount Paid     |\n+--------------+--------------------+\n|ACH           |9045817.639857756   |\n|Credit Card   |74801.77192402656   |\n|Reinvestment  |2456738.869398785   |\n|Cheque        |7171603.898055333   |\n|Cash          |1.1868686948102372E7|\n|Wire          |4603325.2566971695  |\n|Bitcoin       |63.419871884172125  |\n+--------------+--------------------+\n\ntime taken to load data: 4.0996832847595215\ntime taken to transform data: 31.357645511627197\n...\n```\n* I know currency should only show two decimal places, but please spare me for now :bowtie:\u003cbr\u003e\u003cbr\u003e\n* If you want to stop the containers, all you have to do is use the following commands:\n```bash\n$ exit bash\nexit\nbash: exit: bash: numeric 
argument required\n\n$ docker compose down\n[+] Running 5/5\n ✔ Container spark-worker-1             Removed                                                                                                                                                                                                                                                      10.7s \n ✔ Container spark-worker-2             Removed                                                                                                                                                                                                                                                      11.0s \n ✔ Container spark-worker-3             Removed                                                                                                                                                                                                                                                      10.8s \n ✔ Container spark-master               Removed                                                                                                                                                                                                                                                       0.9s \n ✔ Network spark-cluster-setup_default  Removed  \n```\n* From next time on, if you want to start the container again it's simply just:\n```bash\n$ docker compose up\n[+] Running 5/5\n ✔ Network spark-cluster-setup_default  Created                                                                                                                                                                                                                                                       0.0s \n ✔ Container spark-master               Created                                                                                                                                                                                                           
                                            0.0s \n ✔ Container spark-worker-2             Created                                                                                                                                                                                                                                                       0.1s \n ✔ Container spark-worker-3             Created                                                                                                                                                                                                                                                       0.1s \n ✔ Container spark-worker-1             Created                                                                                                                                                                                                                                                       0.1s \nAttaching to spark-master, spark-worker-1, spark-worker-2, spark-worker-3\n```\n## Conclusion :electron:\nWe've not only learned how to set up a standalone Spark cluster, but also optimized it for repeatability and reliability by leveraging Docker containerization. While the bitnami pre-built image restricts extensive customization options, it allowed me to quickly establish and gain experience with this powerful big data transformation tool.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fturnipdo%2Fdocker-spark-setup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fturnipdo%2Fdocker-spark-setup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fturnipdo%2Fdocker-spark-setup/lists"}