{"id":20011365,"url":"https://github.com/mvillarrealb/docker-spark-cluster","last_synced_at":"2025-05-16T18:09:37.677Z","repository":{"id":34784540,"uuid":"150004343","full_name":"mvillarrealb/docker-spark-cluster","owner":"mvillarrealb","description":"A simple spark standalone cluster for your testing environment purposses","archived":false,"fork":false,"pushed_at":"2024-03-06T16:38:29.000Z","size":2412,"stargazers_count":571,"open_issues_count":23,"forks_count":355,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-04-12T16:59:27.919Z","etag":null,"topics":["bigdata","developer-tools","docker-compose","spark"],"latest_commit_sha":null,"homepage":null,"language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mvillarrealb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-23T16:41:45.000Z","updated_at":"2025-04-06T09:53:49.000Z","dependencies_parsed_at":"2024-12-03T23:02:36.856Z","dependency_job_id":"aa6bd6f9-887c-47c3-b74e-806244d56012","html_url":"https://github.com/mvillarrealb/docker-spark-cluster","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mvillarrealb%2Fdocker-spark-cluster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mvillarrealb%2Fdocker-spark-cluster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mvillarrealb%2Fdocker-spark-cluster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/mvillarrealb%2Fdocker-spark-cluster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mvillarrealb","download_url":"https://codeload.github.com/mvillarrealb/docker-spark-cluster/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254582907,"owners_count":22095518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","developer-tools","docker-compose","spark"],"created_at":"2024-11-13T07:25:33.938Z","updated_at":"2025-05-16T18:09:37.631Z","avatar_url":"https://github.com/mvillarrealb.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark Cluster with Docker \u0026 docker-compose (2021 ver.)\n\n# General\n\nA simple Spark standalone cluster for testing purposes. 
Your Spark development environment is just a *docker-compose up* away.\n\nDocker Compose will create the following containers:\n\nContainer|Exposed ports\n---|---\nspark-master|9090 7077\nspark-worker-1|9091\nspark-worker-2|9092\ndemo-database|5432\n\n# Installation\n\nThe following steps will get your Spark cluster's containers running.\n\n## Prerequisites\n\n* Docker installed\n\n* Docker Compose installed\n\n## Build the image\n\n\n```sh\ndocker build -t cluster-apache-spark:3.0.2 .\n```\n\n## Run the docker-compose\n\nThe final step to create your test cluster is to run the compose file:\n\n```sh\ndocker-compose up -d\n```\n\n## Validate your cluster\n\nValidate your cluster by accessing the Spark UI at each worker and master URL.\n\n### Spark Master\n\nhttp://localhost:9090/\n\n![alt text](docs/spark-master.png \"Spark master UI\")\n\n### Spark Worker 1\n\nhttp://localhost:9091/\n\n![alt text](docs/spark-worker-1.png \"Spark worker 1 UI\")\n\n### Spark Worker 2\n\nhttp://localhost:9092/\n\n![alt text](docs/spark-worker-2.png \"Spark worker 2 UI\")\n\n\n# Resource Allocation\n\nThis cluster ships with two workers and one Spark master, each with its own resource allocation (basically RAM \u0026 CPU cores).\n\n* The default CPU core allocation for each Spark worker is 1 core.\n\n* The default RAM for each Spark worker is 1024 MB.\n\n* The default RAM allocation for Spark executors is 256 MB.\n\n* The default RAM allocation for the Spark driver is 128 MB.\n\n* To modify these allocations, edit the env/spark-worker.sh file.\n\n# Bound Volumes\n\nTo make running apps easier, I've shipped two volume mounts, described in the following table:\n\nHost Mount|Container Mount|Purpose\n---|---|---\napps|/opt/spark-apps|Makes your app's jars available on all workers \u0026 master\ndata|/opt/spark-data|Makes your app's data available on all workers \u0026 master\n\nThis is basically a 
dummy DFS created from Docker volumes... (maybe not...)\n\n# Run Sample Applications\n\n\n## NY Bus Stops Data [Pyspark]\n\nThis program loads archived data from [MTA Bus Time](http://web.mta.info/developers/MTA-Bus-Time-historical-data.html) and applies basic filters using Spark SQL. The results are persisted into a PostgreSQL table.\n\nThe loaded table will have the following structure:\n\nlatitude|longitude|time_received|vehicle_id|distance_along_trip|inferred_direction_id|inferred_phase|inferred_route_id|inferred_trip_id|next_scheduled_stop_distance|next_scheduled_stop_id|report_hour|report_date\n---|---|---|---|---|---|---|---|---|---|---|---|---\n40.668602|-73.986697|2014-08-01 04:00:01|469|4135.34710710144|1|IN_PROGRESS|MTA NYCT_B63|MTA NYCT_JG_C4-Weekday-141500_B63_123|2.63183804205619|MTA_305423|2014-08-01 04:00:00|2014-08-01\n\nTo submit the app, connect to one of the workers or the master and execute:\n\n```sh\n/opt/spark/bin/spark-submit --master spark://spark-master:7077 \\\n--jars /opt/spark-apps/postgresql-42.2.22.jar \\\n--driver-memory 1G \\\n--executor-memory 1G \\\n/opt/spark-apps/main.py\n```\n\n![alt text](./articles/images/pyspark-demo.png \"Spark UI with pyspark program running\")\n\n## MTA Bus Analytics [Scala]\n\nThis program takes the archived data from [MTA Bus Time](http://web.mta.info/developers/MTA-Bus-Time-historical-data.html) and computes some aggregations on it. The calculated results are persisted to PostgreSQL tables.\n\nEach persisted table corresponds to a particular aggregation:\n\nTable|Aggregation\n---|---\nday_summary|A summary of vehicles reporting, stops visited, average speed and distance traveled (all vehicles)\nspeed_excesses|Speed excesses calculated in a 5-minute window\naverage_speed|Average speed by vehicle\ndistance_traveled|Total distance traveled by vehicle\n\n\nTo submit the app, connect to one of the workers or the master and execute:\n\n```sh\n/opt/spark/bin/spark-submit --deploy-mode cluster \\\n--master 
spark://spark-master:7077 \\\n--total-executor-cores 1 \\\n--class mta.processing.MTAStatisticsApp \\\n--driver-memory 1G \\\n--executor-memory 1G \\\n--jars /opt/spark-apps/postgresql-42.2.22.jar \\\n--conf spark.driver.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \\\n--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \\\n/opt/spark-apps/mta-processing.jar\n```\n\nIn the Spark UI you will see both a driver program and an executor program running (with Scala we can use deploy-mode cluster).\n\n![alt text](./articles/images/stats-app.png \"Spark UI with scala program running\")\n\n\n# Summary\n\n* We built the Docker image needed to run the Spark master and worker containers.\n\n* We created a Spark standalone cluster with 2 worker nodes and 1 master node using Docker \u0026 docker-compose.\n\n* We copied the resources necessary to run the demo applications.\n\n* We ran a distributed application at home (you just need enough CPU cores and RAM to do so).\n\n# Why a standalone cluster?\n\n* This is intended for testing: basically a way of running distributed Spark apps on your laptop or desktop.\n\n* This is useful for building CI/CD pipelines for your Spark apps (a really difficult and hot topic).\n\n# Steps to connect and use a pyspark shell interactively\n\n* Follow the steps to run the docker-compose file. If needed, you can scale this down to 1 worker. 
\n\n```sh\ndocker-compose up --scale spark-worker=1\ndocker exec -it docker-spark-cluster_spark-worker_1 bash\napt update\napt install python3-pip\npip3 install pyspark\npyspark\n```\n\n# What's left to do?\n\n* Right now, running applications in deploy-mode cluster requires specifying an arbitrary driver port.\n\n* The spark-submit entry in start-spark.sh is unimplemented; the submit used in the demos can be triggered from any worker.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmvillarrealb%2Fdocker-spark-cluster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmvillarrealb%2Fdocker-spark-cluster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmvillarrealb%2Fdocker-spark-cluster/lists"}