{"id":15069479,"url":"https://github.com/rubenafo/docker-spark-cluster","last_synced_at":"2025-04-10T16:52:38.859Z","repository":{"id":66361636,"uuid":"128363324","full_name":"rubenafo/docker-spark-cluster","owner":"rubenafo","description":"A Spark cluster setup running on Docker containers","archived":false,"fork":false,"pushed_at":"2019-12-26T10:40:15.000Z","size":35,"stargazers_count":60,"open_issues_count":0,"forks_count":42,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-03-24T14:44:46.007Z","etag":null,"topics":["big-data","docker","docker-image","hadoop","openjdk","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rubenafo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-06T07:24:01.000Z","updated_at":"2024-02-29T04:50:46.000Z","dependencies_parsed_at":"2023-02-21T18:15:14.006Z","dependency_job_id":null,"html_url":"https://github.com/rubenafo/docker-spark-cluster","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rubenafo%2Fdocker-spark-cluster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rubenafo%2Fdocker-spark-cluster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rubenafo%2Fdocker-spark-cluster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rubenafo%2Fdocker-spark-cluster/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/rubenafo","download_url":"https://codeload.github.com/rubenafo/docker-spark-cluster/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248256367,"owners_count":21073510,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","docker","docker-image","hadoop","openjdk","scala","spark"],"created_at":"2024-09-25T01:42:44.045Z","updated_at":"2025-04-10T16:52:38.836Z","avatar_url":"https://github.com/rubenafo.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# docker-spark-cluster\nBuild your own Spark cluster setup in Docker.      \nA multinode Spark installation where each node of the network runs in its own separate Docker container.   \nThe installation takes care of the Hadoop \u0026 Spark configuration, providing:\n1) a Debian image with Scala and Java (scalabase image)\n2) four fully configured Spark nodes running on Hadoop (sparkbase image):\n    * nodemaster (master node)\n    * node2      (slave)\n    * node3      (slave)\n    * node4      (slave)\n\n## Motivation\nYou can run Spark in a (boring) standalone setup or create your own network to hold a full cluster setup inside Docker instead.   
\nI find the latter much more fun:\n* you can experiment with a more realistic network setup\n* tweak the nodes' configuration\n* simulate scaling, downtime and rebalancing by automagically adding nodes to or removing them from the network   \n\nThere is a Medium article related to this: https://medium.com/@rubenafo/running-a-spark-cluster-setup-in-docker-containers-573c45cceabf\n\n## Installation\n1) Clone this repository\n2) cd scalabase\n3) ./build.sh    # This builds the base Java+Scala Debian image from openjdk9\n4) cd ../spark\n5) ./build.sh    # This builds the sparkbase image\n6) Run ./cluster.sh deploy\n7) When the script finishes, it displays the Hadoop and Spark admin URLs:\n    * Hadoop info @ nodemaster: http://172.18.1.1:8088/cluster\n    * Spark info @ nodemaster : http://172.18.1.1:8080/\n    * DFS Health @ nodemaster : http://172.18.1.1:9870/dfshealth.html\n\n## Options\n```bash\ncluster.sh stop   # Stop the cluster\ncluster.sh start  # Start the cluster\ncluster.sh info   # Show handy URLs of the running cluster\n\n# Warning! This will remove everything from HDFS\ncluster.sh deploy # Format the cluster and deploy images again\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frubenafo%2Fdocker-spark-cluster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frubenafo%2Fdocker-spark-cluster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frubenafo%2Fdocker-spark-cluster/lists"}