{"id":16162452,"url":"https://github.com/sunsided/spark-atlas","last_synced_at":"2026-04-09T07:06:33.101Z","repository":{"id":197675658,"uuid":"695018599","full_name":"sunsided/spark-atlas","owner":"sunsided","description":"Spark vs. MongoDB Atlas","archived":false,"fork":false,"pushed_at":"2023-10-02T09:15:22.000Z","size":51,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-13T08:27:42.076Z","etag":null,"topics":["data-processing","docker","jupyter-notebook","mongodb","mongodb-atlas","pyspark","python","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sunsided.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-09-22T07:17:04.000Z","updated_at":"2023-10-02T09:55:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"747b6db4-d2ed-4354-adae-6d189c329905","html_url":"https://github.com/sunsided/spark-atlas","commit_stats":{"total_commits":17,"total_committers":2,"mean_commits":8.5,"dds":0.05882352941176472,"last_synced_commit":"51eeaf174409723de46f764e0afd74c515ac79c8"},"previous_names":["sunsided/spark-atlas"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sunsided%2Fspark-atlas","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sunsided%2Fspark-atlas/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sunsided%2Fspark-atlas/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/sunsided%2Fspark-atlas/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sunsided","download_url":"https://codeload.github.com/sunsided/spark-atlas/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247589835,"owners_count":20963022,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-processing","docker","jupyter-notebook","mongodb","mongodb-atlas","pyspark","python","spark"],"created_at":"2024-10-10T02:30:14.116Z","updated_at":"2025-12-30T23:06:38.486Z","avatar_url":"https://github.com/sunsided.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PySpark + MongoDB + SingleStore\n\nUse Docker Compose to start the setup:\n\n```shell\ndocker compose up\n```\n\nThis starts the following Spark services:\n\n- Spark Master (at [localhost:8090](http://localhost:8090/))\n- Spark Worker with 2 CPUs and 4 GB RAM (at [localhost:8081](http://localhost:8081/))\n- Spark Worker with 4 CPUs and 4 GB RAM (at [localhost:8082](http://localhost:8082/))\n- Spark History Server (at [localhost:18081](http://localhost:18081/))\n\nand\n\n- Jupyter Lab (at [localhost:8888](http://127.0.0.1:8888/lab?token=5f69150501c3c0c4f94f5d4ae38123e2f556777f794bf48b))\n\nOpen JupyterLab [here](http://127.0.0.1:8888/lab?token=5f69150501c3c0c4f94f5d4ae38123e2f556777f794bf48b)\nor connect to the Jupyter server at `127.0.0.1:8888` and use the following token:\n\n```\n5f69150501c3c0c4f94f5d4ae38123e2f556777f794bf48b\n```\n\nUse the [Aggregation 
Pipelines](notebooks/AggregationPipelines.ipynb) notebook\nas a starting point.\n\n## About the Dockerfile\n\nThe [Dockerfile](spark/Dockerfile) (as used in [docker-compose.yml](docker-compose.yml))\nprovides three different Docker targets, namely `master`, `worker` and `jupyter`.\nAll three targets share the same `base` image, consisting of:\n\n- [Spark 3.4.1] (Scala 2.12 + Hadoop 3.3) + PySpark 3.4.1 + [MongoDB Connector for Spark 10.2] + [SingleStore JDBC 1.1.9]\n- Ubuntu 23.04 with Java/OpenJDK 17 and Python 3.11\n\nUsing the same base image for Jupyter Lab and Spark was the only way to\nget this setup working; specifically, having only `master` and `worker` images\nand a predefined PySpark image would consistently fail, either because JARs were\nnot found or because of serialization errors when running PySpark programs.\n\n[Spark 3.4.1]: https://spark.apache.org/downloads.html\n[MongoDB Connector for Spark 10.2]: https://www.mongodb.com/docs/spark-connector/v10.2/\n[SingleStore JDBC 1.1.9]: https://github.com/memsql/S2-JDBC-Connector/releases/tag/v1.1.9","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsunsided%2Fspark-atlas","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsunsided%2Fspark-atlas","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsunsided%2Fspark-atlas/lists"}