{"id":19746363,"url":"https://github.com/astrolabsoftware/spark-kernel-nersc","last_synced_at":"2026-04-30T03:32:46.944Z","repository":{"id":92459606,"uuid":"147824352","full_name":"astrolabsoftware/spark-kernel-nersc","owner":"astrolabsoftware","description":" Create custom kernels for using pyspark notebooks at NERSC ","archived":false,"fork":false,"pushed_at":"2018-11-12T22:21:11.000Z","size":1329,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-07-02T15:50:01.025Z","etag":null,"topics":["jupyter-notebook","nersc","pyspark"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/astrolabsoftware.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-07T13:03:54.000Z","updated_at":"2018-11-12T22:21:12.000Z","dependencies_parsed_at":"2023-06-02T12:45:16.232Z","dependency_job_id":null,"html_url":"https://github.com/astrolabsoftware/spark-kernel-nersc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/astrolabsoftware/spark-kernel-nersc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astrolabsoftware%2Fspark-kernel-nersc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astrolabsoftware%2Fspark-kernel-nersc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astrolabsoftware%2Fspark-kernel-nersc/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/astrolabsoftware%2Fspark-kernel-nersc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/astrolabsoftware","download_url":"https://codeload.github.com/astrolabsoftware/spark-kernel-nersc/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astrolabsoftware%2Fspark-kernel-nersc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32453746,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T22:27:22.272Z","status":"online","status_checked_at":"2026-04-30T02:00:05.929Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jupyter-notebook","nersc","pyspark"],"created_at":"2024-11-12T02:14:22.304Z","updated_at":"2026-04-30T03:32:46.906Z","avatar_url":"https://github.com/astrolabsoftware.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Apache Spark kernel for Cori@NERSC\n\n## The kernels\n\nLog on Cori@NERSC, and run one of the scripts (`std-kernel.py` or `desc-kernel.py`) to create a Jupyter kernel for using pyspark in notebooks at NERSC.\n\nThe kernel will be stored at `$HOME/.local/share/jupyter/kernels/`. 
More information on how to use Apache Spark at NERSC can be found at this [page](http://www.nersc.gov/users/data-analytics/data-analytics-2/spark-distributed-analytic-framework/).\n\n## Apache Spark kernel for DESC members (recommended)\n\nCreate a kernel with the DESC Python environment (based on `desc-python`) and Apache Spark. On Cori, just launch:\n\n```\npython desc-kernel.py \\\n  -kernelname desc-pyspark \\\n  -pyspark_args \"--master local[4] \\\n  --driver-memory 32g --executor-memory 32g \\\n  --packages com.github.astrolabsoftware:spark-fits_2.11:0.7.1\"\n```\n\nAnd then select the kernel `desc-pyspark` in the JupyterLab interface.\nNote that the folders\n\n- `/global/cscratch1/sd/\u003cuser\u003e/tmpfiles`\n- `/global/cscratch1/sd/\u003cuser\u003e/spark/event_logs`\n\nwill be created, if they do not exist, to store temporary files and logs used by Spark.\n\n**Note** We provide a custom installation of the latest Spark version (2.3.2). This is maintained by me (Julien Peloton) at NERSC. If you encounter problems, let me know!\n\n## Apache Spark kernel alone\n\nKernels for running Apache Spark at NERSC are created using `std-kernel.py`.\n\n### Apache Spark version 2.3.0+ (recommended for beginners dev)\n\nFor Spark version 2.3.0+, Spark runs inside Shifter.\nNote that the directory `/global/cscratch1/sd/\u003cuser\u003e/tmpfiles` will be created to store temporary files used by Spark.\n\n### Custom shifter images (recommended for experienced users)\n\nFor Spark version 2.3.0+, Spark runs inside [Shifter](https://www.nersc.gov/research-and-development/user-defined-images/) (Docker for HPC). 
Since you are inside an image,\nyou do not automatically have access to your user-defined environment.\nTherefore you might want to create your own Spark Shifter image, based on the one NERSC\nprovides, but with the additional packages you need installed.\nThe basic information on how to create a Shifter image at NERSC can be found [here](https://docs.nersc.gov/development/shifter/how-to-use/). The very first line\nof your Dockerfile just needs to be:\n\n```\nFROM nersc/spark-2.3.0:v1\n\n# put here all the packages and dependencies\n# you need to have to run your pyspark jobs\n```\n\n### Apache Spark version \u003c= 2.1.0 (old)\n\nFor Spark version \u003c= 2.1.0, Spark is launched inside your environment.\nTherefore a startup script will be created in addition to the kernel,\nin order to load the Spark module and launch a Spark cluster before launching the notebook:\n\n```bash\n#!/bin/bash\nmodule load spark/\u003cversion\u003e\nstart-all.sh\n/usr/common/software/python/3.5-anaconda/bin/python -m ipykernel $@\n```\n\nThe startup scripts will be stored with the kernel at `$HOME/.local/share/jupyter/kernels/`.\nWe support only Python 3.5 for the moment.\n\n## Working with Apache Spark\n\n### Pyspark arguments\n\nThe most common pyspark arguments include:\n\n- `--master local[ncpu]`: the number of CPUs to use.\n- `--conf spark.eventLog.enabled=true` `--conf spark.eventLog.dir=\u003cfile:/dir\u003e` `--conf spark.history.fs.logDirectory=\u003cfile:/dir\u003e`: store the logs. By default Spark will put event logs in `file://$SCRATCH/spark/spark_event_logs`, and you will need to create this directory the very first time you start up Spark.\n- `--packages ...`: Any package you want to use. 
For example, you can try out the great [spark-fits](https://github.com/astrolabsoftware/spark-fits) connector using `--packages com.github.astrolabsoftware:spark-fits_2.11:0.7.1`!\n\n### Access the logs from the Spark UI\n\nOnce your job has finished, you can access the logs via the Spark history UI. Log on to Cori, and load the spark/history module:\n\n```\nmodule load spark/history\n```\n\nThen go to the folder where the logs are stored, launch the history server, and follow the URL:\n\n```\n# This is the default location\ncd $SCRATCH/spark/spark_event_logs\n./run_history_server.sh\n```\n\nOnce you are done, just stop the server by executing `./run_history_server.sh --stop`.\n\n### Note concerning resources\n\n    The large-memory login node used by https://jupyter-dev.nersc.gov/\n    is a shared resource, so please be careful not to use too many CPUs\n    or too much memory.\n\n    That means avoid using `--master local[*]` in your kernel, but limit\n    the resources to a few cores. 
Typically `--master local[4]` is enough for\n    prototyping a program.\n\n## Use pyspark in JupyterLab\n\nConnect to https://jupyter-dev.nersc.gov/hub/login and create a notebook with\nthe kernel you just created:\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"600\" src=\"https://github.com/astrolabsoftware/spark-kernel-nersc/raw/master/pic/load_kernel.png\"/\u003e \u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"600\" src=\"https://github.com/astrolabsoftware/spark-kernel-nersc/raw/master/pic/spark_notebook.png\"/\u003e \u003c/p\u003e\n\n## Known issue\n\nWhen switching kernels, and re-running a notebook, we often get the following error:\n\n```\nPy4JJavaError                             Traceback (most recent call last)\n/usr/local/bin/spark-2.3.0/python/pyspark/sql/utils.py in deco(*a, **kw)\n     62         try:\n---\u003e 63             return f(*a, **kw)\n     64         except py4j.protocol.Py4JJavaError as e:\n\n/usr/local/bin/spark-2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py in\nget_return_value(answer, gateway_client, target_id, name)\n    319                     \"An error occurred while calling {0}{1}{2}.\\n\".\n--\u003e 320                     format(target_id, \".\", name), value)\n    321             else:\n\nPy4JJavaError: An error occurred while calling o83.load.\n: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException:\njava.lang.RuntimeException: Unable to instantiate\norg.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;\n\tat\norg.apache.spark.sql.hive.HiveExternalCatalog.withClient(\nHiveExternalCatalog.scala:106)\n\tat\n\n\t... (long... 
very long)\n\nAnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException:\nUnable to instantiate\norg.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'\n```\n\nJust go to the folder where the notebook is running, and delete the temporary folder:\n\n```\nrm -r metastore_db\n```\n\nThen restart your kernel, and all should be fine.\n\n## Thanks to\n\n- The NERSC consulting and support team for their great help!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrolabsoftware%2Fspark-kernel-nersc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrolabsoftware%2Fspark-kernel-nersc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrolabsoftware%2Fspark-kernel-nersc/lists"}