{"id":13399761,"url":"https://github.com/deanwampler/JustEnoughScalaForSpark","last_synced_at":"2025-03-14T04:31:56.914Z","repository":{"id":42581169,"uuid":"55907370","full_name":"deanwampler/JustEnoughScalaForSpark","owner":"deanwampler","description":"A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.","archived":false,"fork":false,"pushed_at":"2022-07-09T21:16:34.000Z","size":6464,"stargazers_count":677,"open_issues_count":1,"forks_count":208,"subscribers_count":54,"default_branch":"master","last_synced_at":"2025-03-11T04:52:20.444Z","etag":null,"topics":["jupyter","scala","spark","tutorial"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deanwampler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-04-10T15:37:49.000Z","updated_at":"2025-02-10T17:48:58.000Z","dependencies_parsed_at":"2022-09-11T07:12:25.092Z","dependency_job_id":null,"html_url":"https://github.com/deanwampler/JustEnoughScalaForSpark","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deanwampler%2FJustEnoughScalaForSpark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deanwampler%2FJustEnoughScalaForSpark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deanwampler%2FJustEnoughScalaForSpark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deanwampler%2FJustEnoughScalaForSpark/manifests","owner_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/owners/deanwampler","download_url":"https://codeload.github.com/deanwampler/JustEnoughScalaForSpark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243526885,"owners_count":20305109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jupyter","scala","spark","tutorial"],"created_at":"2024-07-30T19:00:42.408Z","updated_at":"2025-03-14T04:31:56.892Z","avatar_url":"https://github.com/deanwampler.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"# Just Enough Scala for Spark\n\n[![Join the chat at https://gitter.im/deanwampler/JustEnoughScalaForSpark](https://badges.gitter.im/deanwampler/JustEnoughScalaForSpark.svg)](https://gitter.im/deanwampler/JustEnoughScalaForSpark?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\n* Added instructions for using Podman, fixed image version, Spark 3.2.1, July, 2022\n* Minor updates and bug fixes, April, 2021\n* Spark Summit San Francisco, June 5, 2017\n* Strata London, May 23, 2017\n* Strata San Jose, March 14, 2017\n* Strata Singapore, December 6, 2016\n* Strata NYC, September 27, 2016\n\n[Dean Wampler, Ph.D.](mailto:deanwampler@gmail.com)\u003cbr/\u003e\n[Chaoran Yu](https://github.com/yuchaoran2011) taught this tutorial at a few conferences, too.\n\n\u003e **NOTE:** It appears the `jupyter/all-spark-notebook` images are no longer built with Scala support, as of July 2022. 
These instructions use the last image released with Scala support, also supporting Spark 3.2.1.\n\nFrançois Sarradin (@fsarradin) and colleagues translated this tutorial to French. You can find it [here](https://github.com/univalence/CeQuilFautDeScalaPourSpark).\n\nThis tutorial now uses a Docker image with Jupyter and Spark, for a much more robust, easy to use, and \"industry standard\" experience.\n\nThis tutorial covers the most important features and idioms of [Scala](http://scala-lang.org/) you need to use [Apache Spark's](http://spark.apache.org/) Scala APIs. Because Spark is written in Scala, Spark is driving interest in Scala, especially for _data engineers_. _Data scientists_ sometimes use Scala, but most use Python or R.\n\n\u003e **Tips:**\n\u003e 1. If you're taking this tutorial at a conference, it's **essential** that you set up the tutorial ahead of time, as there won't be time during the session to work on any problems.\n\u003e 2. Use the [Gitter chat room](https://gitter.im/deanwampler/JustEnoughScalaForSpark) to ask for help or post issues to the [GitHub repo](https://github.com/deanwampler/JustEnoughScalaForSpark/issues) if you have trouble installing or running the tutorial.\n\u003e 3. If all else fails, there is a PDF of the tutorial in the `notebooks` directory.\n\n## Prerequisites\n\nI'll assume you have prior programming experience, in any language. Some familiarity with Java is assumed, but if you don't know Java, you should be able to search for explanations for anything unfamiliar.\n\nThis isn't an introduction to Spark itself. 
Some prior exposure to Spark is helpful, but I'll briefly explain most Spark concepts we'll encounter, too.\n\nThroughout, you'll find links to more information on important topics.\n\n## Download the Tutorial\n\nBegin by cloning or downloading the tutorial GitHub project [github.com/deanwampler/JustEnoughScalaForSpark](https://github.com/deanwampler/JustEnoughScalaForSpark).\n\n## About Jupyter with Spark\n\nThis tutorial uses a [Docker](https://docker.com) image that combines the popular [Jupyter](http://jupyter.org/) notebook environment with all the tools you need to run Spark, including the Scala language, called the [All Spark Notebook](https://hub.docker.com/r/jupyter/all-spark-notebook/). It bundles [Apache Toree](https://toree.apache.org/) to provide Spark and Scala access. The [webpage](https://hub.docker.com/r/jupyter/all-spark-notebook/) for this Docker image discusses useful information like using Python as well as Scala, user authentication topics, running your Spark jobs on clusters, rather than local mode, etc.\n\nThere are other notebook options you might investigate for your needs:\n\n**Open source:**\n\n* [Polynote](https://polynote.org/) - A cross-language notebook environment with built-in Scala support. Developed by Netflix.\n* [Jupyter](https://ipython.org/) + [BeakerX](http://beakerx.com/) - a powerful set of extensions for Jupyter.\n* [Zeppelin](http://zeppelin-project.org/) - a popular tool in big data environments\n\n**Commercial:**\n\n* [Databricks](https://databricks.com/) - a feature-rich, commercial, cloud-based service from the creators of Spark\n\n## Running the Tutorial\n\nYou need to install Docker _or_ the popular replacement, [Podman](https://podman.io/).\n\n### Using Docker\n\nFollow the [Docker installation instructions](https://www.docker.com/products/overview). The _community edition_ is sufficient. 
Then start the docker daemon on your machine, as instructed.\n\n### Using Podman\n\nFollow the [Podman installation instructions](https://podman.io/getting-started/installation).\n\n\u003e **NOTE:** I don't have access to a Windows machine so I have not tested using Podman on Windows with this tutorial. _Caveat emptor_.\n\nA few other steps are recommended or required.\n\nFirst, if you plan to use `podman` instead of `docker`, it's convenient to alias the `docker` commands to corresponding `podman` commands:\n\n```shell\nalias docker=podman\nalias docker-compose=podman-compose\n```\n\nSimilar commands can be used for Windows. If you choose not to do this, just substitute `podman` for `docker` in the commands, shell scripts, and bat scripts discussed below.\n\nAlso, you'll need to initialize a Podman virtual machine. First, see if the default machine was already created by the installer. If so, we'll replace it with a \"beefier\" one.\n\n```shell\npodman system connection list     # List the VMs defined\n```\n\nDo you see the following output?\n\n```\nName                         URI                                                         Identity                                        Default\npodman-machine-default       ssh://core@localhost:53933/run/user/501/podman/podman.sock  /Users/.../.ssh/podman-machine-default  true\npodman-machine-default-root  ssh://root@localhost:53933/run/podman/podman.sock           /Users/.../.ssh/podman-machine-default  false\n```\n\nIn this case, run this `rm` command to delete it:\n\n```shell\npodman machine rm podman-machine-default\n```\n\nNow use the following commands to create the `podman-machine-default` with more resources and then start it running:\n\n```shell\npodman machine init --memory=4000 --cpus=4 -v $HOME:$HOME\npodman machine start\n```\n\n\u003e **Tip:** Run `podman machine stop` when you are finished with the tutorial.\n\nWhen you start the machine, you might see the following output:\n\n```\n$ podman machine 
start\nStarting machine \"podman-machine-default\"\nWaiting for VM ...\nMounting volume... /Users/...:/Users/...\n\nThis machine is currently configured in rootless mode. If your containers\nrequire root permissions (e.g. ports \u003c 1024), or if you run into compatibility\nissues with non-podman clients, you can switch using the following command:\n\n    podman machine set --rootful\n\nAPI forwarding listening on: /Users/.../.local/share/containers/podman/machine/podman-machine-default/podman.sock\n\nThe system helper service is not installed; the default Docker API socket\naddress can't be used by podman. If you would like to install it run the\nfollowing commands:\n\n    sudo /opt/homebrew/Cellar/podman/4.1.1/bin/podman-mac-helper install\n    podman machine stop; podman machine start\n\nYou can still connect Docker API clients by setting DOCKER_HOST using the\nfollowing command in your terminal session:\n\n    export DOCKER_HOST='unix:///Users/.../.local/share/containers/podman/machine/podman-machine-default/podman.sock'\n\nMachine \"podman-machine-default\" started successfully\n```\n\nDon't use the `--rootful` option, at least for this tutorial, as it is incompatible with a flag passed to `podman run` below, `--userns=keep-id`, which is required to be able to write updates to the notebooks.\n\nIf you are on a Mac, the `podman-mac-helper` may be useful:\n\n```shell\npodman machine stop\nsudo /opt/homebrew/Cellar/podman/4.1.1/bin/podman-mac-helper install\npodman machine start\n```\n\nSetting `DOCKER_HOST` is done for you in the `run.sh` script (discussed below) for MacOS and Linux. For Windows, you may need to set this in your environment. This is not done in `run-podman.bat`.\n\n\u003e **Note:** I don't have access to a Windows machine so I have not tested using Podman on Windows.\n\n## Running the Docker Image\n\nUse these steps for both `docker` and `podman`.\n\nIt's important to follow the next steps carefully. 
We're going to mount the working directory in the running container so it's accessible inside the running container in a convenient place. We'll need it for our notebook, our data, etc.\n\n* In the same terminal window, change to the directory where you expanded the tutorial project or cloned the repo.\n* Run the following command to download and run the Docker image: \n    * `run.sh` for both `docker` and `podman` on MacOS and Linux\n    * `run-docker.bat` to use `docker` on Windows\n    * `run-podman.bat` to use `podman` on Windows\n\nThe MacOS and Linux `run.sh` script executes these commands for `podman`:\n\n```shell\nexport DOCKER_HOST=\"unix:///$HOME/.local/share/containers/podman/machine/podman-machine-default/podman.sock\"\npodman run -it --rm \\\n  -p 8888:8888 -p 4040:4040 -p 4041:4041 -p 4042:4042 \\\n  --cpus=3.0 --memory=3500M \\\n  --userns=keep-id \\\n  -v \"$PWD\":/home/jovyan/work \\\n  jupyter/all-spark-notebook:spark-3.2.0 \\\n  \"$@\"\n```\n\nThe first command (note the line continuation...) checks if you are using podman and sets the `DOCKER_HOST` environment variable, which I found to be necessary. You might try just running the `docker run` command to see if you really need this.\n\nThe Windows `run.bat` command is similar, but uses Windows conventions and does _not_ attempt to set `DOCKER_HOST`.\n\nThe `-p 8888:8888 -p 4040:4040 -p 4041:4041 -p 4042:4042` arguments \"tunnel\" ports 8888 and 4040-4042 out of the container to your local environment, so you can get to the Jupyter UI at port 8888 and the Spark driver UIs at 4040-4042. \n\n\u003e **Note:** Here we use just one notebook, so tunneling 4040 is all you will probably need. However, if you create up to two more Spark notebooks concurrently, the second will select available port 4041 and the third will use 4042. Hence, the scripts tunnel these ports for your convenience.\n\nThe `--cpus=... 
--memory=...` arguments were added because the notebook \"kernel\" is prone to crashing with the default, smaller values. Edit to taste to see what works best for you. For example, previously we used `--cpus=2.0 --memory=2000M`, but machines are bigger now :) Also, it will help conserve resources to keep only one Spark notebook open at a time.\n\nThe `--userns=keep-id` flag appears to be necessary to allow you to save updates to the notebook. Otherwise, you get permissions errors. See also the [Troubleshooting](#troubleshooting) section below and [this blog post](https://www.redhat.com/sysadmin/debug-rootless-podman-mounted-volumes) for more information on why this is necessary. This flag isn't used for `docker`, because it doesn't appear necessary, although it may be \"harmless\".\n\nThe `-v $PWD:/home/jovyan/work` argument tells Docker to mount the current working directory inside the container as `/home/jovyan/work`, where `/home/jovyan` is the default user's home directory. When you open the notebook UI (discussed below), this is the top-level folder you will see.\n\nWe are using the image tagged `spark-3.2.0`, which actually supports Spark 3.2.1. Unfortunately, the `jupyter/all-spark-notebook` image builds [dropped Scala support in July 2022](https://github.com/deanwampler/JustEnoughScalaForSpark/issues/11).  :(\n\nFinally, you can pass other arguments to `run.sh`, which are passed to `docker` or `podman` as flags or a command to run. For example, passing `ls -al work/notebooks` will show you what the file permissions look like for the notebooks from inside the container.\n\nOkay! 
After starting `run.sh`, you should see output similar to the following:\n\n```shell\nUnable to find image 'jupyter/all-spark-notebook:spark-3.2.0' locally\nlatest: Pulling from jupyter/all-spark-notebook\n...\n    Copy/paste this URL into your browser when you connect for the first time,\n    to login with a token:\n        http://localhost:8888/?token=...\n```\n\nNow copy the full `localhost:8888` URL shown and paste it into a browser window.\n\n\u003e **Tip:** Your terminal might let you ⌘-click or CTRL-click the URL to open it in a browser.\n\n\u003e **Warning:** When you quit the container at the end of the tutorial, all your changes will be lost, unless they are in or under the local directory that we mounted. To save notebooks you defined in other locations, export them using the _File \u003e Download as \u003e Notebook_ menu item in the toolbar.\n\n## Running the Tutorial\n\n\u003e **Warning:** It appears that the Jupyter _magics_ in the notebook no longer work. I have added comments and workarounds.\n\nNow we can load the tutorial. Once you open the Jupyter UI, you'll see the `work` folder listed. Click once to open it, then open `notebooks`, then click on the tutorial notebook, `JustEnoughScalaForSpark.ipynb`. It will open in a new tab. (The PDF is a printout of the notebook, in case you have trouble running the notebook itself.)\n\nYou'll notice there is a box around the first \"cell\". This cell has one line of source code, `println(\"Hello World!\")`. Above this cell is a toolbar with a button that has a right-pointing arrow and the word _run_. Click that button to run this code cell. Or, use the menu item _Cell \u003e Run Cells_.\n\nAfter many seconds, once initialization has completed, it will print the output, `Hello World!`, just below the input text field.\n\nDo the same thing for the next box. 
It should print `[merrywivesofwindsor, twelfthnight, midsummersnightsdream, loveslabourslost, asyoulikeit, comedyoferrors, muchadoaboutnothing, tamingoftheshrew]`, the contents of the `/home/jovyan/work/data/shakespeare` folder, the texts for several of Shakespeare's plays. We'll use these files as data.\n\n\u003e **Warning:** If you see `[]` or `null` printed instead, the mounting of the current working directory did not work correctly when the container was started. In the terminal window, use `control-c` to exit from the Docker container, make sure you are in the root directory of the project (`data` and `notebooks` should be subdirectories), restart the Docker image, and make sure you enter the command exactly as shown.\n\nIf these steps worked, you're done setting up the tutorial!\n\n## Troubleshooting\n\n### \"Error Response\" on Windows\n\nWhen using Docker on Windows, you may get the following error: `C:\\Program Files\\Docker\\Docker\\Resources\\bin\\docker.exe: Error response from daemon: D: drive is not shared. Please share it in Docker for Windows Settings.` If so, do the following. On your tray, next to your clock, right-click on Docker, then click on Settings. You'll see the _Shared Drives_. Mark your drive and hit apply. See [this Docker forum thread](https://forums.docker.com/t/cannot-share-drive-in-windows-10/28798/5) for more tips.\n\n### No Permissions to Save Notebook Updates (Podman)\n\nThis is why the `--userns=keep-id` flag is used when running with `podman`. It should prevent the permissions error you would otherwise get when you save your work. Please [open an issue](https://github.com/deanwampler/JustEnoughScalaForSpark/issues) if you encounter this problem.\n\nWhile investigating this issue previously, before discovering the `--userns=keep-id` flag, I found the following two (flawed) workarounds, both of which start with _File \u003e Save As_:\n\n1. Save the file to the same `notebooks` directory: This _appears_ to fail. 
You get an error dialog that the file couldn't be written, but in fact it was written. However, this only works once for a given target file. It won't be updated again and you still can't just save changes directly to it.\n2. Save and write the notebook to \"root\" folder as shown in the Jupyter UI, which is actually the `/home/jovyan` home directory in the container. That allows you to save changes and use features like _print_. _However_ any changes to the notebook in this directory **will be lost when you quit**, so you'll have to use _File \u003e Save Notebook As_ to download the latest version to your machine.\n\n\u003ca name=\"getting-help\"\u003e\u003c/a\u003e\n## Getting Help\n\nIf you're having problems, use the [Gitter chat room](https://gitter.im/deanwampler/JustEnoughScalaForSpark) to ask for help. If you are reasonably certain you have found a bug, post an issue to the [GitHub repo](https://github.com/deanwampler/JustEnoughScalaForSpark/issues). Recall that the `notebooks` directory also has a PDF of the notebook that you can read when the notebook won't work for some reason.\n\n## What's Next?\n\nYou are now ready to go through the tutorial.\n\nDon't want to run Spark Notebook to learn the material? A PDF printout of the notebook can also be found in the `notebooks` directory.\n\nFeedback, bug reports, and pull requests are [welcome](https://github.com/deanwampler/JustEnoughScalaForSpark)! Thanks.\n\nDean Wampler\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeanwampler%2FJustEnoughScalaForSpark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeanwampler%2FJustEnoughScalaForSpark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeanwampler%2FJustEnoughScalaForSpark/lists"}