{"id":13399754,"url":"https://github.com/deanwampler/spark-scala-tutorial","last_synced_at":"2025-05-16T09:04:20.992Z","repository":{"id":16602444,"uuid":"19357036","full_name":"deanwampler/spark-scala-tutorial","owner":"deanwampler","description":"A free tutorial for Apache Spark.","archived":false,"fork":false,"pushed_at":"2020-12-09T14:09:05.000Z","size":13313,"stargazers_count":989,"open_issues_count":3,"forks_count":434,"subscribers_count":77,"default_branch":"master","last_synced_at":"2025-05-12T21:52:20.457Z","etag":null,"topics":["jupyter","scala","spark","tutorial"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deanwampler.png","metadata":{"files":{"readme":"README.markdown","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-05-01T20:33:48.000Z","updated_at":"2025-04-20T14:09:46.000Z","dependencies_parsed_at":"2022-07-13T11:40:34.406Z","dependency_job_id":null,"html_url":"https://github.com/deanwampler/spark-scala-tutorial","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deanwampler%2Fspark-scala-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deanwampler%2Fspark-scala-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deanwampler%2Fspark-scala-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deanwampler%2Fspark-scala-tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deanwampler","download_url":"https://codeloa
d.github.com/deanwampler/spark-scala-tutorial/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254501557,"owners_count":22081528,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jupyter","scala","spark","tutorial"],"created_at":"2024-07-30T19:00:42.305Z","updated_at":"2025-05-16T09:04:20.930Z","avatar_url":"https://github.com/deanwampler.png","language":"Jupyter Notebook","readme":"# Apache Spark Scala Tutorial - README\n\n[![Join the chat at https://gitter.im/deanwampler/spark-scala-tutorial](https://badges.gitter.im/deanwampler/spark-scala-tutorial.svg)](https://gitter.im/deanwampler/spark-scala-tutorial?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\n![](http://spark.apache.org/docs/latest/img/spark-logo-hd.png)\n\nDean Wampler\u003cbr/\u003e\n[deanwampler@gmail.com](mailto:deanwampler@gmail.com)\u003cbr/\u003e\n[@deanwampler](https://twitter.com/deanwampler)\n\nThis tutorial demonstrates how to write and run [Apache Spark](http://spark.apache.org) applications using Scala with some SQL. I also teach a little Scala as we go, but if you already know Spark and you are more interested in learning just enough Scala for Spark programming, see my other tutorial [Just Enough Scala for Spark](https://github.com/deanwampler/JustEnoughScalaForSpark).\n\nYou can run the examples and exercises several ways:\n\n1. Notebooks, like [Jupyter](http://jupyter.org/) - The easiest way, especially for data scientists accustomed to _notebooks_.\n2. 
In an IDE, like [IntelliJ](https://www.jetbrains.com/idea/) - Familiar for developers.\n3. At the terminal prompt using the build tool [SBT](https://www.scala-sbt.org/).\n\nThis tutorial is mostly about learning Spark, but I teach you a little Scala as we go. If you are more interested in learning just enough Scala for Spark programming, see my new tutorial [Just Enough Scala for Spark](https://github.com/deanwampler/JustEnoughScalaForSpark).\n\n\u003e **Notes:** \n\u003e\n\u003e 1. The current version of Spark used is 2.3.X, which is a bit old. (TODO!)\n\u003e 2. While the notebook approach is the easiest way to use this tutorial to learn Spark, the IDE and SBT options show details for creating Spark _applications_, i.e., writing executable programs you build and run, as well as examples that use the interactive Spark Shell.\n\n## Acknowledgments\n\nI'm grateful that several people have provided feedback, issue reports, and pull requests. In particular:\n\n* [Ivan Mushketyk](https://github.com/mushketyk)\n* [Andrey Batyuk](https://github.com/abatyuk)\n* [Colin Jones](https://github.com/trptcolin)\n* [Lutz Hühnken](https://github.com/lutzh)\n\n## Getting Help\n\nBefore describing the different ways to work with the tutorial, if you're having problems, use the [Gitter chat room](https://gitter.im/deanwampler/spark-scala-tutorial) to ask for help. You can also use the new [Discussions feature](https://github.com/deanwampler/spark-scala-tutorial/discussions) for the GitHub repo. If you're reasonably certain you've found a bug, post an issue to the [GitHub repo](https://github.com/deanwampler/spark-scala-tutorial/issues). Pull requests are welcome, too!\n\n## Setup Instructions\n\nLet's get started...\n\n### Download the Tutorial\n\nBegin by cloning or downloading the tutorial GitHub project [github.com/deanwampler/spark-scala-tutorial](https://github.com/deanwampler/spark-scala-tutorial).\n\nNow pick the way you want to work through the tutorial:\n\n1. 
Notebooks - Go [here](#use-notebooks)\n2. In an IDE, like IntelliJ - Go [here](#use-ide)\n3. At the terminal prompt using SBT - Go [here](#use-sbt)\n\n\u003ca name=\"use-notebooks\"\u003e\u003c/a\u003e\n## Using Notebooks\n\nThe easiest way to work with this tutorial is to use a [Docker](https://docker.com) image that combines the popular [Jupyter](http://jupyter.org/) notebook environment with all the tools you need to run Spark, including the Scala language. It's called the [all-spark-notebook](https://hub.docker.com/r/jupyter/all-spark-notebook/).  It bundles [Apache Toree](https://toree.apache.org/) to provide Spark and Scala access.\nThe [webpage](https://hub.docker.com/r/jupyter/all-spark-notebook/) for this Docker image covers useful topics, such as using Python as well as Scala, user authentication, and running your Spark jobs on clusters rather than in local mode.\n\nThere are other notebook options you might investigate for your needs:\n\n**Open source:**\n\n* [Polynote](https://polynote.org/) - A cross-language notebook environment with built-in Scala support. Developed by Netflix.\n* [Jupyter](https://ipython.org/) + [BeakerX](http://beakerx.com/) - a powerful set of extensions for Jupyter.\n* [Zeppelin](http://zeppelin-project.org/) - a popular tool in big data environments\n\n**Commercial:**\n\n* [Databricks](https://databricks.com/) - a feature-rich, commercial, cloud-based service\n\n## Installing Docker and the Jupyter Image\n\nIf you need to install Docker, follow the installation instructions at [docker.com](https://www.docker.com/products/overview) (the _community edition_ is sufficient).\n\nNow we'll run the Docker image. It's important to follow the next steps carefully. 
We're going to mount two local directories inside the running container, one for the data we want to use and one for the notebooks.\n\n* Open a terminal or command window\n* Change to the directory where you expanded the tutorial project or cloned the repo\n* To download and run the Docker image, run the following command: `run.sh` (MacOS and Linux) or `run.bat` (Windows)\n\nThe MacOS and Linux `run.sh` command executes this command:\n\n```bash\ndocker run -it --rm \\\n  -p 8888:8888 -p 4040:4040 \\\n  --cpus=2.0 --memory=2000M \\\n  -v \"$PWD/data\":/home/jovyan/data \\\n  -v \"$PWD/notebooks\":/home/jovyan/notebooks \\\n  \"$@\" \\\n  jupyter/all-spark-notebook\n```\n\nThe Windows `run.bat` command is similar, but uses Windows conventions.\n\nThe `--cpus=... --memory=...` arguments were added because the notebook \"kernel\" is prone to crashing with the default values. Edit to taste. Also, it will help to keep only one notebook (other than the Introduction) open at a time.\n\nThe `-v PATH:/home/jovyan/dir` tells Docker to mount the `dir` directory under your current working directory, so it's available as `/home/jovyan/dir` inside the container. _This is essential to provide access to the tutorial data and notebooks_. When you open the notebook UI (discussed shortly), you'll see these folders listed.\n\n\u003e **Note:** On Windows, you may get the following error: _C:\\Program Files\\Docker\\Docker\\Resources\\bin\\docker.exe: Error response from daemon: D: drive is not shared. Please share it in Docker for Windows Settings.\"_ If so, do the following. On your tray, next to your clock, right-click on Docker, then click on Settings. You'll see the _Shared Drives_. Mark your drive and hit apply. 
See [this Docker forum thread](https://forums.docker.com/t/cannot-share-drive-in-windows-10/28798/5) for more tips.\n\nThe `-p 8888:8888 -p 4040:4040` arguments tell Docker to \"tunnel\" ports 8888 and 4040 out of the container to your local environment, so you can get to the Jupyter UI at port 8888 and the Spark driver UI at 4040.\n\nYou should see output similar to the following:\n\n```bash\nUnable to find image 'jupyter/all-spark-notebook:latest' locally\nlatest: Pulling from jupyter/all-spark-notebook\ne0a742c2abfd: Pull complete\n...\ned25ef62a9dd: Pull complete\nDigest: sha256:...\nStatus: Downloaded newer image for jupyter/all-spark-notebook:latest\nExecute the command: jupyter notebook\n...\n[I 19:08:15.017 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).\n[C 19:08:15.019 NotebookApp]\n\n    Copy/paste this URL into your browser when you connect for the first time,\n    to login with a token:\n        http://localhost:8888/?token=...\n```\n\nNow copy and paste the URL shown in a browser window. (Use command+click in your terminal window on MacOS.)\n\n\u003e **Warning:** When you quit the Docker container at the end of the tutorial, all your changes will be lost, unless they are in the `data` and `notebooks` directories that we mounted! To save notebooks you defined in other locations, export them using the _File \u003e Download as \u003e Notebook_ menu item in the toolbar.\n\n## Running the Tutorial\n\nIn the Jupyter UI, you should see three folders, `data`, `notebooks`, and `work`. The first two are the folders we mounted. The data we'll use is in the `data` folder. The notebooks we'll use are... you get the idea.\n\nOpen the `notebooks` folder and click the link for `00_Intro.ipynb`.\n\nIt opens in a new browser tab. It may take several seconds to load.\n\n\u003e  **Tip:** If the new tab fails to open or the notebook fails to load as shown, check the terminal window where you started Jupyter. 
Are there any error messages?\n\nIf you're new to Jupyter, try _Help \u003e User Interface Tour_ to learn how to use Jupyter. At a minimum, you need to know that the content is organized into _cells_. You can navigate with the up and down arrow keys or by clicking. When you come to a cell with code, either click the _run_ button in the toolbar or use shift+return to execute the code.\n\nRead through the Introduction notebook, then navigate to the examples using the table near the bottom. I've set up the table so that clicking each link opens a new browser tab.\n\n\u003ca name=\"use-ide\"\u003e\u003c/a\u003e\n### Use an IDE\n\nThe tutorial is also set up as a project using the build tool [SBT](https://www.scala-sbt.org/). The popular IDEs, like [IntelliJ](https://www.jetbrains.com/idea/) with the Scala plugin (required) and [Eclipse with Scala](http://scala-ide.org/), can import an SBT project and automatically create an IDE project from it.\n\nOnce imported, you can run the Spark job examples as regular applications. There are some examples implemented as scripts that need to be run using the Spark Shell or the SBT console. The tutorial goes into the details.\n\nYou are now ready to go through the [tutorial](Tutorial.markdown).\n\n\u003ca name=\"use-sbt\"\u003e\u003c/a\u003e\n### Use SBT in a Terminal\n\nUsing [SBT](https://www.scala-sbt.org/) in a terminal is a good approach if you prefer to use a code editor like Emacs, Vim, or SublimeText. You'll need to [install SBT](https://www.scala-sbt.org/download.html), but not Scala or Spark. Those dependencies will be resolved when you build the software.\n\nStart the `sbt` console, then build the code, where `sbt:spark-scala-tutorial\u003e` is the prompt I've configured for the project. 
Running `test` compiles the code and runs the tests, while `package` creates a jar file of the compiled code and configuration files:\n\n```shell\n$ sbt\n...\nsbt:spark-scala-tutorial\u003e test\n...\nsbt:spark-scala-tutorial\u003e package\n...\nsbt:spark-scala-tutorial\u003e\n```\n\nYou are now ready to go through the [tutorial](Tutorial.markdown).\n\n## Going Forward from Here\n\nTo learn more, see the following resources:\n\n* [Lightbend's Fast Data Platform](http://lightbend.com/fast-data-platform) - a curated, fully-supported distribution of open-source streaming and microservice tools, like Spark, Kafka, HDFS, Akka Streams, etc.\n* The Apache Spark [website](http://spark.apache.org/).\n* [Talks from the Spark Summit conferences](http://spark-summit.org).\n* [Learning Spark](http://shop.oreilly.com/product/0636920028512.do), an excellent introduction from O'Reilly, if now a bit dated.\n\n## Final Thoughts\n\nThank you for working through this tutorial. Feedback and pull requests are welcome.\n\n[Dean Wampler](mailto:dean.wampler@lightbend.com)\n","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeanwampler%2Fspark-scala-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeanwampler%2Fspark-scala-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeanwampler%2Fspark-scala-tutorial/lists"}