{"id":13682752,"url":"https://github.com/alexarchambault/ammonite-spark","last_synced_at":"2025-03-17T16:13:08.240Z","repository":{"id":33758143,"uuid":"142422727","full_name":"alexarchambault/ammonite-spark","owner":"alexarchambault","description":"Run spark calculations from Ammonite","archived":false,"fork":false,"pushed_at":"2024-08-21T12:54:36.000Z","size":935,"stargazers_count":118,"open_issues_count":37,"forks_count":18,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-10-11T08:46:13.874Z","etag":null,"topics":["ammonite","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alexarchambault.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-26T09:58:28.000Z","updated_at":"2024-09-11T06:00:59.000Z","dependencies_parsed_at":"2024-01-14T16:07:31.134Z","dependency_job_id":"9b2ba48d-cf4e-4a9a-9e51-51790199ac57","html_url":"https://github.com/alexarchambault/ammonite-spark","commit_stats":{"total_commits":241,"total_committers":7,"mean_commits":34.42857142857143,"dds":"0.46058091286307057","last_synced_commit":"d30e05a733d81ce3523caab5602080243be78688"},"previous_names":[],"tags_count":44,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexarchambault%2Fammonite-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexarchambault%2Fammonite-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexarchambault%
2Fammonite-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexarchambault%2Fammonite-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alexarchambault","download_url":"https://codeload.github.com/alexarchambault/ammonite-spark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244066189,"owners_count":20392406,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ammonite","scala","spark"],"created_at":"2024-08-02T13:01:52.551Z","updated_at":"2025-03-17T16:13:08.211Z","avatar_url":"https://github.com/alexarchambault.png","language":"Scala","funding_links":[],"categories":["Scala","Containers \u0026 Language Extentions \u0026 Linting"],"sub_categories":["For Scala"],"readme":"# ammonite-spark\n\nRun [Spark](https://spark.apache.org/) calculations from [Ammonite](http://ammonite.io/)\n\n[![Build Status](https://github.com/alexarchambault/ammonite-spark/actions/workflows/ci.yml/badge.svg)](https://github.com/alexarchambault/ammonite-spark/actions/workflows/ci.yml?query=branch%3Amain)\n\n*ammonite-spark* lets you create `SparkSession`s from Ammonite. It passes some Ammonite internals to a `SparkSession`, so that Spark calculations can be driven from Ammonite, as one would do from a [spark-shell](https://spark.apache.org/docs/2.3.1/quick-start.html#interactive-analysis-with-the-spark-shell).\n\n\u003cimg src=\"ammonite-spark.png\" width=\"800\"\u003e\n\n## Table of contents\n\n1. [Quick start](#quick-start)\n2. 
[`AmmoniteSparkSession` vs `SparkSession`](#ammonitesparksession-vs-sparksession)\n   1. [Syncing dependencies](#syncing-dependencies)\n3. [Using with standalone cluster](#using-with-standalone-cluster)\n4. [Using with YARN cluster](#using-with-yarn-cluster)\n5. [Troubleshooting](#troubleshooting)\n6. [Compatibility](#compatibility)\n\n\n\n## Quick start\n\nStart Ammonite \u003e= [`1.6.3`](https://github.com/lihaoyi/Ammonite/releases/download/1.6.3/2.11-1.6.3), with the `--class-based` option.\nThe `--tmp-output-directory` option, available since Ammonite `3.0.0-M0-14-c12b6a59`, is also recommended, especially in \"tight\" network environments (Kubernetes, …). The [compatibility section](#compatibility) lists the compatible versions of Ammonite and ammonite-spark. Start Ammonite either by following [the Ammonite instructions](http://ammonite.io/#Ammonite-REPL) on its website and then running\n```\n$ amm --class-based --tmp-output-directory\n```\nor by using [coursier](https://github.com/coursier/coursier),\n```\n$ cs launch ammonite:3.0.0-M0-60-89836cd8 --scala 2.13.12 -- --class-based --tmp-output-directory\n```\nor [Scala CLI](https://github.com/VirtusLab/scala-cli),\n```\n$ scala-cli repl --amm --ammonite-version 3.0.0-M0-60-89836cd8 --scala 2.13.12 -- --class-based --tmp-output-directory\n```\nEnsure you are using Scala 2.12 or 2.13, the only supported Scala versions at the time of writing.\n\nAt the Ammonite prompt, load the Spark 2.x or 3.x version of your choice, along with ammonite-spark,\n```scala\n@ import $ivy.`org.apache.spark::spark-sql:3.3.0`\n@ import $ivy.`sh.almond::ammonite-spark:0.13.12`\n```\n(Note the `::` before `spark-sql` and `ammonite-spark`, as these are Scala dependencies.)\n\nThen create a `SparkSession` using the builder provided by *ammonite-spark*\n```scala\n@ import org.apache.spark.sql._\n\n@ val spark = {\n    AmmoniteSparkSession.builder()\n      .master(\"local[*]\")\n      .getOrCreate()\n  }\n```\n\nNote the use of 
`AmmoniteSparkSession.builder()`, instead of the `SparkSession.builder()` that one would use when, e.g., writing a Spark job.\n\nThe builder returned by `AmmoniteSparkSession.builder()` extends the one returned by `SparkSession.builder()`, so that one can call `.appName(\"foo\")`, `.config(\"key\", \"value\")`, etc. on it.\n\nSee below for how to use it with [standalone clusters](#using-with-standalone-cluster), and how to use it with [YARN clusters](#using-with-yarn-cluster).\n\nNote that *ammonite-spark* does *not* rely on a Spark distribution. The driver and executor classpaths are handled from the Ammonite session only, via ``import $ivy.`…` `` statements. See [INTERNALS](https://github.com/alexarchambault/ammonite-spark/blob/develop/INTERNALS.md) for more details.\n\nYou can then run Spark calculations, like\n```scala\n@ def sc = spark.sparkContext // 'def' recommended over 'val', to work around SparkContext Java serialization issues\n\n@ val rdd = sc.parallelize(1 to 100, 10)\n\n@ val n = rdd.map(_ + 1).sum()\n```\n\n## Using with standalone cluster\n\nSimply set the master to `spark://…` when building the session, e.g.\n```scala\n@ val spark = {\n    AmmoniteSparkSession.builder()\n      .master(\"spark://localhost:7077\")\n      .config(\"spark.executor.instances\", \"4\")\n      .config(\"spark.executor.memory\", \"2g\")\n      .getOrCreate()\n  }\n```\n\nEnsure the version of Spark used to start the master and executors matches the one loaded in the Ammonite session (via e.g. 
``import $ivy.`org.apache.spark::spark-sql:X.Y.Z` ``), and that the machine running Ammonite can access, and is accessible from, all nodes of the standalone cluster.\n\n## Using with YARN cluster\n\nSet the master to `\"yarn\"` when building the session, e.g.\n```scala\n@ val spark = {\n    AmmoniteSparkSession.builder()\n      .master(\"yarn\")\n      .config(\"spark.executor.instances\", \"4\")\n      .config(\"spark.executor.memory\", \"2g\")\n      .getOrCreate()\n  }\n```\n\nEnsure the configuration directory of the cluster is set in `HADOOP_CONF_DIR` or `YARN_CONF_DIR` in the environment, or is available at `/etc/hadoop/conf`. This directory should contain files like `core-site.xml`, `hdfs-site.xml`, … Also ensure that the machine you run Ammonite on can indeed act as the driver (it should have access to and be accessible from the YARN nodes, etc.).\n\nBefore raising issues, ensure you are aware of everything that needs to be set up to get a working spark-shell from a Spark distribution, and that all of it is passed in one way or another to the `SparkSession` created from Ammonite.\n\n## Troubleshooting\n\n### Getting `org.apache.spark.sql.AnalysisException` when calling `.toDS`\n\nAdd `org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)` on the same lines as those where you define the case classes involved, like\n```scala\n@ import spark.implicits._\nimport spark.implicits._\n\n@ org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this); case class Foo(id: String, value: Int)\ndefined class Foo\n\n@ val ds = List(Foo(\"Alice\", 42), Foo(\"Bob\", 43)).toDS\nds: Dataset[Foo] = [id: string, value: int]\n```\n\n(This should likely be added automatically in the future.)\n\n## Compatibility\n\nammonite-spark relies on the API of Ammonite, which undergoes\nnon-backward-compatible changes from time to time. 
The following table lists\nthe Ammonite version each ammonite-spark release is built against, and is\ntherefore known to be compatible with.\n\n| ammonite-spark   | Ammonite | almond |\n|------------------|----------|--------|\n| `0.1.2`, `0.1.3` | `1.3.2`  |        |\n| `0.2.0`          | `1.5.0`  | `0.2.0` |\n| `0.3.0`          | `1.6.3`  | `0.3.0` |\n| `0.4.0`          | `1.6.5`  | `0.4.0` |\n| `0.4.1`          | `1.6.6`  | `0.5.0` |\n| `0.4.2`          | `1.6.7`  | `0.5.0` |\n| `0.5.0`          | `1.6.9-8-2a27ffe`  | `0.6.0` |\n| `0.6.0`, `0.6.1` | `1.6.9-15-6720d42`  | `0.7.0`, `0.8.0` |\n| `0.7.0`          | `1.7.1`  | `0.8.1` |\n| `0.7.1`          | `1.7.3-3-b95f921`  |         |\n| `0.7.2`          | `1.7.4`  | `0.8.2`, `0.8.3` |\n| `0.8.0`          | `1.8.1`  |         |\n| `0.9.0`          | `2.0.4`  |         |\n| `0.10.0`          | `2.1.4`  | `0.10.0` |\n| `0.10.1`          | `2.1.4`  | `0.10.1` |\n| `0.11.0`          | `2.3.8-36-1cce53f3`  | `0.11.0` |\n| `0.12.0`          | `2.3.8-122-9be39deb`  | skipped |\n| `0.13.0`          | `2.5.4-8-30448e49` | `0.13.0` |\n| `0.13.1`          | `2.5.4-13-1ebd00a6` | `0.13.1` |\n| `0.13.2`          | `2.5.4-14-dc4c47bc` | `0.13.2` |\n| ...               | ...                 | ...      |\n| `0.13.9`          | `3.0.0-M0-17-e7a04255` | `0.13.11` |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexarchambault%2Fammonite-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falexarchambault%2Fammonite-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexarchambault%2Fammonite-spark/lists"}