{"id":305131,"url":"https://github.com/tweag/sparkle","last_synced_at":"2025-05-16T09:04:00.079Z","repository":{"id":41258201,"uuid":"45848386","full_name":"tweag/sparkle","owner":"tweag","description":"Haskell on Apache Spark.","archived":false,"fork":false,"pushed_at":"2023-02-01T17:37:04.000Z","size":1155,"stargazers_count":449,"open_issues_count":16,"forks_count":27,"subscribers_count":67,"default_branch":"master","last_synced_at":"2025-04-25T14:59:51.742Z","etag":null,"topics":["analytics","apache-spark","haskell","spark"],"latest_commit_sha":null,"homepage":"","language":"Haskell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tweag.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2015-11-09T15:49:47.000Z","updated_at":"2025-01-21T20:19:31.000Z","dependencies_parsed_at":"2023-02-12T04:33:05.007Z","dependency_job_id":null,"html_url":"https://github.com/tweag/sparkle","commit_stats":{"total_commits":755,"total_committers":26,"mean_commits":29.03846153846154,"dds":0.5456953642384106,"last_synced_commit":"2031e08e098aa149da4e5fcb7ee3c6167ea6d484"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tweag%2Fsparkle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tweag%2Fsparkle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tweag%2Fsparkle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tweag%2Fsparkle/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/tweag","download_url":"https://codeload.github.com/tweag/sparkle/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254501556,"owners_count":22081528,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","apache-spark","haskell","spark"],"created_at":"2024-01-07T07:56:18.678Z","updated_at":"2025-05-16T09:04:00.051Z","avatar_url":"https://github.com/tweag.png","language":"Haskell","funding_links":[],"categories":["Table of Contents","Packages"],"sub_categories":["Tools","Language Bindings"],"readme":"# sparkle: Apache Spark applications in Haskell\n\n[![Build](https://github.com/tweag/sparkle/actions/workflows/build.yml/badge.svg?branch=master)](https://github.com/tweag/sparkle/actions/workflows/build.yml)\n\n*sparkle [spär′kəl]:* a library for writing resilient analytics\napplications in Haskell that scale to thousands of nodes, using\n[Spark][spark] and the rest of the Apache ecosystem under the hood.\nSee [this blog post][hello-sparkle] for the details.\n\n[spark]: http://spark.apache.org/\n[hello-sparkle]: http://www.tweag.io/posts/2016-02-25-hello-sparkle.html\n\n## Getting started\n\nThe tl;dr using the `hello` app as an example on your local machine:\n```\n$ nix-shell --pure --run \"bazel build //apps/hello:sparkle-example-hello_deploy.jar\"\n$ nix-shell --pure --run \"bazel run spark-submit -- --packages com.amazonaws:aws-java-sdk:1.11.920,org.apache.hadoop:hadoop-aws:2.10.2 $PWD/bazel-bin/apps/hello/sparkle-example-hello_deploy.jar\"\n```\n\nYou'll need 
[Nix][nix] for the above to work.\n\n## How it works\n\nsparkle is a tool for creating self-contained Spark applications in\nHaskell. Spark applications are typically distributed as JAR files, so\nthat's what sparkle creates. We embed Haskell native object code as\ncompiled by GHC in these JAR files, along with any shared library\nrequired by this object code to run. Spark dynamically loads this\nobject code into its address space at runtime and interacts with it\nvia the Java Native Interface (JNI).\n\n## How to use\n\nTo run a Spark application the process is as follows:\n\n1. **create** an application in the `apps/` folder, in-repo or as\n   a submodule;\n1. **build** the app;\n1. **submit** it to a local or cluster deployment of Spark.\n\n**If you run into issues, read the Troubleshooting section below\n  first.**\n\n### Build\n\n#### Linux\n\nInclude the following in a `BUILD.bazel` file next to your source code.\n```\npackage(default_visibility = [\"//visibility:public\"])\n\nload(\n  \"@rules_haskell//haskell:defs.bzl\",\n  \"haskell_library\",\n)\n\nload(\"@io_tweag_sparkle//:sparkle.bzl\", \"sparkle_package\")\n\n# hello-hs needs to contain a Main module with a main function.\n# This main function will be invoked by spark.\nhaskell_library (\n  name = \"hello-hs\",\n  srcs = ...,\n  deps = ...,\n  ...\n)\n\nsparkle_package(\n  name = \"sparkle-example-hello\",\n  src = \":hello-hs\",\n)\n```\n\nYou might want to add the following settings to your `.bazelrc.local`\nfile.\n```\ncommon --repository_cache=~/.bazel_repo_cache\ncommon --disk_cache=~/.bazel_disk_cache\ncommon --local_cpu_resources=4\n```\n\nAnd then ask [Bazel][bazel] to build a *deploy* jar file.\n\n```\n$ nix-shell --pure --run \"bazel build //apps/hello:sparkle-example-hello_deploy.jar\"\n```\n\n#### Other platforms\n\n`sparkle` builds in Mac OS X, but running it requires installing binaries\nfor `Spark` and maybe `Hadoop` (See 
[.github/workflows/build.yml](.github/workflows/build.yml)).\n\nAnother alternative is to build and run `sparkle` via Docker on non-Linux\nplatforms, using a Docker image provisioned with Nix.\n\n#### Integrating `sparkle` in another project\n\nAs `sparkle` interacts with the JVM, you need to tell `ghc`\nwhere JVM-specific headers and libraries are. It needs to be able to\nlocate `jni.h`, `jni_md.h` and `libjvm.so`.\n\n`sparkle` uses `inline-java` to embed fragments of Java code in Haskell\nmodules, which requires running the `javac` compiler, which must be\navailable in the `PATH` of the shell. Moreover, `javac` needs to find\nthe Spark classes that `inline-java` quotations refer to. Therefore,\nthese classes need to be added to the `CLASSPATH` when building sparkle.\nDepending on your build system, how to do this might vary. In this\nrepo, we use `gradle` to install Spark, and we query `gradle` to get\nthe paths we need to add to the `CLASSPATH`.\n\nAdditionally, the classes need to be found at runtime to load them.\nThe main thread can find them, but other threads need to invoke\n`initializeSparkThread` or `runInSparkThread` from\n`Control.Distributed.Spark`.\n\nIf the `main` function terminates with unhandled exceptions, they\ncan be propagated to Spark with\n`Control.Distributed.Spark.forwardUnhandledExceptionsToSpark`. This\nallows Spark both to report the exception and to clean up before\ntermination.\n\n### Submit\n\nFinally, to run your application, for example locally:\n\n```\n$ nix-shell --pure --run \"bazel run spark-submit -- /path/to/$PWD/\u003capp-target-name\u003e_deploy.jar\"\n```\n\nThe `\u003capp-target-name\u003e` is the name of the Bazel target producing the jar file. 
See apps in\nthe [apps/](apps/) folder for examples.\n\nRTS options can be passed as a Java property\n```\n$ nix-shell --pure --run \"bazel run spark-submit -- --driver-java-options=-Dghc_rts_opts='+RTS\\ -s\\ -RTS' \u003capp-target-name\u003e_deploy.jar\"\n```\nor as command-line arguments\n```\n$ nix-shell --pure --run \"bazel run spark-submit -- \u003capp-target-name\u003e_deploy.jar +RTS -s -RTS\"\n```\n\nSee [here][spark-submit] for other options, including launching\na [whole cluster from scratch on EC2][spark-ec2]. This\n[blog post][tweag-blog-haskell-paas] shows you how to get started on\nthe [Databricks hosted platform][databricks] and on\n[Amazon's Elastic MapReduce][aws-emr].\n\n[bazel]: https://bazel.build\n[docker-build-img]: https://hub.docker.com/r/tweag/sparkle/\n[spark-submit]: http://spark.apache.org/docs/1.6.2/submitting-applications.html\n[spark-ec2]: http://spark.apache.org/docs/1.6.2/ec2-scripts.html\n[nix]: http://nixos.org/nix\n[tweag-blog-haskell-paas]: http://www.tweag.io/posts/2016-06-20-haskell-compute-paas-with-sparkle.html\n[databricks]: https://databricks.com/\n[aws-emr]: https://aws.amazon.com/emr/\n\n## Troubleshooting\n\n### JNI calls in auxiliary threads fail with ClassNotFoundException\n\nThe context class loader of threads needs to be set appropriately\nbefore JNI calls can find classes in Spark. Calling\n`initializeSparkThread` or `runInSparkThread` from\n`Control.Distributed.Spark` should set it.\n\n### Anonymous classes in inline-java quasiquotes fail to deserialize\n\nWhen using inline-java, it is recommended to use the Kryo serializer,\nwhich is currently not the default in Spark but is faster anyway. If\nyou don't use the Kryo serializer, objects of anonymous classes, which\narise e.g. when using Java 8 function literals,\n\n```haskell\nfoo :: RDD Int -\u003e IO (RDD Bool)\nfoo rdd = [java| $rdd.map((Integer x) -\u003e x.equals(0)) |]\n```\n\nwon't be deserialized properly in multi-node setups. 
To avoid this\nproblem, switch to the Kryo serializer by setting the following\nconfiguration properties in your `SparkConf`:\n\n```haskell\ndo conf \u003c- newSparkConf \"some spark app\"\n   confSet conf \"spark.serializer\" \"org.apache.spark.serializer.KryoSerializer\"\n   confSet conf \"spark.kryo.registrator\" \"io.tweag.sparkle.kryo.InlineJavaRegistrator\"\n```\n\nSee [#104](https://github.com/tweag/sparkle/issues/104) for more\ndetails.\n\n### java.lang.UnsatisfiedLinkError: /tmp/sparkle-app...: failed to map segment from shared object\n\nSparkle unzips the Haskell binary program in a temporary location on\nthe filesystem and then loads it from there. For loading to succeed, the\ntemporary location must not be mounted with the `noexec` option.\nAlternatively, the temporary location can be changed with\n```\nspark-submit --driver-java-options=\"-Djava.io.tmpdir=...\" \\\n             --conf \"spark.executor.extraJavaOptions=-Djava.io.tmpdir=...\"\n```\n\n### java.io.IOException: No FileSystem for scheme: s3n\n\nSpark 2.4 requires explicitly specifying extra JAR files to `spark-submit`\nin order to work with AWS. To work around this, add an additional 'packages'\nargument when submitting the job:\n\n```\nspark-submit --packages com.amazonaws:aws-java-sdk:1.11.920,org.apache.hadoop:hadoop-aws:2.8.4\n```\n\n## License\n\nCopyright (c) 2015-2016 EURL Tweag.\n\nAll rights reserved.\n\nsparkle is free software, and may be redistributed under the terms\nspecified in the [LICENSE](LICENSE) file.\n\n## Sponsors\n\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n[![Tweag I/O](http://i.imgur.com/0HK8X4y.png)](http://tweag.io)\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n[![LeapYear](http://i.imgur.com/t9VxRHn.png)](http://leapyear.io)\n\nsparkle is maintained by [Tweag I/O](http://tweag.io/).\n\nHave questions? Need help? 
Tweet at\n[@tweagio](http://twitter.com/tweagio).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftweag%2Fsparkle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftweag%2Fsparkle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftweag%2Fsparkle/lists"}