{"id":32196731,"url":"https://github.com/ngrunwald/datasplash","last_synced_at":"2026-02-23T05:02:05.575Z","repository":{"id":30617273,"uuid":"34172649","full_name":"ngrunwald/datasplash","owner":"ngrunwald","description":"Clojure API for a more dynamic Google Dataflow","archived":false,"fork":false,"pushed_at":"2026-01-28T08:24:18.000Z","size":890,"stargazers_count":131,"open_issues_count":5,"forks_count":32,"subscribers_count":17,"default_branch":"master","last_synced_at":"2026-01-28T23:53:11.484Z","etag":null,"topics":["apache-beam","clojure","distributed-computing","google-cloud","google-dataflow"],"latest_commit_sha":null,"homepage":"https://cljdoc.org/d/datasplash/datasplash/","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"epl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ngrunwald.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2015-04-18T16:11:03.000Z","updated_at":"2025-10-24T10:57:09.000Z","dependencies_parsed_at":"2024-05-07T12:41:53.722Z","dependency_job_id":"2ff5b322-7b62-4dd6-b763-4516b7d68858","html_url":"https://github.com/ngrunwald/datasplash","commit_stats":null,"previous_names":[],"tags_count":50,"template":false,"template_full_name":null,"purl":"pkg:github/ngrunwald/datasplash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngrunwald%2Fdatasplash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngrunwald%2Fdatasplash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/ngrunwald%2Fdatasplash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngrunwald%2Fdatasplash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ngrunwald","download_url":"https://codeload.github.com/ngrunwald/datasplash/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngrunwald%2Fdatasplash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29738083,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-23T04:51:08.365Z","status":"ssl_error","status_checked_at":"2026-02-23T04:49:15.865Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-beam","clojure","distributed-computing","google-cloud","google-dataflow"],"created_at":"2025-10-22T02:41:31.663Z","updated_at":"2026-02-23T05:02:05.570Z","avatar_url":"https://github.com/ngrunwald.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Datasplash\n\n[![Clojars Project](https://img.shields.io/clojars/v/datasplash.svg)](https://clojars.org/datasplash)\n\n[![cljdoc badge](https://cljdoc.org/badge/datasplash/datasplash)](https://cljdoc.org/d/datasplash/datasplash/CURRENT)\n\n\nClojure API for a more dynamic [Google Cloud Dataflow][gcloud] and (not 
really\nbattle tested) any other [Apache Beam][beam] backend.\n\n[gcloud]: https://cloud.google.com/dataflow/\n[beam]: https://beam.apache.org/\n\n## Usage\n\n[API docs](https://cljdoc.org/d/datasplash/datasplash/CURRENT/api/datasplash)\n\nYou can also see ports of the official Dataflow examples in the\n`datasplash.examples` namespace.\n\nHere is the classic word count example.\n\n:information_source: You will need to run `(compile 'datasplash.examples)`\nevery time you make a change.\n\n```clojure\n(ns datasplash.examples\n  (:require [clojure.string :as str]\n            [datasplash.api :as ds]\n            [datasplash.options :refer [defoptions]])\n  (:gen-class))\n\n(defn tokenize\n  [^String l]\n  (remove empty? (.split (str/trim l) \"[^a-zA-Z']+\")))\n\n(defn count-words\n  [p]\n  (ds/-\u003e\u003e :count-words p\n          (ds/mapcat tokenize {:name :tokenize})\n          (ds/frequencies)))\n\n(defn format-count\n  [[k v]]\n  (format \"%s: %d\" k v))\n\n(defoptions WordCountOptions\n  {:input {:default \"gs://dataflow-samples/shakespeare/kinglear.txt\"\n           :type String}\n   :output {:default \"kinglear-freqs.txt\" :type String}\n   :numShards {:default 0 :type Long}})\n\n(defn -main\n  [\u0026 str-args]\n  (let [p (ds/make-pipeline WordCountOptions str-args)\n        {:keys [input output numShards]} (ds/get-pipeline-options p)]\n    (-\u003e\u003e p\n         (ds/read-text-file input {:name \"King-Lear\"})\n         (count-words)\n         (ds/map format-count {:name :format-count})\n         (ds/write-text-file output {:num-shards numShards})\n         (ds/run-pipeline))))\n```\n\n### Run it from the repl\n\nLocally on your machine using a DirectRunner:\n\n```clojure\n(in-ns 'datasplash.examples)\n(clojure.core/compile 'datasplash.examples)\n(-main \"--input=sometext.txt\" \"--output=out-freq.txt\" \"--numShards=1\")\n```\n\nRemotely on Google Cloud using a DataflowRunner:\n\nYou should have properly configured your Google Cloud account and 
Dataflow\naccess from your machine.\n\n```clojure\n(in-ns 'datasplash.examples)\n(clojure.core/compile 'datasplash.examples)\n(-main \"--project=my-project\"\n       \"--runner=DataflowRunner\"\n       \"--gcpTempLocation=gs://bucket/tmp\"\n       \"--input=gs://apache-beam-samples/shakespeare/kinglear.txt\"\n       \"--output=gs://bucket/outputs/kinglear-freq.txt\"\n       \"--numShards=1\")\n```\n\n### Run it as a standalone program\n\nDatasplash needs to be AOT compiled, so you should prepare an uberjar and\nrun from your main entry like so:\n\n```bash\njava -jar my-dataflow-job-uber.jar [beam-args]\n```\n\n\n## Caveats\n\n- Due to the way the code is loaded when running in distributed mode, you may\n  get some exceptions about unbound vars, especially when using instances with\n  a high number of cpu. They will not however cause the job to fail and are of\n  no consequences. They are caused by the need to prep the Clojure runtime when\n  loading the class files in remote instances and some tricky business with\n  locks and `require`.\n- If you have to write your own low-level `ParDo` objects (you shouldn't), wrap\n  all your code in the `safe-exec` macro to avoid issues with unbound vars. Any\n  good idea about finding a better way to do this would be greatly appreciated!\n- If some of the `UserCodeException` as seen in the cloud UI are mangled and\n  missing the relevant part of the Clojure source code, this is due to a bug\n  with the way the sdk mangles stacktraces in Clojure. In this case look for\n  _ClojureRuntimeException_ in the logs to find the original unaltered\n  stacktrace.\n- Beware of using Clojure 1.9: `proxy` results are not `Serializable` anymore,\n  so you cannot use anywhere in your pipeline Clojure code that uses proxy. 
Use\n  a Java shim for these objects instead.\n- If you see something like `java.lang.ClassNotFoundException: Options` you\n  probably forgot to compile your namespace.\n- Whenever you need to check some spec in user code, you must first load\n  those specs because they may not be loaded in your Clojure runtime. But don't\n  use `(require)` because it is not thread-safe.\n  See [[this issue]](https://clojure.atlassian.net/browse/CLJ-1876) for a workaround.\n- If you see a `java.io.IOException: No such file or directory` when invoking\n  `compile`, make sure there is a directory in your project root that matches\n  the value of `*compile-path*` (default `classes`).\n\n## About compression libraries\n\nThe Beam Java SDK does not pull in the Zstd library by default, so it is the\nuser's responsibility to declare an explicit dependency on `zstd-jni`. Attempts\nto read or write _.zst_ files without this library loaded will result in\na `NoClassDefFoundError` at runtime.\n\nThe Beam Java SDK does not pull in the required libraries for LZOP compression\nby default, so it is the user's responsibility to declare an explicit\ndependency on `io.airlift:aircompressor` and\n`com.facebook.presto.hadoop:hadoop-apache2`. Attempts to read or write _.lzo_\nfiles without those libraries loaded will result in a `NoClassDefFoundError`\nat runtime.\n\nSee Apache Beam [Compression enum][] for details.\n\n[Compression enum]: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/Compression.html\n\n## License\n\nCopyright © 2015-2026 Oscaro.com\n\nDistributed under the Eclipse Public License either version 1.0 or (at your\noption) any later version.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fngrunwald%2Fdatasplash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fngrunwald%2Fdatasplash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fngrunwald%2Fdatasplash/lists"}