https://github.com/atdixon/thurber
Clojure ++ Apache Beam ++ Google Cloud Dataflow
- Host: GitHub
- URL: https://github.com/atdixon/thurber
- Owner: atdixon
- License: epl-1.0
- Created: 2019-11-03T21:23:08.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2022-11-11T17:51:17.000Z (about 2 years ago)
- Last Synced: 2024-10-28T13:42:28.843Z (2 months ago)
- Language: Clojure
- Homepage:
- Size: 630 KB
- Stars: 111
- Watchers: 10
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-beam - thurber - Clojure wrapper for Apache Beam
README
# thurber
![thurber](img/thurber.png)
[![Clojars Project](https://img.shields.io/clojars/v/com.github.atdixon/thurber.svg)](https://clojars.org/com.github.atdixon/thurber)
[Apache Beam](https://beam.apache.org/) and
[Google Cloud Dataflow](https://cloud.google.com/dataflow) on
~~steroids~~ Clojure. The [walkthrough](./demo/walkthrough.clj) explains everything.

Release notes are [here](https://github.com/atdixon/thurber/releases).

* [Quickstart](#quickstart)
* [Project Goals](#project-goals)
* [Documentation](#documentation)
* [Demos](#demos)
    * [Word Count](#word-count)
    * [Mobile Gaming Examples](#mobile-gaming-examples)
* [I/O Transforms](#io-transforms)
* [Performance](#performance)
    * [Tips](#performance-tuning-tips)
* [More Help](#more-help)
* [Donate](#donate)

## Quickstart

1. Clone & `cd` into this repository.
2. `lein repl`
3. Copy & paste:

```clojure
(ns try-thurber
  (:require [thurber :as th]
            [thurber.sugar :refer :all]))

(->
  (th/create-pipeline)

  (th/apply!
    (read-text-file
      "demo/word_count/lorem.txt")
    (th/fn* extract-words [sentence]
      (remove empty? (.split sentence "[^\\p{L}]+")))
    (count-per-element)
    (th/fn* format-as-text
      [[k v]] (format "%s: %d" k v))
    (log-sink))

  (th/run-pipeline!))
```

Output:

```
...
INFO thurber - extremely: 1
INFO thurber - undertakes: 1
INFO thurber - pleasure: 7
INFO thurber - you: 2
...
```

## Project Goals

* **Enable Clojure**
    * Bring Clojure's powerful, expressive toolkit (destructuring,
      immutability, REPL, async tools, etc.) to Apache Beam.
* **REPL Oriented**
    * Functions are idiomatic, pure Clojure functions by default. (E.g., lazy
      sequences are supported, making iterative event output optional.)
    * Develop and test pipelines incrementally from the REPL.
    * Evaluate and learn Beam semantics (windowing, triggering) interactively.
* **Avoid Macros**
    * Limit macro infection. Most thurber constructions are macro-less; use of any
      thurber macro constructions (like inline functions) is optional.
* **AOT Nothing**
    * Fully dynamic experience. Reload namespaces at whim. thurber's dependencies on
      Beam, Clojure, etc. are completely dynamic and can float across versions. No forced
      AOT'd dependencies.
* **No Lock-in**
    * Pipelines can be composed of Clojure and Java transforms.
      Incrementally refactor your pipeline to Clojure or back to Java.
* **Not Afraid of Java Interop**
    * Wherever Clojure's [Java Interop](https://clojure.org/reference/java_interop)
      is performant and works cleanly with Beam's fluent API, encourage it; facade/sugar
      functions are simple to create and left to your own domain-specific implementations.
* **Completeness**
    * Support all Beam capabilities (Transforms, State & Timers, Side Inputs,
      Output Tags, etc.).
* **Performance**
    * Be finely tuned for data streaming.

## Documentation

* [Code Walkthrough](./demo/walkthrough.clj)
* [Troubleshooting](./doc/troubleshooting.md)
* [Beam Tutorial](./doc/beam-tutorial.md)

## Demos

Each namespace in the `demo/` source directory is a pipeline written in Clojure
using thurber. Comments in the source highlight salient aspects of thurber usage.

Along with the [code walkthrough](./demo/walkthrough.clj), these are the best way to learn
thurber's API, and they serve as recipes for various scenarios (use of tags, side inputs,
windowing, combining, Beam's State API, etc.).

To execute a demo, start a REPL and evaluate `(demo!)` from within the respective namespace.

### Word Count
The `word_count` package contains ports of Beam's
[Word Count Examples](https://beam.apache.org/get-started/wordcount-example/)
to Clojure/thurber.

### Mobile Gaming Examples

Beam's Mobile Gaming Examples (documented [here](https://beam.apache.org/get-started/mobile-gaming-example/))
have been ported to Clojure using thurber.

These are fully functional ports. They require deployment to GCP Dataflow:

* How to Run Beam Mobile Gaming Examples (thurber): [Detailed Instructions](./doc/running-mobile-gaming-examples.md)
## I/O Transforms
Beam has many I/O transforms — see [here](https://beam.apache.org/documentation/io/built-in/).
KafkaIO, for example, has some configuration nuances:
* [./demo/kafka/simple-consumer](./demo/kafka/simple_consumer.clj)
  shows how to configure a Kafka-consuming pipeline using thurber/Clojure.

If you need help using thurber/Clojure with another I/O transform, you can
[open an issue](https://github.com/atdixon/thurber/issues?utf8=✓&q=is%3Aissue+label%3Ademo+)
to request any thurber demo code you'd like to see.

## Performance

Streaming/big data implies hot code paths. thurber's core has been tuned for performance in various ways,
but you may benefit from tuning your own pipeline code:

### Performance Tuning Tips

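To make the type-hint and primitive-array tips in this section concrete, here is a minimal plain-Clojure sketch. It uses no thurber or Beam API; the function names `word-length-reflective`, `word-length-hinted`, and `sum-longs` are hypothetical and exist only for illustration.

```clojure
;; Ask the compiler to report interop calls it cannot resolve at compile time.
(set! *warn-on-reflection* true)

;; No hint: the compiler does not know the type of `word`, so the
;; .length call is resolved reflectively on every invocation.
(defn word-length-reflective [word]
  (.length word))

;; ^String hint: the compiler emits a direct call to String.length().
(defn word-length-hinted [^String word]
  (.length word))

;; ^longs hint: `alength` and `aget` resolve to their primitive long[]
;; overloads, so the loop runs on unboxed longs.
(defn sum-longs [^longs xs]
  (loop [i 0
         acc 0]
    (if (< i (alength xs))
      (recur (inc i) (+ acc (aget xs i)))
      acc)))
```

Both length functions return the same result; only the unhinted one should trigger a reflection warning when loaded. The same kind of hint can be applied to the arguments of your own stream functions where interop calls sit on a hot path.
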
* Use Clojure [**type hints**](https://clojure.org/reference/java_interop#typehints)
  liberally within your stream functions.
    - The cost of Java method invocation-by-reflection can be very high, and type hints can have a large
      impact in these cases.
    - A helpful list of primitive type hint aliases can be found [here](https://clojure.org/reference/java_interop#TypeAliases).
* Use Clojure's high-performance [**primitive operations**](https://clojure.org/reference/java_interop#primitives).
* Follow Clojure's [**optimization tips**](https://clojure.org/reference/java_interop#optimization).
    - For example: `aget` is explicitly overloaded for primitive arrays, so type hinting is key here.
* Compare **gaming demos** [user-score](./demo/game/user_score.clj) and [user-score-opt](./demo/game/user_score_opt.clj);
  the latter is an optimized version of the former pipeline. (The optimized version is comparable in
  performance to the Java demo in the Beam source.)
* Be explicit about which **JVM/JDK version** is executing your code at runtime. More mature JVM versions
  perform better than earlier versions in many cases.
    - Note: Dataflow will pick a JVM/JDK version for your runtime/worker nodes based on the Java version you
      use to launch your pipeline!
* **Profile** your pipeline!
    - If deploying to GCP, use [Dataflow profiling](https://medium.com/google-cloud/profiling-dataflow-pipelines-ddbbef07761d)
      to zero in on areas to optimize.
* When in doubt or in a bind, you can always fall back to Java for sensitive code paths.
    - Note: This should rarely, if ever, be needed to achieve optimal performance.
* In general (this is not Clojure/thurber-specific), you should understand Beam "fusion" and when to **break fusion** to achieve
  greater linear scalability. More info [here](https://beam.apache.org/contribute/ptransform-style-guide/#performance).

## More Help

* Ask a question by [opening an issue](https://github.com/atdixon/thurber/issues?utf8=✓&q=is%3Aissue+label%3Aquestion+).
## References
* https://write.as/aaron-d/clojure-data-streaming-and-dodging-static-types
* https://tech.redplanetlabs.com/2020/09/02/clojure-faster/

## License

Copyright © 2020 Aaron Dixon

Like Clojure, distributed under the Eclipse Public License.