Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/com-lihaoyi/geny

Provides the geny.Generator data type, the dual to a scala.Iterator that can ensure resource cleanup
https://github.com/com-lihaoyi/geny

scala

Last synced: 5 days ago
JSON representation

Provides the geny.Generator data type, the dual to a scala.Iterator that can ensure resource cleanup

Awesome Lists containing this project

README

        

= Geny
:version: 1.1.1
:toc-placement: preamble
:toc:
:link-geny: https://github.com/com-lihaoyi/geny
:link-oslib: https://github.com/com-lihaoyi/os-lib
:link-upickle: https://github.com/com-lihaoyi/upickle
:link-scalatags: https://github.com/com-lihaoyi/scalatags
:link-requests: https://github.com/lihaoyi/requests-scala
:link-cask: https://github.com/com-lihaoyi/cask
:link-fastparse: https://github.com/com-lihaoyi/fastparse
:idprefix:
:idseparator: -
:example-scalatags-version: 0.12.0

[source,scala,subs="attributes,verbatim"]
----
// Mill
ivy"com.lihaoyi::geny:{version}"
ivy"com.lihaoyi::geny::{version}" // Scala.js / Native

// SBT
"com.lihaoyi" %% "geny" % "{version}"
"com.lihaoyi" %%% "geny" % "{version}" // Scala.js / Native
----

Geny is a small library that provides push-based versions of common standard
library interfaces:

* <>, a push-based version of `scala.Iterator[T]`
* <>, a push-based version of `java.io.InputStream`
* <>, a pull-based subclass of `Writable`

More background behind the `Writable` and `Readable` interface can be found in
this blog post:

* http://www.lihaoyi.com/post/StandardizingIOInterfacesforScalaLibraries.html[Standardizing IO Interfaces for Scala Libraries]

== `Generator`

`Generator` is basically the inverse of a `scala.Iterator`: instead of the core
functionality being the pull-based `hasNext` and `next: T` methods, the core is
based around the push-based `generate` method, which is similar to `foreach`
with some tweaks.

Unlike a `scala.Iterator`, subclasses of `Generator` can guarantee any clean
up logic is performed by placing it after the `generate` call is made. This is
useful for using ``Generator``s to model streaming data from files or other
sources that require cleanup: the most common alternative, `scala.Iterator`,
has no way of guaranteeing that the file gets properly closed after reading.
Even so called "self-closing iterators" that close the file after the iterator
is exhausted fail to close the files if the developer uses `.head` or `.take`
to access the first few elements of the iterator, and never exhausts it.

Although `geny.Generator` is not part of the normal collections hierarchy, the
API is intentionally modelled after that of `scala.Iterator` and should be
mostly drop-in, with conversion functions provided where you need to interact
with APIs using the standard Scala collections.

Geny is intentionally a tiny library with one file and zero dependencies,
so you can depend on it (or even copy-paste it into your project) without
fear of taking on unknown heavyweight dependencies.

=== Construction

The two simplest ways to construct a `Generator` are via the `+Generator(...)+`
and `Generator.from` constructors:

[source,scala]
----
import geny.Generator

scala> Generator(0, 1, 2)
res1: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2))

scala> Generator.from(Seq(1, 2, 3)) // pass in any iterable or iterator
res2: geny.Generator[Int] = Generator(List(1, 2, 3))
----

If you need a `Generator` for a source that needs cleanup (closing
file-handles, database connections, etc.) you can use the
`Generator.selfClosing` constructor:

[source,scala]
----
scala> class DummyCloseableSource {
| val iterator = Iterator(1, 2, 3, 4, 5, 6, 7, 8, 9)
| var closed = false
| def close() = {
| closed = true
| }
| }
defined class DummyCloseableSource

scala> val g = Generator.selfClosing {
| val closeable = new DummyCloseableSource()
| (closeable.iterator, () => closeable.close())
| }
g: geny.Generator[Int] = Gen.SelfClosing(...)
----

This constructor takes a block that will be called to generate a tuple of an
`Iterator[T]` and a cleanup function of type `+() => Unit+`. Each time the
`Generator` is evaluated:

* A new pair of `+(Iterator[T], () => Unit)+` is created using this block
* The iterator is used to generate however many elements are necessary
* the cleanup function is called.

=== Terminal Operations

Transformations on a `Generator` are lazy: calling methods like `filter`
or `map` do not evaluate the entire Generator, but instead construct a new
Generator that delegates to the original. The only methods that evaluate
the `Generator` are the "terminal operation" methods like
`foreach`/`find`, or the "Conversion" methods like `toArray` or
similar. In this way, `Generator` behaves similarly to `Iterator`, whose
`map`/`filter` methods are also lazy until terminal oepration is called.

Terminal operations include the following:

[source,scala]
----
scala> Generator(0, 1, 2).toSeq
res3: Seq[Int] = ArrayBuffer(0, 1, 2)

scala> Generator(0, 1, 2).reduceLeft(_ + _)
res4: Int = 3

scala> Generator(0, 1, 2).foldLeft(0)(_ + _)
res5: Int = 3

scala> Generator(0, 1, 2).exists(_ == 3)
res6: Boolean = false

scala> Generator(0, 1, 2).count(_ > 0)
res7: Int = 2

scala> Generator(0, 1, 2).forall(_ >= 0)
res8: Boolean = true
----

Overall, they behave mostly the same as on the standard Scala collections.
Not every method is supported, but even those that aren't provided can easily
be re-implemented using `foreach` and the other methods available.

=== Transformations

Transformations on a `Generator` are lazy: they do not immediately return a
result, and only build up a computation:

[source,scala]
----
scala> Generator(0, 1, 2).map(_ + 1)
res9: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2)).map()

scala> Generator(0, 1, 2).map { x => println(x); x + 1 }
res10: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2)).map()
----

This computation will be evaluated when one of the
<>s described above is called:

[source,scala]
----
scala> res10.toSeq
0
1
2
res11: Seq[Int] = ArrayBuffer(1, 2, 3)
----

Most of the common operations on the Scala collections are supported:

[source,scala]
----
scala> (Generator(0, 1, 2).filter(_ % 2 == 0).map(_ * 2).drop(2) ++
Generator(5, 6, 7).map(_.toString.toSeq).flatMap(x => x))
res12: geny.Generator[AnyVal] = Generator(WrappedArray(0, 1, 2)).filter().map().slice(2, 2147483647) ++ Generator(WrappedArray(5, 6, 7)).map().map()

scala> res12.toSeq
res13: Seq[AnyVal] = ArrayBuffer(5, 6, 7)

scala> Generator(0, 1, 2, 3, 4, 5, 6, 7, 8, 9).flatMap(i => i.toString.toSeq).takeWhile(_ != '6').zipWithIndex.filter(_._1 != '2')
res14: geny.Generator[(Char, Int)] = Generator(WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)).map().takeWhile().zipWithIndex.filter()

scala> res14.toVector
res15: Vector[(Char, Int)] = Vector((0,0), (1,1), (3,3), (4,4), (5,5))
----

As you can see, you can `flatMap`, `filter`, `map`, `drop`, `takeWhile`, `pass:c[++]`
and call other methods on the `Generator`, and it simply builds up the
computation without running it. Only when a terminal operation like
`toSeq` or `toVector` is called is it finally evaluated into a result.

Note that a `geny.Generator` is immutable, and is thus never exhausted.
However, it also does not perform any memoization or caching, and so calling
a terminal operation like `.toSeq` on a `Generator` multiple times will
evaluate any preceding transformations multiple times. If you do not want this
to be the case, call `.toSeq` to turn it into a concrete sequence and work with
that.

=== Self Closing Generators

One major use case of `geny.Generator` is to ensure resources involved in
streaming results from some external source get properly cleaned up. For
example, using `scala.io.Source`, we can get a `scala.Iterator` over the
lines of a file. For example, you may define a helper function like this:

[source,scala]
----
def getFileLines(path: String): Iterator[String] = {
val s = scala.io.Source.fromFile(path)(charSet)
s.getLines()
}
----

However, this is incorrect: you never close the source `s`, and thus if you
call this lots of times, you end up leaving tons of open file handles! If you
are lucky this will crash your program; if you are unlucky it will hang your
kernel and force you to reboot your computer.

One solution to this would be to simply not write helper functions: everyone
who wants to read from a file must instantiate the `scala.io.Source`
themselves, and manually cleanup themselves. This is a possible solution, but
is tedious and annoying. Another possible solution is to have the `Iterator`
close the `io.Source` itself when exhausted, but this still leaves open the
possibility that the caller will use `.head` or `.take` on the iterator: a
perfectly reasonable thing to do if you don't need all the output, but one
that would leave a "self-closing" iterator open and still leaking file handles.

Using ``geny.Generator``s, the helper function can instead return a
`Generator.selfClosing`:

[source,scala]
----
def getFileLines(path: String): geny.Generator[String] = Generator.selfClosing {
val s = scala.io.Source.fromFile(path)(charSet)
(s.getLines(), () => s.close())
}
----

The caller can then use normal collection operations on the returned
`geny.Generator`: `map` it, `filter` it, `take`, `toSeq`, etc. and it will
always be properly opened when a terminal operation is called, the required
operations performed, and properly closed when everything is done.

== `Writable`

`geny.Writable` is a minimal interface that can be implemented by any data type
that writes binary output to a `java.io.OutputStream`:

[source,scala]
----
trait Writable {
def writeBytesTo(out: OutputStream): Unit
}
----

`Writable` allows for zero-friction zero-overhead streaming data exchange
between these libraries, e.g. allowing you pass Scalatags ``Frag``s directly
`os.write`:

[source,scala,subs="attributes,verbatim"]
----
@ import $ivy.`com.lihaoyi::scalatags:{example-scalatags-version}`, scalatags.Text.all._
import $ivy.$ , scalatags.Text.all._

@ os.write(os.pwd / "hello.html", html(body(h1("Hello"), p("World!"))))

@ os.read(os.pwd / "hello.html")
res1: String = "

Hello

World!

"
----

Sending ``ujson.Value``s directly to `requests.post`

[source,scala]
----
@ requests.post("https://httpbin.org/post", data = ujson.Obj("hello" -> 1))

@ res2.text
res3: String = """{
"args": {},
"data": "{\"hello\":1}",
"files": {},
"form": {},
...
----

Serialize Scala data types directly to disk:

[source,scala]
----
@ os.write(os.pwd / "two.json", upickle.default.stream(Map((1, 2) -> (3, 4), (5, 6) -> (7, 8))))

@ os.read(os.pwd / "two.json")
res5: String = "[[[1,2],[3,4]],[[5,6],[7,8]]]"
----

Or streaming file uploads over HTTP:

[source,scala]
----
@ requests.post("https://httpbin.org/post", data = os.read.stream(os.pwd / "two.json")).text
res6: String = """{
"args": {},
"data": "[[[1,2],[3,4]],[[5,6],[7,8]]]",
"files": {},
"form": {},
----

All this data exchange happens efficiently in a streaming fashion, without
unnecessarily buffering data in-memory.

`geny.Writable` also allows an implementation to ensure cleanup code runs after
all data has been written (e.g. closing file handles, free-ing managed
resources) and is much easier to implement than `java.io.InputStream`.

Writable has implicit constructors from the following types:

* `String`
* `Array[Byte]`
* `java.io.InputStream`

And implemented by the following libraries:

* {link-upickle}[uPickle]: implemented by `ujson.Value`,
`upack.Msg`, and can be constructed from JSON-serializable data structures via
`upickle.default.stream` or `upickle.default.writableBinary`
* {link-scalatags}[Scalatags]: implemented by `scalatags.Text.Tag`
* {link-requests}[Requests-Scala]:
`+requests.get.stream(...)+` methods return a <> subtype of
<>
* https://github.com/lihaoyi/os-lib[OS-Lib]: `os.read.stream` returns a
<> subtype of <>
* https://github.com/lihaoyi/cask[Cask]: `cask.Request` returns a
<> subtype of <>

And is accepted by the following libraries:

* {link-requests}[Requests-Scala] takes <> in the
`data =` field of `requests.post` and `requests.put`
* {link-oslib}[OS-Lib] accepts a <> in `os.write` and
the `stdin` parameter of `subprocess.call` or `subprocess.spawn`
* {link-cask}[Cask]: supports returning a <>
from any Cask endpoint

Any data type that writes bytes out to a `java.io.OutputStream`,
`java.io.Writer`, or `StringBuilder` can be trivially made to implement
<>, which allows it to output data in a streaming fashion without
needing to buffer it in memory. You can also implement <>s in your own
datatypes or accept it in your own method, if you want to inter-operate with
this existing ecosystem of libraries.

== `Readable`

[source,scala]
----
trait Readable extends Writable {
def readBytesThrough[T](f: InputStream => T): T
def writeBytesTo(out: OutputStream): Unit = readBytesThrough(Internal.transfer(_, out))
}
----

`Readable` is a subtype of <> that provides an additional
guarantee: not only can it be written to an `java.io.OutputStream`, it can also
be read from by providing a `java.io.InputStream`. Note that the `InputStream`
is scoped and only available within the `readBytesThrough` callback: after that
the `InputStream` will be closed and associated resources (HTTP connections,
file handles, etc.) will be released.

`Readable` is supported by the following built in types:

* `String`
* `Array[Byte]`
* `java.io.InputStream`

Implemented by the following libraries

* {link-requests}[Requests-Scala]:
`+requests.get.stream(...)+` methods return a <>
* {link-oslib}[OS-Lib]: `os.read.stream` returns a
<>
* {link-cask}[Cask]: `cask.Request` implements <>
to allow streaming of request data

And is accepted by the following libraries:

* {link-upickle}[uPickle]: `upickle.default.read`,
`upickle.default.readBinary`, `ujson.read`, and `upack.read` all support
`Readable`
* {link-fastparse}[FastParse]: `fastparse.parse` accepts
parsing streaming input from any `Readable`

`Readable` can be used to allow handling of streaming input, e.g. parsing JSON
directly from a file or HTTP request, without needing to buffer the whole file
in memory:

[source,scala]
----
@ val data = ujson.read(requests.get.stream("https://api.github.com/events"))
data: ujson.Value.Value = Arr(
ArrayBuffer(
Obj(
LinkedHashMap(
"id" -> Str("11169088214"),
"type" -> Str("PushEvent"),
"actor" -> Obj(
...
----

You can also implement `Readable` in your own data types, to allow them to be
seamlessly passed into uPickle or FastParse to be parsed in a streaming fashion.

Note that in exchange for the reduced memory usage, parsing streaming data via
`Readable` in uPickle or FastParse typically comes with a 20-40% CPU performance
penalty over parsing data already in memory, due to the additional book-keeping
necessary with streaming data. Whether it is worthwhile or not depends on your
particular usage pattern.

== Changelog

=== 1.1.1 - 2024-06-14

* Implement `.grouped` and `.sliding` operators

=== 1.1.0 - 2024-04-14

* Support for Scala-Native 0.5.0
* Minimum version of Scala 3 increased from 3.1.3 to 3.3.1
* Minimum version of Scala 2 increased from 2.11.x to 2.12.x

=== 1.0.0 - 2022-09-15

* Support Semantic Versioning
* Removed deprecated API

=== 0.7.1 - 2022-01-23

* Support Scala Native for Scala 3

=== 0.7.0 - 2021-12-10

_Re-release of 0.6.11_

=== Older Versions

==== 0.6.11 - 2021-11-26

* Add `httpContentType` to `inputStreamReadable`
* Improved Build and CI setup
* Added MiMa checks

==== 0.6.10 - 2021-05-14

* Add support for Scala 3.0.0

==== 0.6.9 - 2021-04-28

* Add support for Scala 3.0.0-RC3

==== 0.6.8 - 2021-04-28

* Add support for Scala 3.0.0-RC2

==== 0.6.4

* Scala-Native 0.4.0 support

==== 0.6.2

* Improve performance of writing small strings via `StringWritable`

==== 0.5.0

* Improve streaming of ``InputStream``s to ``OutputStream``s by dynamically sizing
the transfer buffer.

==== 0.4.2

* Standardize `geny.Readable` as well

==== 0.2.0

* Added <> interface

==== 0.1.8

* Support for Scala 2.13.0 final

==== 0.1.6 - 2019-01-15

* Add scala-native support

==== 0.1.5

* Add `.withFilter`

==== 0.1.4

* Add `.collect`, `.collectFirst`, `.headOption` methods

==== 0.1.3

* Allow calling `.count()` without a predicate to count the total number of items
in the generator

==== 0.1.2

* Add `.reduce`, `.fold`, `.sum`, `.product`, `.min`, `.max`, `.minBy`, `.maxBy`
* Rename `.fromIterable` to `.from`, make it also take ``Iterator``s

==== 0.1.1

* Publish for Scala 2.12.0

==== 0.1.0

* First release