https://github.com/com-lihaoyi/geny

Provides the geny.Generator data type, the dual to a scala.Iterator that can ensure resource cleanup
Host: GitHub
URL: https://github.com/com-lihaoyi/geny
Owner: com-lihaoyi
License: other
Created: 2016-10-16T12:25:53.000Z (about 8 years ago)
Default Branch: main
Last Pushed: 2024-10-24T00:58:02.000Z (2 months ago)
Last Synced: 2024-12-16T01:02:40.792Z (12 days ago)
Topics: scala
Language: Scala
Homepage:
Size: 119 KB
Stars: 93
Watchers: 5
Forks: 25
Open Issues: 14
Metadata Files:
- Readme: Readme.adoc
- License: LICENSE

README

= Geny

:version: 1.1.1

:toc-placement: preamble

:toc:

:link-geny: https://github.com/com-lihaoyi/geny

:link-oslib: https://github.com/com-lihaoyi/os-lib

:link-upickle: https://github.com/com-lihaoyi/upickle

:link-scalatags: https://github.com/com-lihaoyi/scalatags

:link-requests: https://github.com/lihaoyi/requests-scala

:link-cask: https://github.com/com-lihaoyi/cask

:link-fastparse: https://github.com/com-lihaoyi/fastparse

:idprefix:

:idseparator: -

:example-scalatags-version: 0.12.0

[source,scala,subs="attributes,verbatim"]

----

// Mill

ivy"com.lihaoyi::geny:{version}"

ivy"com.lihaoyi::geny::{version}" // Scala.js / Native

// SBT

"com.lihaoyi" %% "geny" % "{version}"

"com.lihaoyi" %%% "geny" % "{version}" // Scala.js / Native

----

Geny is a small library that provides push-based versions of common standard

library interfaces:

* <<generator,`Generator`>>, a push-based version of `scala.Iterator[T]`
* <<writable,`Writable`>>, a push-based version of `java.io.InputStream`
* <<readable,`Readable`>>, a pull-based subclass of `Writable`

More background behind the `Writable` and `Readable` interface can be found in

this blog post:

* http://www.lihaoyi.com/post/StandardizingIOInterfacesforScalaLibraries.html[Standardizing IO Interfaces for Scala Libraries]

== `Generator`

`Generator` is basically the inverse of a `scala.Iterator`: instead of the core

functionality being the pull-based `hasNext` and `next: T` methods, the core is

based around the push-based `generate` method, which is similar to `foreach`

with some tweaks.

Unlike a `scala.Iterator`, subclasses of `Generator` can guarantee any clean

up logic is performed by placing it after the `generate` call is made. This is

useful for using ``Generator``s to model streaming data from files or other

sources that require cleanup: the most common alternative, `scala.Iterator`,

has no way of guaranteeing that the file gets properly closed after reading.

Even so-called "self-closing iterators" that close the file after the iterator

is exhausted fail to close the files if the developer uses `.head` or `.take`

to access the first few elements of the iterator, and never exhausts it.
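To make the mechanism concrete, here is a minimal sketch of a push-based generator, written from scratch rather than taken from geny's source. `MiniGenerator`, its `Boolean`-returning `generate`, and `MiniGeneratorDemo` are all hypothetical names chosen for illustration. Because the generator itself drives the loop, the `finally` block runs even when a consumer such as `take` stops early.

[source,scala]
----
// A deliberately simplified, hypothetical push-based generator (not geny's code).
trait MiniGenerator[+A] { self =>
  // Pushes elements into `handle` until it returns false or the source runs out.
  def generate(handle: A => Boolean): Unit

  // Lazy: only wraps `self`; nothing runs until `generate` is called.
  def take(n: Int): MiniGenerator[A] = new MiniGenerator[A] {
    def generate(handle: A => Boolean): Unit = {
      var seen = 0
      if (n > 0) self.generate { a =>
        seen += 1
        handle(a) && seen < n // ask the producer to stop after n elements
      }
    }
  }

  // Terminal operation: drives the generator to completion.
  def toList: List[A] = {
    val buf = List.newBuilder[A]
    generate { a => buf += a; true }
    buf.result()
  }
}

object MiniGenerator {
  def selfClosing[A](open: () => (Iterator[A], () => Unit)): MiniGenerator[A] =
    new MiniGenerator[A] {
      def generate(handle: A => Boolean): Unit = {
        val (iter, close) = open()
        try {
          var keepGoing = true
          while (keepGoing && iter.hasNext) keepGoing = handle(iter.next())
        } finally close() // runs even when the consumer stops early
      }
    }
}

object MiniGeneratorDemo {
  var closed = 0 // counts how many times the cleanup function ran
  val gen: MiniGenerator[Int] =
    MiniGenerator.selfClosing[Int](() => (Iterator(1, 2, 3, 4, 5), () => closed += 1))
}
----

With this shape, `MiniGeneratorDemo.gen.take(2).toList` yields the first two elements and still increments `closed`, which is the guarantee a pull-based iterator cannot offer.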

Although `geny.Generator` is not part of the normal collections hierarchy, the

API is intentionally modelled after that of `scala.Iterator` and should be

mostly drop-in, with conversion functions provided where you need to interact

with APIs using the standard Scala collections.

Geny is intentionally a tiny library with one file and zero dependencies,

so you can depend on it (or even copy-paste it into your project) without

fear of taking on unknown heavyweight dependencies.

=== Construction

The two simplest ways to construct a `Generator` are via the `+Generator(...)+`

and `Generator.from` constructors:

[source,scala]

----

import geny.Generator

scala> Generator(0, 1, 2)

res1: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2))

scala> Generator.from(Seq(1, 2, 3)) // pass in any iterable or iterator

res2: geny.Generator[Int] = Generator(List(1, 2, 3))

----

If you need a `Generator` for a source that needs cleanup (closing

file-handles, database connections, etc.) you can use the

`Generator.selfClosing` constructor:

[source,scala]

----

scala> class DummyCloseableSource {

     |   val iterator = Iterator(1, 2, 3, 4, 5, 6, 7, 8, 9)

     |   var closed = false

     |   def close() = {

     |     closed = true

     |   }

     | }

defined class DummyCloseableSource

scala> val g = Generator.selfClosing {

     |   val closeable = new DummyCloseableSource()

     |   (closeable.iterator, () => closeable.close())

     | }

g: geny.Generator[Int] = Gen.SelfClosing(...)

----

This constructor takes a block that will be called to generate a tuple of an

`Iterator[T]` and a cleanup function of type `+() => Unit+`. Each time the

`Generator` is evaluated:

* A new pair of `+(Iterator[T], () => Unit)+` is created using this block

* The iterator is used to generate however many elements are necessary

* The cleanup function is called
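The lifecycle in the list above can be observed with a pair of counters. This is a small sketch assuming geny is on the classpath; `SelfClosingLifecycle`, `opened` and `closed` are hypothetical names used only to watch when the block and the cleanup function run.

[source,scala]
----
import geny.Generator

object SelfClosingLifecycle {
  // Hypothetical counters, used only to observe the lifecycle.
  var opened = 0
  var closed = 0

  val gen: Generator[Int] = Generator.selfClosing[Int] {
    opened += 1 // the block runs once per evaluation
    (Iterator(1, 2, 3, 4, 5), () => closed += 1)
  }

  def main(args: Array[String]): Unit = {
    // Constructing and transforming the generator is lazy, so nothing has run yet.
    val firstTwo = gen.take(2)
    println((opened, closed))

    // Each terminal operation re-runs the block, then calls the cleanup function,
    // including when `take` stops before the iterator is exhausted.
    println(firstTwo.toSeq)
    println(gen.count())
    println((opened, closed))
  }
}
----

If the lifecycle is as described, the two counters stay equal after every terminal operation, regardless of how many elements were consumed.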

=== Terminal Operations

Transformations on a `Generator` are lazy: calling methods like `filter`
or `map` does not evaluate the entire `Generator`, but instead constructs a new
`Generator` that delegates to the original. The only methods that evaluate
the `Generator` are the "terminal operation" methods like
`foreach`/`find`, or the "conversion" methods like `toArray` or
similar. In this way, `Generator` behaves similarly to `Iterator`, whose
`map`/`filter` methods are also lazy until a terminal operation is called.

Terminal operations include the following:

[source,scala]

----

scala> Generator(0, 1, 2).toSeq

res3: Seq[Int] = ArrayBuffer(0, 1, 2)

scala> Generator(0, 1, 2).reduceLeft(_ + _)

res4: Int = 3

scala> Generator(0, 1, 2).foldLeft(0)(_ + _)

res5: Int = 3

scala> Generator(0, 1, 2).exists(_ == 3)

res6: Boolean = false

scala> Generator(0, 1, 2).count(_ > 0)

res7: Int = 2

scala> Generator(0, 1, 2).forall(_ >= 0)

res8: Boolean = true

----

Overall, they behave mostly the same as on the standard Scala collections.

Not every method is supported, but even those that aren't provided can easily

be re-implemented using `foreach` and the other methods available.
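As an illustration of that last point, here are two helpers built only from `foreach` and `foldLeft`. Both `lastOption` and `frequencies` are hypothetical names written for this sketch; geny may or may not offer equivalents of its own.

[source,scala]
----
import geny.Generator

object GeneratorExtras {
  // Hypothetical helper: returns the last element, if any, by remembering
  // the most recent value pushed through `foreach`.
  def lastOption[A](gen: Generator[A]): Option[A] = {
    var last: Option[A] = None
    gen.foreach(a => last = Some(a))
    last
  }

  // Hypothetical helper: builds a frequency table in a single pass.
  def frequencies[A](gen: Generator[A]): Map[A, Int] =
    gen.foldLeft(Map.empty[A, Int]) { (acc, a) =>
      acc.updated(a, acc.getOrElse(a, 0) + 1)
    }
}
----

Both helpers are terminal operations themselves: calling them evaluates the generator once, and any cleanup logic attached to it runs when they finish.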

=== Transformations

Transformations on a `Generator` are lazy: they do not immediately return a

result, and only build up a computation:

[source,scala]

----

scala> Generator(0, 1, 2).map(_ + 1)

res9: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2)).map()

scala> Generator(0, 1, 2).map { x => println(x); x + 1 }

res10: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2)).map()

----

This computation will be evaluated when one of the

<<terminal-operations,Terminal Operations>> described above is called:

[source,scala]

----

scala> res10.toSeq

0

1

2

res11: Seq[Int] = ArrayBuffer(1, 2, 3)

----

Most of the common operations on the Scala collections are supported:

[source,scala]

----

scala> (Generator(0, 1, 2).filter(_ % 2 == 0).map(_ * 2).drop(2) ++

       Generator(5, 6, 7).map(_.toString.toSeq).flatMap(x => x))

res12: geny.Generator[AnyVal] = Generator(WrappedArray(0, 1, 2)).filter().map().slice(2, 2147483647) ++ Generator(WrappedArray(5, 6, 7)).map().map()

scala> res12.toSeq

res13: Seq[AnyVal] = ArrayBuffer(5, 6, 7)

scala> Generator(0, 1, 2, 3, 4, 5, 6, 7, 8, 9).flatMap(i => i.toString.toSeq).takeWhile(_ != '6').zipWithIndex.filter(_._1 != '2')

res14: geny.Generator[(Char, Int)] = Generator(WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)).map().takeWhile().zipWithIndex.filter()

scala> res14.toVector

res15: Vector[(Char, Int)] = Vector((0,0), (1,1), (3,3), (4,4), (5,5))

----

As you can see, you can `flatMap`, `filter`, `map`, `drop`, `takeWhile`, `pass:c[++]`

and call other methods on the `Generator`, and it simply builds up the

computation without running it. Only when a terminal operation like

`toSeq` or `toVector` is called is it finally evaluated into a result.

Note that a `geny.Generator` is immutable, and is thus never exhausted.

However, it also does not perform any memoization or caching, and so calling

a terminal operation like `.toSeq` on a `Generator` multiple times will

evaluate any preceding transformations multiple times. If you do not want this

to be the case, call `.toSeq` to turn it into a concrete sequence and work with

that.
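The re-evaluation behaviour is easy to see with a side-effecting `map`. The sketch below assumes geny is on the classpath; `ReEvaluation` and its `evaluations` counter are hypothetical and exist only to count how often the transformation body runs.

[source,scala]
----
import geny.Generator

object ReEvaluation {
  var evaluations = 0 // hypothetical counter: how many times the `map` body ran

  val doubled: Generator[Int] = Generator(1, 2, 3).map { x =>
    evaluations += 1
    x * 2
  }

  def main(args: Array[String]): Unit = {
    doubled.toSeq
    doubled.toSeq
    println(evaluations) // the map body ran once per element for each pass

    val cached = doubled.toSeq // materialise once...
    println(cached.sum + cached.max)
    println(evaluations)       // ...so reusing `cached` adds only that one pass
  }
}
----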

=== Self Closing Generators

One major use case of `geny.Generator` is to ensure resources involved in

streaming results from some external source get properly cleaned up. For

example, `scala.io.Source` gives us a `scala.Iterator` over the
lines of a file, and you might define a helper function like this:

[source,scala]

----

// Assumes a `charSet: scala.io.Codec` value is in scope.
def getFileLines(path: String): Iterator[String] = {
  val s = scala.io.Source.fromFile(path)(charSet)
  s.getLines() // `s` is never closed
}

----

However, this is incorrect: you never close the source `s`, and thus if you

call this lots of times, you end up leaving tons of open file handles! If you

are lucky this will crash your program; if you are unlucky it will hang your

kernel and force you to reboot your computer.

One solution to this would be to simply not write helper functions: everyone

who wants to read from a file must instantiate the `scala.io.Source`

themselves and clean it up manually. This is a possible solution, but

is tedious and annoying. Another possible solution is to have the `Iterator`

close the `io.Source` itself when exhausted, but this still leaves open the

possibility that the caller will use `.head` or `.take` on the iterator: a

perfectly reasonable thing to do if you don't need all the output, but one

that would leave a "self-closing" iterator open and still leaking file handles.

Using ``geny.Generator``s, the helper function can instead return a

`Generator.selfClosing`:

[source,scala]

----

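import geny.Generator

// Assumes a `charSet: scala.io.Codec` value is in scope.
def getFileLines(path: String): geny.Generator[String] = Generator.selfClosing {
  val s = scala.io.Source.fromFile(path)(charSet)
  (s.getLines(), () => s.close()) // the iterator, plus the cleanup to run afterwards
}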

----

The caller can then use normal collection operations on the returned

`geny.Generator`: `map` it, `filter` it, `take`, `toSeq`, etc. and it will

always be properly opened when a terminal operation is called, the required

operations performed, and properly closed when everything is done.

== `Writable`

`geny.Writable` is a minimal interface that can be implemented by any data type

that writes binary output to a `java.io.OutputStream`:

[source,scala]

----

trait Writable {

  def writeBytesTo(out: OutputStream): Unit

}

----

`Writable` allows for zero-friction, zero-overhead streaming data exchange
between the libraries that support it, e.g. allowing you to pass Scalatags
``Frag``s directly to `os.write`:

[source,scala,subs="attributes,verbatim"]

----

@ import $ivy.`com.lihaoyi::scalatags:{example-scalatags-version}`, scalatags.Text.all._

import $ivy.$                             , scalatags.Text.all._

@ os.write(os.pwd / "hello.html", html(body(h1("Hello"), p("World!"))))

@ os.read(os.pwd / "hello.html")

res1: String = "<html><body><h1>Hello</h1><p>World!</p></body></html>"

----

Sending ``ujson.Value``s directly to `requests.post`:

[source,scala]

----

@ requests.post("https://httpbin.org/post", data = ujson.Obj("hello" -> 1))

@ res2.text

res3: String = """{

  "args": {},

  "data": "{\"hello\":1}",

  "files": {},

  "form": {},

...

----

Serialize Scala data types directly to disk:

[source,scala]

----

@ os.write(os.pwd / "two.json", upickle.default.stream(Map((1, 2) -> (3, 4), (5, 6) -> (7, 8))))

@ os.read(os.pwd / "two.json")

res5: String = "[[[1,2],[3,4]],[[5,6],[7,8]]]"

----

Or streaming file uploads over HTTP:

[source,scala]

----

@ requests.post("https://httpbin.org/post", data = os.read.stream(os.pwd / "two.json")).text

res6: String = """{

  "args": {},

  "data": "[[[1,2],[3,4]],[[5,6],[7,8]]]",

  "files": {},

  "form": {},

----

All this data exchange happens efficiently in a streaming fashion, without

unnecessarily buffering data in-memory.

`geny.Writable` also allows an implementation to ensure cleanup code runs after

all data has been written (e.g. closing file handles, freeing managed

resources) and is much easier to implement than `java.io.InputStream`.

Writable has implicit constructors from the following types:

* `String`

* `Array[Byte]`

* `java.io.InputStream`

And implemented by the following libraries:

* {link-upickle}[uPickle]: implemented by `ujson.Value`,

`upack.Msg`, and can be constructed from JSON-serializable data structures via

`upickle.default.stream` or `upickle.default.writableBinary`

* {link-scalatags}[Scalatags]: implemented by `scalatags.Text.Tag`

* {link-requests}[Requests-Scala]:
`+requests.get.stream(...)+` methods return a <<readable,`Readable`>> subtype of
<<writable,`Writable`>>
* https://github.com/lihaoyi/os-lib[OS-Lib]: `os.read.stream` returns a
<<readable,`Readable`>> subtype of <<writable,`Writable`>>
* https://github.com/lihaoyi/cask[Cask]: `cask.Request` implements
<<readable,`Readable`>>, a subtype of <<writable,`Writable`>>

And is accepted by the following libraries:

* {link-requests}[Requests-Scala] takes <<writable,`Writable`>> in the

`data =` field of `requests.post` and `requests.put`

* {link-oslib}[OS-Lib] accepts a <<writable,`Writable`>> in `os.write` and

the `stdin` parameter of `+os.proc(...).call+` or `+os.proc(...).spawn+`

* {link-cask}[Cask]: supports returning a <<writable,`Writable`>>

from any Cask endpoint

Any data type that writes bytes out to a `java.io.OutputStream`,

`java.io.Writer`, or `StringBuilder` can be trivially made to implement

<<writable,`Writable`>>, which allows it to output data in a streaming fashion without
needing to buffer it in memory. You can also implement <<writable,`Writable`>> in your own
data types or accept it in your own methods, if you want to inter-operate with

this existing ecosystem of libraries.
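As a sketch of what implementing and accepting the interface can look like, the example below defines a custom `Writable` and a method that takes one. It assumes geny is on the classpath; `CsvRows`, `CsvRowsDemo` and `render` are hypothetical names, not part of any of the libraries listed above.

[source,scala]
----
import java.io.{ByteArrayOutputStream, OutputStream}
import java.nio.charset.StandardCharsets

// Hypothetical data type for illustration: rows of a CSV document.
final case class CsvRows(rows: Seq[Seq[String]]) extends geny.Writable {
  // Writes one row at a time, so the document is never built as a single String.
  def writeBytesTo(out: OutputStream): Unit =
    rows.foreach { row =>
      out.write(row.mkString(",").getBytes(StandardCharsets.UTF_8))
      out.write('\n')
    }
}

object CsvRowsDemo {
  // Hypothetical method accepting any Writable. It collects the bytes in memory
  // purely so the result is easy to inspect; a real consumer would stream them.
  def render(data: geny.Writable): String = {
    val out = new ByteArrayOutputStream()
    data.writeBytesTo(out)
    out.toString("UTF-8")
  }

  def main(args: Array[String]): Unit = {
    print(render(CsvRows(Seq(Seq("id", "name"), Seq("1", "geny")))))
    print(render("plain strings work too\n")) // via the implicit String constructor
  }
}
----

Because `render` is typed against `geny.Writable` rather than a concrete class, it accepts `CsvRows`, a plain `String`, or any other type with a `Writable` implementation without further changes.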

== `Readable`

[source,scala]

----

trait Readable extends Writable {

  def readBytesThrough[T](f: InputStream => T): T

  def writeBytesTo(out: OutputStream): Unit = readBytesThrough(Internal.transfer(_, out))

}

----

`Readable` is a subtype of <<writable,`Writable`>> that provides an additional
guarantee: not only can it be written to a `java.io.OutputStream`, it can also

be read from by providing a `java.io.InputStream`. Note that the `InputStream`

is scoped and only available within the `readBytesThrough` callback: after that

the `InputStream` will be closed and associated resources (HTTP connections,

file handles, etc.) will be released.

`Readable` is supported by the following built-in types:

* `String`

* `Array[Byte]`

* `java.io.InputStream`

Implemented by the following libraries:

* {link-requests}[Requests-Scala]:
`+requests.get.stream(...)+` methods return a <<readable,`Readable`>>
* {link-oslib}[OS-Lib]: `os.read.stream` returns a
<<readable,`Readable`>>
* {link-cask}[Cask]: `cask.Request` implements <<readable,`Readable`>>

to allow streaming of request data

And is accepted by the following libraries:

* {link-upickle}[uPickle]: `upickle.default.read`,

`upickle.default.readBinary`, `ujson.read`, and `upack.read` all support

`Readable`

* {link-fastparse}[FastParse]: `fastparse.parse` can parse
streaming input from any `Readable`

`Readable` can be used to allow handling of streaming input, e.g. parsing JSON

directly from a file or HTTP request, without needing to buffer the whole file

in memory:

[source,scala]

----

@ val data = ujson.read(requests.get.stream("https://api.github.com/events"))

data: ujson.Value.Value = Arr(

  ArrayBuffer(

    Obj(

      LinkedHashMap(

        "id" -> Str("11169088214"),

        "type" -> Str("PushEvent"),

        "actor" -> Obj(

...

----

You can also implement `Readable` in your own data types, to allow them to be

seamlessly passed into uPickle or FastParse to be parsed in a streaming fashion.
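A custom `Readable` only needs to supply `readBytesThrough`; `writeBytesTo` comes from the trait, as shown in the definition above. The sketch below assumes geny is on the classpath; `Utf8Text`, `closedCount` and `Utf8TextDemo` are hypothetical names used for illustration.

[source,scala]
----
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets

// Hypothetical data type for illustration: opens a fresh stream on every read.
final class Utf8Text(text: String) extends geny.Readable {
  var closedCount = 0 // hypothetical counter, only here to observe cleanup

  def readBytesThrough[T](f: InputStream => T): T = {
    val in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)) {
      override def close(): Unit = { closedCount += 1; super.close() }
    }
    try f(in)          // the stream is only valid inside this callback
    finally in.close() // released whether `f` returns or throws
  }
}

object Utf8TextDemo {
  // Counts bytes by pulling from the scoped InputStream.
  def byteCount(r: geny.Readable): Long = r.readBytesThrough { in =>
    var n = 0L
    while (in.read() != -1) n += 1
    n
  }
}
----

The same `Utf8Text` value can also be handed to anything that accepts a `Writable`, since the inherited `writeBytesTo` is implemented in terms of `readBytesThrough`.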

Note that in exchange for the reduced memory usage, parsing streaming data via

`Readable` in uPickle or FastParse typically comes with a 20-40% CPU performance

penalty over parsing data already in memory, due to the additional book-keeping

necessary with streaming data. Whether it is worthwhile or not depends on your

particular usage pattern.

== Changelog

=== 1.1.1 - 2024-06-14

* Implement `.grouped` and `.sliding` operators

=== 1.1.0 - 2024-04-14

* Support for Scala-Native 0.5.0

* Minimum version of Scala 3 increased from 3.1.3 to 3.3.1

* Minimum version of Scala 2 increased from 2.11.x to 2.12.x

=== 1.0.0 - 2022-09-15

* Support Semantic Versioning

* Removed deprecated API

=== 0.7.1 - 2022-01-23

* Support Scala Native for Scala 3

=== 0.7.0 - 2021-12-10

_Re-release of 0.6.11_

=== Older Versions

==== 0.6.11 - 2021-11-26

* Add `httpContentType` to `inputStreamReadable`

* Improved Build and CI setup

* Added MiMa checks

==== 0.6.10 - 2021-05-14

* Add support for Scala 3.0.0

==== 0.6.9 - 2021-04-28

* Add support for Scala 3.0.0-RC3

==== 0.6.8 - 2021-04-28

* Add support for Scala 3.0.0-RC2

==== 0.6.4

* Scala-Native 0.4.0 support

==== 0.6.2

* Improve performance of writing small strings via `StringWritable`

==== 0.5.0

* Improve streaming of ``InputStream``s to ``OutputStream``s by dynamically sizing

the transfer buffer.

==== 0.4.2

* Standardize `geny.Readable` as well

==== 0.2.0

* Added <<writable,`Writable`>> interface

==== 0.1.8

* Support for Scala 2.13.0 final

==== 0.1.6 - 2019-01-15

* Add scala-native support

==== 0.1.5

* Add `.withFilter`

==== 0.1.4

* Add `.collect`, `.collectFirst`, `.headOption` methods

==== 0.1.3

* Allow calling `.count()` without a predicate to count the total number of items

in the generator

==== 0.1.2

* Add `.reduce`, `.fold`, `.sum`, `.product`, `.min`, `.max`, `.minBy`, `.maxBy`

* Rename `.fromIterable` to `.from`, make it also take ``Iterator``s

==== 0.1.1

* Publish for Scala 2.12.0

==== 0.1.0

* First release