{"id":18656730,"url":"https://github.com/zendesk/scala-flow","last_synced_at":"2026-01-14T02:29:15.161Z","repository":{"id":57730033,"uuid":"82074782","full_name":"zendesk/scala-flow","owner":"zendesk","description":"A lightweight library intended to make developing Google DataFlow jobs in Scala easier.","archived":true,"fork":false,"pushed_at":"2020-08-26T21:03:50.000Z","size":84,"stargazers_count":14,"open_issues_count":3,"forks_count":1,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-04-11T18:38:24.742Z","etag":null,"topics":["dataflow","scala"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zendesk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-02-15T15:23:13.000Z","updated_at":"2024-11-23T16:50:44.000Z","dependencies_parsed_at":"2022-09-05T22:41:06.352Z","dependency_job_id":null,"html_url":"https://github.com/zendesk/scala-flow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zendesk/scala-flow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zendesk%2Fscala-flow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zendesk%2Fscala-flow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zendesk%2Fscala-flow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zendesk%2Fscala-flow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zendesk","download_url":"https://codeload.github.com/zendesk/scala-flow/tar.gz/refs/heads/mast
er","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zendesk%2Fscala-flow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28408711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T01:52:23.358Z","status":"online","status_checked_at":"2026-01-14T02:00:06.678Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataflow","scala"],"created_at":"2024-11-07T07:24:59.221Z","updated_at":"2026-01-14T02:29:15.146Z","avatar_url":"https://github.com/zendesk.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scala-flow\n![repo-checks](https://github.com/zendesk/scala-flow/workflows/repo-checks/badge.svg)\n\n_scala-flow_ is a lightweight library intended to make developing Google DataFlow jobs in Scala easier. The core dataflow classes are enriched to allow more idiomatic and concise Scala usage while preserving full access to the underlying Java SDK.\n    \nCoders for Scala primitives and collection classes have been implemented so that you can conveniently return these types from your PTransforms. 
In addition you can easily create coders for your own case classes.\n\n**Caveat:** This library is still evolving rapidly as we improve our knowledge and understanding of Dataflow, so there will be some flux in the API as we discover and refine what works well and what doesn't.\n    \nAs a preview of what's possible, here's the eponymous MinimalWordCount example:\n\n```scala     \nPipeline.create(...)\n  .apply(TextIO.Read.from(\"gs://dataflow-samples/shakespeare/kinglear.txt\"))\n  .flatMap(_.split(\"\\\\W+\").filter(_.nonEmpty).toIterable)\n  .apply(Count.perElement[String])\n  .map(kv =\u003e kv.getKey + \": \" + kv.getValue)\n  .apply(TextIO.Write.to(\"results.text\"))\n  .run()\n```\n    \n## Usage\n\n#### Pipeline\n\n`Pipeline` has been enriched with a handful of methods.\n\nTo create a `PCollection` from in-memory data, use the `transform` method instead of `apply`. This method ensures that the coder is set correctly on the input data. \n \nIn addition, a `run` method has been added to the `POutput` type, so that you can fluently chain transforms and then run your pipeline. For example:\n    \n```scala\nval result = Pipeline.create(...)\n  .transform(Create.of(\"foo\", \"bar\"))\n  .apply(...transforms...)\n  .run()\n```\n  \n#### Basic PCollection Methods\n  \n`PCollection` now has `map`, `flatMap`, `filter` and `collect` methods that each behave as you would expect.\n    \nSimple example:\n\n```scala\nval result = Pipeline.create(...)\n  .transform(Create.of(\"123\", \"456\", \"789\"))\n  .flatMap(_.split(\"\"))\n  .map(_.toInt)\n  .filter(_ \u003c 5)\n  .collect {\n    case x if x % 3 == 0 =\u003e if (x % 5 == 0) \"FizzBuzz\" else \"Fizz\"\n    case x if x % 5 == 0 =\u003e \"Buzz\"\n  }\n  .run()\n```\n\n#### PCollection Extras\n\n##### Logging Side Effect\n\nA side-effecting method `foreach` has been added to allow handy debug logging. 
This method supplies each element of the `PCollection` to its argument then passes on the element unchanged.\nFor example:\n\n```scala\nval result = Pipeline.create(...)\n  .transform(Create.of(\"123\", \"456\", \"789\"))\n  .foreach(println)\n  .apply(...continue as normal...) \n```\n\n##### Extracting Timestamps\n\n`extractTimestamp` converts each element in the PCollection to a tuple with its corresponding timestamp. For example:\n```scala\nval collection: PCollection[(String, Instant)] = Pipeline.create(...)\n  .transform(Create.of(\"foo\", \"bar\"))\n  .withTimestamps\n```\n\n##### Converting to a `KV`\n\nThe `withKey` method provides a drop-in replacement for the `WithKeys` transform.\n\n##### Merging PCollections of the same type\n\nThe `flattenWith` method is equivalent to the `Flatten` transform, allowing collections of the same type to be merged together. For example:\n\n```scala\nval first: PCollection[String] = ...\nval second: PCollection[String] = ...\nval third: PCollection[String] = ...\n  \nval combined: PCollection[String] = first.flattenWith(second, third)\n```\n  \n##### Naming your transforms\n\nTo provide better visualization of the Pipeline graph and to allow updating of running jobs, you can name blocks of transforms using the `transformWith` method. For example:\n\n```scala\n  val result = Pipeline.create(...)\n    .transformWith(\"Load Resources\") { _\n      .apply(TextIO.Read.from(\"gs://dataflow-samples/shakespeare/kinglear.txt\"))\n    }\n    .transformWith(\"Split and count Words\") { _\n      .flatMap(_.split(\"\\\\W+\").filter(_.nonEmpty).toIterable)\n      .apply(Count.perElement[String])\n    }\n    .transformWith(\"Output Results\") { _\n      .map(kv =\u003e kv.getKey + \": \" + kv.getValue)\n      .apply(TextIO.Write.to(\"results.text\"))\n    }\n    .run()\n```\n \nUnder the hood this method simply converts each nested block of methods into a `PTransform` class. 
\n\n##### ParDo Escape Hatch\n\nThe `parDo` method provides an escape hatch in case none of the existing methods do what you want. Pass any arbitrary function with a `DoFn` signature to this method and it will be converted to a `ParDo` transform. For example:\n```scala\nPipeline.create(...)\n  .apply(TextIO.Read.from(\"gs://dataflow-samples/shakespeare/kinglear.txt\"))\n  .parDo { (c: DoFn[String, String]#ProcessContext) =\u003e\n    /* Do anything whatsoever here */\n    c.output(...)\n  }\n```\n\n#### KV Collection\n\nSeveral methods have been added specifically for KV collections:\n\nThe `mapValue` and `flatMapValue` methods allow you to change the value of a `KV` pair without affecting the key. For example:\n```scala\nval result = Pipeline.create(...)\n  .transform(Create.of(\"123\", \"456\", \"789\"))\n  .withKey(_.toInt)\n  .mapValue(_.split(\"\"))\n  .flatMapValue(_.mkString(\".\"))\n\n/* Result contains KV(123, \"1.2.3\"), KV(456, \"4.5.6\"), KV(789, \"7.8.9\") */  \n```\n\nIn addition there are `combinePerKey`, `topPerKey` and `groupPerKey` methods that work exactly the same as the Dataflow transform equivalents. 
\n\n#### Joining KV PCollections\n\nTo join two or more collections of `KV` values by key you can use `coGroupByKey`, a type-safe wrapper around Dataflow's [`CoGroupByKey`](https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/join/CoGroupByKey) transform.\n\n```scala\nval buyOrders: PCollection[KV[CustomerId, BuyOrder]] = ...\nval sellOrders: PCollection[KV[CustomerId, SellOrder]] = ...\n\nval allOrders: PCollection[KV[CustomerId, (Iterable[BuyOrder], Iterable[SellOrder])]] = buyOrders.coGroupByKey(sellOrders)\n```\n\n### Coders\n\nImplicit coders for the following types have been added:\n  \n  * `Int`, `Long`, `Double`\n  * `Option`, `Try`, `Either`\n  * `Tuple2` to `Tuple22`\n  * `Iterable`, `List`, `Set`, `Map`, `Array`\n  \nEvery method mentioned above requires a coder for its output type to be implicitly available. This happens by default for any of the types listed above (and also any arbitrary combination, e.g. `List[Option[(Either[String, Int], Array[Double])]]`).\nIf you create coders for any other types then you'll need to ensure that they are available somewhere in the implicit scope.\n\n#### Case Class Coders\n\nYou can create a custom coder for any case class containing up to 22 members using the `caseClassCoder` method. For example:\n```scala\ncase class Foo(name: String)\ncase class Bar(name: String, age: Int)\ncase class Qux[T](value : T)\n\nimplicit val fooCoder = caseClassCoder(Foo)\nimplicit val barCoder = caseClassCoder(Bar)\nimplicit def quxCoder = caseClassCoder(Qux.apply[T] _)\n```\n\nThe last line demonstrates how to create a coder for a generic type; this is essentially a much simpler replacement for a `CoderFactory`.\n\n#### Serializable Coder\n\nBy default Dataflow will always try to create a `SerializableCoder` if no other suitable coder can be found. `scala-flow` provides an equivalent with the `serializableCoder` method. 
For example:\n```scala\nclass Foo(val name: String) extends Serializable\n \nimplicit val fooCoder = serializableCoder(Foo) \n```\n\n## Why create a Scala Dataflow library? \n\nThere are already some existing libraries for working with Dataflow: \n  * [Apache Beam](https://beam.apache.org): Supports not only Dataflow, but also Spark, Apex and Flink. \n  * [Scio](https://github.com/spotify/scio): Spotify have developed this excellent and extensive Scala library.   \n\nWe initially used Beam directly but quickly found that the complex nature of the Java API (particularly around type erasure) made Scala interop tricky. \nWe then evaluated Scio, but while we were learning the complex Dataflow concepts we wanted something that was very lightweight and that kept us very close to the API. \nHence this library, which we feel fits in a niche between the two libraries above. \n\n## Roadmap\n\n* Create a version of each method that accepts a name to better support updating pipelines \n* Switch underlying support to Apache Beam\n\n## Credits\n\nThe case class coder approach was heavily inspired by [Spray Json](https://github.com/spray/spray-json), a really nice, lightweight JSON parser. \n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at\nhttps://github.com/zendesk/scala-flow/\n\n## Copyright and license\n\nCopyright 2017 Zendesk, Inc.\n\nLicensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License.\n\nYou may obtain a copy of the License at\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzendesk%2Fscala-flow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzendesk%2Fscala-flow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzendesk%2Fscala-flow/lists"}