https://github.com/mattlianje/etl4s

Powerful, whiteboard-style ETL
https://github.com/mattlianje/etl4s
etl functional-programming scala streaming
Last synced: 19 days ago
JSON representation
Powerful, whiteboard-style ETL
Host: GitHub
URL: https://github.com/mattlianje/etl4s
Owner: mattlianje
License: gpl-3.0
Created: 2024-12-07T04:51:11.000Z (12 months ago)
Default Branch: master
Last Pushed: 2025-04-04T23:50:06.000Z (8 months ago)
Last Synced: 2025-04-05T00:26:45.442Z (8 months ago)
Topics: etl, functional-programming, scala, streaming
Language: Scala
Homepage:
Size: 1.72 MB
Stars: 46
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          


  



#  etl4s

**Powerful, whiteboard-style ETL**

A lightweight, zero-dependency library for writing type-safe, beautiful ✨🍰  data flows in functional Scala. 

Battle-tested at [Instacart](https://www.instacart.com/). Part of [d4](https://github.com/mattlianje/d4). ([Full Documentation](https://mattlianje.github.io/etl4s/))



  

  _{Sponsored by Bit Complete}



## Features

- Declarative, typed pipeline endpoints

- Use **Etl4s.scala** like a header file

- Type-safe, compile-time checked

- [Config-driven](#configuration) by design

- Easy, monadic composition of pipelines

- Built-in retry/failure handling

- Automatic [tracing](#introspection-with-etl4strace)

- Drop-in [telemetry](#telemetry)

- [Data lineage](#lineage) visualization

## Installation

**etl4s** is on MavenCentral and cross-built for Scala, 2.12, 2.13, 3.x

```scala

"xyz.matthieucourt" %% "etl4s" % "1.6.0"

```

Or try in REPL:

```bash

scala-cli repl --scala 3 --dep xyz.matthieucourt:etl4s_3:1.6.0

```

All you need:

```scala

import etl4s._

```

## Quick Example

```scala

import etl4s._

val getUser  = Extract("John Doe")

val getOrder = Extract("Order #1234")

val combine  = Transform[(String, String), String] { case (user, order) => 

  s"$user placed $order" 

}

val saveDb    = Load[String, String](s => { println(s"DB: $s"); s })

val sendEmail = Load[String, Unit](s => println(s"Email: $s"))

val pipeline = (getUser & getOrder) ~> combine ~> (saveDb & sendEmail)

pipeline(())

```

## Documentation 

[Full Documentation](https://mattlianje.github.io/etl4s/) - Detailed guides, API references, and examples

## Of note...

- Ultimately - these nodes and pipelines are just reifications of functions and values (with a few niceties like built in retries, failure handling, concurrency-shorthand, and Future based parallelism).

- Chaotic, framework/infra-coupled ETL codebases that grow without an imposed discipline drive dev-teams and data-orgs to their knees.

- **etl4s** is a little DSL to enforce discipline, type-safety and re-use of pure functions - and see [functional ETL](https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a) for what it is... and could be.

## Core Concepts

**etl4s** has one core building block:

```scala

Node[-In, +Out]

```

A Node wraps a lazily-evaluated function `In => Out`. Chain them with `~>` to build pipelines.

To improve readability and express intent, **etl4s** defines four aliases: `Extract`, `Transform`, `Load` and `Pipeline`. All behave the same under the hood.

```scala

val step = Transform[String, Int](_.length)

step("hello")  // 5

```

**Running pipelines:**

- `pipeline(input)` - call like a function

- `.unsafeRun(input)` - explicit run

- `.safeRun(input)` - returns `Try[Out]`

- `.unsafeRunTrace(input)` - returns `Trace` (logs, timing, errors)

- `.safeRunTrace(input)` - returns `Trace` with `Try[Out]`

**DI:** Use `.requires` to turn any Node into a `Reader[Config, Node]`. The `~>` operator works between Nodes and Readers. See [Configuration](#configuration).

## Type safety

**etl4s** won't let you chain together "blocks" that don't fit together:

```scala

 val fiveExtract: Extract[Unit, Int]        = Extract(5)

 val exclaim:     Transform[String, String] = Transform(_ + "!")

 fiveExtract ~> exclaim

```

The above will not compile with:

```shell

-- [E007] Type Mismatch Error: -------------------------------------------------

4 | fiveExtract ~> exclaim

  |                ^^^^^^^

  |                Found:    (exclaim : Transform[String, String])

  |                Required: Node[Int, Any]

```

## Operators

etl4s uses a few simple operators to build pipelines:

| Operator | Name | Description | Example |

|----------|------|-------------|---------|

| `~>` | Connect | Chains operations in sequence | `e1 ~> t1 ~> l1` |

| `&` | Combine | Group sequential operations with same input | `t1 & t2` |

| `&>` | Parallel | Group concurrent operations with same input | `t1 &> t2` |

| `>>` | Sequence | Runs pipelines in order (ignoring previous output) | `p1 >> p2` |

## Configuration

Declare what each step `.requires`, then `.provide` it later:

```scala

import etl4s._

case class Cfg(key: String)

val A = Extract("data")

val B = Transform[String, String].requires[Cfg] { cfg => data =>

  s"${cfg.key}: $data"

}

val pipeline = A ~> B

pipeline.provide(Cfg("secret")).unsafeRun(())  /* "secret: data" */

/** NOTE (Scala 2.x)

  * Use: `Node.requires[Cfg, In, Out](cfg => in => out)` syntax

  */

```

**etl4s** automatically infers the smallest shared config for your pipeline. Just `.provide` once.

Read more [here](https://mattlianje.github.io/etl4s/config/)

## Parallelizing Tasks

**etl4s** has an elegant shorthand for grouping and parallelizing operations that share the same input type:

```scala

/* Simulate slow IO operations (e.g: DB calls, API requests) */

val e1 = Extract { Thread.sleep(100); 42 }

val e2 = Extract { Thread.sleep(100); "hello" }

val e3 = Extract { Thread.sleep(100); true }

```

Sequential run of e1, e2, and e3 **(~300ms total)**

```scala

val sequential: Extract[Unit, ((Int, String), Boolean)] =

     e1 & e2 & e3

```

Parallel run of e1, e2, e3 on their own JVM threads with Scala Futures **(~100ms total, same result, 3X faster)**

```scala

import scala.concurrent.ExecutionContext.Implicits.global

val parallel: Extract[Unit, ((Int, String), Boolean)] =

     e1 &> e2 &> e3

```

Use the built-in zip method to flatten unwieldly nested tuples:

```scala

val clean: Extract[Unit, (Int, String, Boolean)] =

     (e1 & e2 & e3).zip

```

Mix sequential and parallel execution (First two parallel (~100ms), then third (~100ms)):

```scala

val mixed = (e1 &> e2) & e3

```

Full example of a parallel pipeline:

```scala

val consoleLoad: Load[String, Unit] = Load(println(_))

val dbLoad:      Load[String, Unit] = Load(x => println(s"DB Load: ${x}"))

val merge = Transform[(Int, String, Boolean), String] { t => 

    val (i, s, b) = t

    s"$i-$s-$b"

  }

val pipeline =

  (e1 &> e2 &> e3).zip ~> merge ~> (consoleLoad &> dbLoad)

```

## Handling Failures

#### `withRetry`

Retry failed operations:

```scala

import etl4s._

var n = 0

val A = Transform[Int, String] { x =>

  n += 1

  if (n < 3) throw new RuntimeException("fail")

  else "ok"

}.withRetry(maxAttempts = 3, initialDelayMs = 10)

Extract(42) ~> A  /* Succeeds on 3rd attempt */

```

#### `onFailure`

Catch exceptions and recover:

```scala

import etl4s._

val A = Extract[Unit, String](_ => throw new RuntimeException("Boom!"))

  .onFailure(e => s"Error: ${e.getMessage}")

A.unsafeRun(())  /* Returns "Error: Boom!" */

```

## Side Outputs

The `tap` method performs side effects without disrupting pipeline flow:

```scala

import etl4s._

val A = Extract((_: Unit) => List("a.txt", "b.txt"))

           .tap(files => println(s"Cleanup: $files"))

val B = Transform[List[String], Int](_.size)

A ~> B

```

## Tracing

Nodes can access and update their runtime state with ThreadLocal channels spawned for free. All state is automatically shared across your entire pipeline. Read more [here](https://mattlianje.github.io/etl4s/trace/)

```scala

val A = Transform[String, Int] { s =>

  if (s.isEmpty) Trace.error("empty")

  s.length

}

val B = Transform[Int, String] { n =>

  if (Trace.hasErrors) "FALLBACK" else s"len: $n"  

}

(A ~> B).unsafeRun("")  /* "FALLBACK" */

```

## Telemetry

etl4s provides a minimal `Etl4sTelemetry` interface for observability. All pipeline run methods automatically look for this interface in implicit scope.

`Tel` is etl4s's telemetry API object with the same method names as the trait for consistency. All `Tel` calls are no-ops by default - zero overhead until you provide an implementation.

```scala

val A = Transform[List[String], Int] { data =>

  Tel.withSpan("op") {

    Tel.addCounter("n", data.size)

    Tel.setGauge("v", data.size.toDouble)

    data.map(_.length).sum

  }

}

/* By default Tel calls are no-ops (zero cost) */

A.unsafeRun(data)

/* Implement Etl4sTelemetry for your backend */

implicit val telemetry: Etl4sTelemetry = MyPrometheusProvider()

A.unsafeRun(data) /* metrics flow to Prometheus */

```

The `Etl4sTelemetry` interface has just 4 methods: `withSpan`, `addCounter`, `setGauge`, `recordHistogram`

which cover 95% of observability needs. Read more in the [Telemetry guide](https://mattlianje.github.io/etl4s/opentelemetry/).

## Lineage

Track data lineage and visualize pipeline dependencies. Attach metadata to any Node or Reader then call `.toDot`, `.toJson` or `.toMermaid`

on individual instances or on Sequences:

```scala

val A = Node[String, String](identity)

  .lineage(

    name = "A",

    inputs = List("s1", "s2"),

    outputs = List("s3"), 

    schedule = Some("0 */2 * * *")

  )

val B = Node[String, String](identity)

  .lineage(

    name = "B",

    inputs = List("s3"),

    outputs = List("s4", "s5")

  )

```

Export lineage as JSON, DOT (Graphviz), or Mermaid diagrams:

```scala

Seq(A, B).toJson

Seq(A, B).toDot

```



  



```scala

Seq(A, B).toMermaid

```

```mermaid

graph LR

    classDef pipeline fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000

    classDef dataSource fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000

    classDef cluster fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#000

    A["A
(0 */2 * * *)"]

    B["B"]

    s1(["s1"])

    s2(["s2"])

    s3(["s3"])

    s4(["s4"])

    s5(["s5"])

    s1 --> A

    s2 --> A

    A --> s3

    s3 --> B

    B --> s4

    B --> s5

    A -.-> B

    linkStyle 6 stroke:#ff6b35,stroke-width:2px

    class A pipeline

    class B pipeline

    class s1 dataSource

    class s2 dataSource

    class s3 dataSource

    class s4 dataSource

    class s5 dataSource

```

**etl4s** automatically infers dependencies by matching output -> input sources. Nodes don't need to be connected with `~>` for lineage tracking. Explicit dependencies via `upstreams` also supported.

## Examples

#### Chain two pipelines

Simple UNIX-pipe style chaining of two pipelines:

```scala

import etl4s._

val p1 = Pipeline((i: Int) => i.toString)

val p2 = Pipeline((s: String) => s + "!")

val p3 = p1 ~> p2

```

#### Complex chaining

Connect the output of two pipelines to a third:

```scala

import etl4s._

val namePipeline = Pipeline("John Doe")

val agePipeline  = Pipeline(30)

val toUpper      = Transform[String, String](_.toUpperCase)

val consoleLoad  = Load[String, Unit](println(_))

val combined =

  for {

    name <- namePipeline

    age <- agePipeline

    _ <- Extract(s"$name | $age") ~> toUpper ~> consoleLoad

  } yield ()

```

## Real-world examples

**etl4s** works great with anything:

- Spark / Flink / Beam

- ETL / Streaming

- Distributed Systems

- Local scripts

- Big Data workflows

- Web-server dataflows

## Inspiration

- Debasish Ghosh's [Functional and Reactive Domain Modeling](https://www.manning.com/books/functional-and-reactive-domain-modeling)

- [Akka Streams DSL](https://doc.akka.io/libraries/akka-core/current/stream/stream-graphs.html#constructing-graphs)

- Various Rich Hickey talks
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mattlianje/etl4s

Awesome Lists containing this project

README