# ETL Workflow _(beta)_ #

[![Build Status](https://travis-ci.org/calvinlfer/etl-workflow.svg?branch=master)](https://travis-ci.org/calvinlfer/etl-workflow)
[![Download](https://api.bintray.com/packages/calvinlfer/maven/etl-workflow/images/download.svg)](https://bintray.com/calvinlfer/maven/etl-workflow/_latestVersion)

**ETL Workflow** is a simple and *opinionated* way to help you structure type-safe Extract-Transform-Load (**ETL**)
pipelines. The Domain Specific Language (DSL) is flexible enough to create linear pipelines that involve a single
`Extract` source and a single `Load` sink

```
Extract source A ~> Transform A to B ~> Load B (sink 1)
```

all the way to stitching multiple Extract sources together and flowing the data through to multiple Load sinks

```
Extract source A ~>                           ~> Load D (sink 1)
                   \                        /
Extract source B ~> Transform (A, B, C) to D ~> Load D (sink 2)
                   /                        \
Extract source C ~>                           ~> Load D (sink 3)
```
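
As a rough sketch of these shapes, starting with the linear one: the stand-in definitions below are a minimal model of the same idea, not the library's verified API (its actual types and `~>` operator may differ).

```scala
// Stand-in definitions to illustrate the shapes above; these are NOT the
// library's verified API, just a hand-rolled model of the same idea.
final case class Extract[A](run: () => A)
final case class Transform[A, B](run: A => B)
final case class Load[B, S](run: B => S)

// Extract source A ~> Transform A to B ~> Load B (sink 1), wired by hand:
val sourceA: Extract[String]     = Extract(() => "2024-01-01,100")
val aToB: Transform[String, Int] = Transform(_.split(",")(1).toInt)
val sink1: Load[Int, Boolean]    = Load(b => { println(s"loaded $b"); true })

val status: Boolean = sink1.run(aToB.run(sourceA.run()))
```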

It is built on an immutable and functional architecture where side-effects are executed at the end of the world, when
the pipeline is run.

It is intended to be used in conjunction with Spark (especially for ETL) to minimize boilerplate and give you an
almost whiteboard-like representation of your pipeline.
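
Returning to the second diagram, the fan-in/fan-out shape could be wired with the same stand-ins along these lines (again an illustration of the shape, not the library's verified combinators):

```scala
// Fan-in: run three stand-in extracts and feed the (A, B, C) tuple to one transform.
val a: Extract[Int] = Extract(() => 1)
val b: Extract[Int] = Extract(() => 2)
val c: Extract[Int] = Extract(() => 3)

val abcToD: Transform[(Int, Int, Int), Int] =
  Transform { case (x, y, z) => x + y + z }

// Fan-out: broadcast the single D value to every sink.
val sinks: List[Load[Int, Boolean]] =
  List("sink 1", "sink 2", "sink 3").map { name =>
    Load[Int, Boolean](d => { println(s"$name got $d"); true })
  }

val d: Int = abcToD.run((a.run(), b.run(), c.run()))
val statuses: List[Boolean] = sinks.map(_.run(d))  // List(true, true, true)
```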

## Usage ##
```sbt
resolvers += Resolver.bintrayRepo("calvinlfer", "maven")

// Replace <version> with the latest release shown on the download badge above
libraryDependencies += "com.ghostsequence" %% "etl-workflow" % "<version>"
```

## Building Blocks ##

An ETL pipeline consists of the following building blocks:

#### `Extract[A]` ####
A producer of a single element of data whose type is `A`. This is the start of the ETL pipeline; you can connect it
to `Transform`ers or to a `Load[A, AStatus]` to create an `ETLPipeline[AStatus]` that can be run.

#### `Transform[A, B]` ####
A transformer of an element `A` into a `B`. You can attach these after an `Extract[A]` or before a `Load[B, BStatus]`.

#### `Load[B, BStatus]` ####
The end of the pipeline, which consumes the data `B` flowing through the pipeline and produces a status
`BStatus` indicating whether consumption succeeded.

#### `ETLPipeline[ConsumeStatus]` ####
This represents the fully assembled ETL pipeline, which can be executed with `unsafeRunSync()` to produce a
`ConsumeStatus` indicating whether the pipeline finished successfully.
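
To make the end-of-the-world execution concrete: continuing the stand-in model from the diagrams section, an `ETLPipeline[ConsumeStatus]` can be nothing more than a deferred computation. The constructor and helper below are illustrative assumptions, not the library's actual internals.

```scala
// Building a pipeline only composes descriptions; no side effect runs
// until unsafeRunSync() is invoked at the end of the world.
final case class ETLPipeline[S](unsafeRunSync: () => S)

def linear[A, B, S](e: Extract[A], t: Transform[A, B], l: Load[B, S]): ETLPipeline[S] =
  ETLPipeline(() => l.run(t.run(e.run())))

val pipeline: ETLPipeline[Boolean] = linear(sourceA, aToB, sink1)  // nothing has run yet
val finished: Boolean = pipeline.unsafeRunSync()                   // side effects happen here
```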

**Note:** At the end of the day, these building blocks are a reification of values and functions. You can build an
ETL pipeline out of plain functions and values, but it helps to have a Domain Specific Language to increase readability.
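
For instance, the same linear shape with no DSL at all is just values and function application (a throwaway sketch, not code from the library):

```scala
// The same shape as bare values and functions, with no wrapper types at all.
val extract: () => String    = () => "21"
val transform: String => Int = s => s.toInt * 2
val load: Int => Boolean     = n => { println(n); true }

val runPipeline: () => Boolean = () => load(transform(extract()))
runPipeline()  // prints 42 and returns true
```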

## Examples ##
See [here](src/main/tut/Examples.md) for examples of how to get started.

### Inspiration ###

- [Mario](https://github.com/intentmedia/mario)
- [Akka Streams](https://doc.akka.io/docs/akka/2.5/stream/index.html)
- [Monix Observables](https://monix.io)

### Release process ###
Make sure you have the correct [Bintray credentials](http://queirozf.com/entries/publishing-an-sbt-project-onto-bintray-an-example)
before proceeding:

```bash
sbt release
```

This will automatically create a Git tag and publish the library to Bintray for all cross-built Scala versions.