# sample-etl-flink-java

[![tests](https://github.com/akornatskyy/sample-etl-flink-java/actions/workflows/tests.yaml/badge.svg)](https://github.com/akornatskyy/sample-etl-flink-java/actions/workflows/tests.yaml)

The sample ingests multiline gzipped files of popular books into postgres.

## Prerequisites

Ensure JDK 8 or 11 is installed on your system:

```sh
java -version
```

Flink [runs](https://nightlies.apache.org/flink/flink-docs-stable/docs/try-flink/local_installation/)
on UNIX-like environments; on Windows, install
[cygwin](https://www.cygwin.com/) (include the *mintty* and *netcat* packages)
to emulate Linux commands, or use WSL (note: *bash for windows* does not work).

Ensure the following is set (file `~/.bash_profile`):

```sh
# ignore windows line endings (skip \r)
export SHELLOPTS
set -o igncr
```

Update the number of task slots that the TaskManager offers and add a resource
id (file `conf/flink-conf.yaml`):

```yaml
taskmanager.numberOfTaskSlots: 4
taskmanager.resource-id: local
```

Start the cluster and navigate to the web UI at
[http://localhost:8081](http://localhost:8081):

```sh
start-cluster.sh
```

## Prepare

Download and prepare the dataset (as a multiline JSON file):

```sh
curl -sL https://github.com/luminati-io/Amazon-popular-books-dataset/raw/main/Amazon_popular_books_dataset.json | \
jq -c '.[]' > dataset.json
```

Split the input (multiline JSON) file into parts with 400 lines per output file
and compress with gzip:

```sh
cat dataset.json | split -e -l400 -d --additional-suffix .json \
--filter='gzip > $FILE.gz' - part_
```

## Postgres

There are a number of ways to run postgres. If you prefer to download the
binaries and run locally without installation, use the following steps:

```sh
bin/initdb --pgdata=data/ -U postgres -E 'UTF-8' \
--lc-collate='en_US.UTF-8' --lc-ctype='en_US.UTF-8'
bin/postgres -D data/
```

Create the *books* database and apply the schema from *./misc/schema.sql*.
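A minimal sketch of this step, assuming the `createdb` and `psql` client tools
sit in the same *bin/* directory and the server started above is running:

```sh
# create the target database owned by the postgres superuser
bin/createdb -U postgres books

# apply the schema to the new database
bin/psql -U postgres -d books -f misc/schema.sql
```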

## Run

Optionally specify *--input-dir* (a directory to scan for input files) and/or
a postgres connection URL (*--db-url*):

```sh
flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar

flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar \
--input-dir ./ --db-url jdbc:postgresql://localhost:5432/books
```

Use *--disable-operator-chaining true* to see the expanded execution graph.

```sh
flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar \
--disable-operator-chaining true
```

Running from IntelliJ IDEA requires editing the run configuration to add
dependencies of *provided* scope to the classpath.

## Design

The design aims at simplicity, reuse and maintainability: components (be it
an operator, stream, source or sink) are *self-sufficient* and *composable*.

This can be achieved with Java 8 functional interfaces such as
`Function` and `Consumer`.
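For instance, components that implement `Function` compose with `andThen`
into a single pipeline that is then executed with one `apply` call. A minimal
sketch using only `java.util.function` (the class and member names below are
hypothetical, not from this repository):

```java
import java.util.function.Function;

public final class CompositionSketch {

  // A "source": produces a raw record from a seed value.
  static final Function<Integer, String> source = n -> "book-" + n;

  // An "operator": transforms the record.
  static final Function<String, String> parse = String::toUpperCase;

  // A "sink": formats the final value for output.
  static final Function<String, String> sink = s -> "stored: " + s;

  public static void main(String[] args) {
    // Compose the dataflow first, then run it with a single apply().
    String result = source.andThen(parse).andThen(sink).apply(42);
    System.out.println(result); // prints "stored: BOOK-42"
  }
}
```

Each piece is self-sufficient (it knows how to attach itself to the
pipeline), so reordering or inserting steps is a one-line change.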

### Operator

Operator, also known as a Flink function or transformation:

- Implements `Function<DataStream<IN>, SingleOutputStreamOperator<OUT>>` from
`java.util.function`. Used to add itself into a data stream.
- Implements a functional interface that extends `Function` from
`org.apache.flink.api.common.functions`, e.g. `FlatMapFunction`, etc.,
or extends a rich equivalent derived from `AbstractRichFunction`, e.g.
`RichFlatMapFunction`, etc. Used to perform a transformation on a stream
value.

Example (see [BookJsonDeserializerOperator.java](./src/main/java/sample/basic/operators/BookJsonDeserializerOperator.java)):

```java
public final class BookJsonDeserializerOperator
    implements
      Function<
          DataStream<String>,
          SingleOutputStreamOperator<Book>>,
      MapFunction<String, Book> {

  // ...

  @Override
  public SingleOutputStreamOperator<Book> apply(DataStream<String> in) {
    return in
        .map(this)
        .name("parse book from a json line");
  }

  @Override
  public Book map(String value) throws JsonProcessingException {
    return MAPPER.readValue(value, Book.class);
  }
}
```

### Source

Source (a data stream source):

- Implements `Function<StreamExecutionEnvironment, DataStreamSource<OUT>>`
from `java.util.function`. Used to add itself into `StreamExecutionEnvironment`;
the return type `DataStreamSource<OUT>` is used to chain stream transformations.

Example (see [BookDataStreamSource.java](./src/main/java/sample/basic/sources/BookDataStreamSource.java)):

```java
public final class BookDataStreamSource
    implements Function<StreamExecutionEnvironment, DataStreamSource<String>> {

  // ...

  @Override
  public DataStreamSource<String> apply(StreamExecutionEnvironment env) {
    Collection<Path> paths = scan(inputDir);
    return env.fromSource(
        FileSource.forRecordStreamFormat(
                new TextLineInputFormat(),
                paths.toArray(new Path[0]))
            .build(),
        WatermarkStrategy.noWatermarks(),
        "read source");
  }
}
```

### Sink

Sink (a final destination of stream transformations):

- Implements `Function<DataStream<IN>, DataStreamSink<IN>>` from
`java.util.function`. Used to add itself into `DataStream`.

Example (see [BookJdbcSink.java](./src/main/java/sample/basic/sinks/BookJdbcSink.java)):

```java
public final class BookJdbcSink
    implements
      Function<DataStream<Book>, DataStreamSink<Book>> {

  // ...

  @Override
  public DataStreamSink<Book> apply(DataStream<Book> in) {
    return in
        .addSink(sink(executionOptions, connectionOptions))
        .name("persist to storage");
  }
}
```

### Stream

Stream (a Flink application, or a streaming dataflow):

- Exposes factory function `getStream(Options options)`. Used to pass
configuration options, e.g. `input-dir`, `db-url`, etc.
- Implements `Consumer<StreamExecutionEnvironment>` from
`java.util.function`. Used to add itself into `StreamExecutionEnvironment`.
- Implements `Function<DataStreamSource<IN>, SingleOutputStreamOperator<OUT>>`.
Used to compose a streaming flow of operators.

Example (see [BooksIngestionStream.java](./src/main/java/sample/basic/streams/BooksIngestionStream.java)):

```java
public final class BooksIngestionStream
    implements
      Consumer<StreamExecutionEnvironment>,
      Function<
          DataStreamSource<String>,
          SingleOutputStreamOperator<Book>> {

  // ...

  public static BooksIngestionStream getStream(Options options) {
    return new BooksIngestionStream(options);
  }

  @Override
  public void accept(StreamExecutionEnvironment env) {
    new BookDataStreamSource(options.inputDir)
        .andThen(this)
        .andThen(new BookJdbcSink(
            options.jdbc.execution,
            options.jdbc.connection))
        .apply(env);
  }

  @Override
  public SingleOutputStreamOperator<Book> apply(
      DataStreamSource<String> source) {
    return new BookJsonDeserializerOperator()
        //.andThen(...)
        //.andThen(...)
        .apply(source);
  }
}
```

### Options

The Options class represents a stream dataflow configuration, usually
obtained from the application command line args or similar:

- A plain POJO.
- Exposes factory function `fromArgs(String[] args)`. Used to parse
configuration options and set sensible defaults.

Example (see [BooksIngestionStream.java](./src/main/java/sample/basic/streams/BooksIngestionStream.java)):

```java
public final class BooksIngestionStream {

  // ...

  public static class Options {
    public final Path inputDir;

    Options(ParameterTool params) {
      inputDir = new Path(
          Optional.ofNullable(params.get("input-dir")).orElse("./"));
    }

    public static Options fromArgs(String[] args) {
      return new Options(ParameterTool.fromArgs(args));
    }
  }
}
```

### Entry Point

This is the entry point of the Java application that initializes and executes
the Flink job.

Example (see [BasicBooksIngestion.java](./src/main/java/sample/basic/BasicBooksIngestion.java)):

```java
public final class BasicBooksIngestion {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();

    BooksIngestionStream
        .getStream(BooksIngestionStream.Options.fromArgs(args))
        .accept(env);

    env.execute("Sample Books Basic ETL Job");
  }
}
```

## References

- [Amazon Popular Books Dataset](https://github.com/luminati-io/Amazon-popular-books-dataset)