# sample-etl-flink-java

[![tests](https://github.com/akornatskyy/sample-etl-flink-java/actions/workflows/tests.yaml/badge.svg)](https://github.com/akornatskyy/sample-etl-flink-java/actions/workflows/tests.yaml)

The sample ingests multiline gzipped files of popular books into postgres.

## Prerequisites

Ensure JDK 8 or 11 is installed on your system:

```sh
java -version
```

Flink [runs](https://nightlies.apache.org/flink/flink-docs-stable/docs/try-flink/local_installation/)
on UNIX-like environments; on Windows, install
[cygwin](https://www.cygwin.com/) (include the *mintty* and *netcat* packages)
to emulate Linux commands, or use WSL (note: *bash for windows* does not work).

Ensure the following is set (file `~/.bash_profile`):

```sh
# ignore windows line endings (skip \r)
export SHELLOPTS
set -o igncr
```

Update the number of task slots that the TaskManager offers and add a resource
id (file `conf/flink-conf.yaml`):

```yaml
taskmanager.numberOfTaskSlots: 4
taskmanager.resource-id: local
```

Start the cluster and navigate to the web UI at
[http://localhost:8081](http://localhost:8081):

```sh
start-cluster.sh
```

## Prepare

Download and prepare the dataset (as a multiline JSON file):

```sh
curl -sL https://github.com/luminati-io/Amazon-popular-books-dataset/raw/main/Amazon_popular_books_dataset.json | \
jq -c '.[]' > dataset.json
```

Split the input (multiline JSON) file into parts with 400 lines per output file
and compress with gzip:

```sh
cat dataset.json | split -e -l400 -d --additional-suffix .json \
--filter='gzip > $FILE.gz' - part_
```

## Postgres

There are a number of ways to run postgres. If you prefer to download the
binaries and run locally without installation, use the following steps:

```sh
bin/initdb --pgdata=data/ -U postgres -E 'UTF-8' \
--lc-collate='en_US.UTF-8' --lc-ctype='en_US.UTF-8'
bin/postgres -D data/
```

Create the *books* database and apply the schema from *./misc/schema.sql*.
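A minimal sketch of this step, assuming the `createdb` and `psql` client tools
sit in the same *bin/* directory and the server started above is running:

```sh
# create the target database owned by the postgres superuser
bin/createdb -U postgres books

# apply the schema to the new database
bin/psql -U postgres -d books -f misc/schema.sql
```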

## Run

Optionally specify *--input-dir* (a directory to scan for input files) and/or
a postgres connection URL (*--db-url*):

```sh
flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar

flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar \
--input-dir ./ --db-url jdbc:postgresql://localhost:5432/books
```

Use *--disable-operator-chaining true* to see the expanded execution graph.

```sh
flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar \
--disable-operator-chaining true
```

Running from IntelliJ IDEA requires editing the run configuration to add
dependencies of *provided* scope to the classpath.

## Design

The design aims at simplicity, reuse and maintainability: components (be it
an operator, stream, source or sink) are *self-sufficient* and *composable*.

This can be achieved with Java 8 functional interfaces such as
`Function` and `Consumer`.
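For instance, components that implement `Function` compose with `andThen`
into a single pipeline that is then executed with one `apply` call. A minimal
sketch using only `java.util.function` (the class and member names below are
hypothetical, not from this repository):

```java
import java.util.function.Function;

public final class CompositionSketch {

  // A "source": produces a raw record from a seed value.
  static final Function<Integer, String> source = n -> "book-" + n;

  // An "operator": transforms the record.
  static final Function<String, String> parse = String::toUpperCase;

  // A "sink": formats the final value for output.
  static final Function<String, String> sink = s -> "stored: " + s;

  public static void main(String[] args) {
    // Compose the dataflow first, then run it with a single apply().
    String result = source.andThen(parse).andThen(sink).apply(42);
    System.out.println(result); // prints "stored: BOOK-42"
  }
}
```

Each piece is self-sufficient (it knows how to attach itself to the
pipeline), so reordering or inserting steps is a one-line change.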

### Operator

Operator, also known as a Flink function or transformation:

- Implements `Function<DataStream<IN>, SingleOutputStreamOperator<OUT>>` from
`java.util.function`. Used to add itself into a data stream.
- Implements a functional interface that extends `Function` from
`org.apache.flink.api.common.functions`, e.g. `FlatMapFunction`, etc.,
or extends a rich equivalent derived from `AbstractRichFunction`, e.g.
`RichFlatMapFunction`, etc. Used to perform a transformation on a stream
value.

Example (see [BookJsonDeserializerOperator.java](./src/main/java/sample/basic/operators/BookJsonDeserializerOperator.java)):

```java
public final class BookJsonDeserializerOperator
    implements
      Function<
          DataStream<String>,
          SingleOutputStreamOperator<Book>>,
      MapFunction<String, Book> {

  // ...

  @Override
  public SingleOutputStreamOperator<Book> apply(DataStream<String> in) {
    return in
        .map(this)
        .name("parse book from a json line");
  }

  @Override
  public Book map(String value) throws JsonProcessingException {
    return MAPPER.readValue(value, Book.class);
  }
}
```

### Source

Source (a data stream source):

- Implements `Function<StreamExecutionEnvironment, DataStreamSource<OUT>>`
from `java.util.function`. Used to add itself into `StreamExecutionEnvironment`;
the return type `DataStreamSource<OUT>` is used to chain stream transformations.

Example (see [BookDataStreamSource.java](./src/main/java/sample/basic/sources/BookDataStreamSource.java)):

```java
public final class BookDataStreamSource
    implements Function<StreamExecutionEnvironment, DataStreamSource<String>> {

  // ...

  @Override
  public DataStreamSource<String> apply(StreamExecutionEnvironment env) {
    Collection<Path> paths = scan(inputDir);
    return env.fromSource(
        FileSource.forRecordStreamFormat(
                new TextLineInputFormat(),
                paths.toArray(new Path[0]))
            .build(),
        WatermarkStrategy.noWatermarks(),
        "read source");
  }
}
```

### Sink

Sink (a final destination of stream transformations):

- Implements `Function<DataStream<IN>, DataStreamSink<IN>>` from
`java.util.function`. Used to add itself into `DataStream`.

Example (see [BookJdbcSink.java](./src/main/java/sample/basic/sinks/BookJdbcSink.java)):

```java
public final class BookJdbcSink
    implements
      Function<DataStream<Book>, DataStreamSink<Book>> {

  // ...

  @Override
  public DataStreamSink<Book> apply(DataStream<Book> in) {
    return in
        .addSink(sink(executionOptions, connectionOptions))
        .name("persist to storage");
  }
}
```

### Stream

Stream (a Flink application, or a streaming dataflow):

- Exposes factory function `getStream(Options options)`. Used to pass
configuration options, e.g. `input-dir`, `db-url`, etc.
- Implements `Consumer<StreamExecutionEnvironment>` from
`java.util.function`. Used to add itself into `StreamExecutionEnvironment`.
- Implements `Function<DataStreamSource<IN>, SingleOutputStreamOperator<OUT>>`.
Used to compose a streaming flow of operators.

Example (see [BooksIngestionStream.java](./src/main/java/sample/basic/streams/BooksIngestionStream.java)):

```java
public final class BooksIngestionStream
    implements
      Consumer<StreamExecutionEnvironment>,
      Function<
          DataStreamSource<String>,
          SingleOutputStreamOperator<Book>> {

  // ...

  public static BooksIngestionStream getStream(Options options) {
    return new BooksIngestionStream(options);
  }

  @Override
  public void accept(StreamExecutionEnvironment env) {
    new BookDataStreamSource(options.inputDir)
        .andThen(this)
        .andThen(new BookJdbcSink(
            options.jdbc.execution,
            options.jdbc.connection))
        .apply(env);
  }

  @Override
  public SingleOutputStreamOperator<Book> apply(
      DataStreamSource<String> source) {
    return new BookJsonDeserializerOperator()
        //.andThen(...)
        //.andThen(...)
        .apply(source);
  }
}
```

### Options

The Options class represents a stream dataflow configuration, usually
obtained from the application command line args or similar:

- A plain POJO.
- Exposes factory function `fromArgs(String[] args)`. Used to parse
configuration options and set sensible defaults.

Example (see [BooksIngestionStream.java](./src/main/java/sample/basic/streams/BooksIngestionStream.java)):

```java
public final class BooksIngestionStream {

  // ...

  public static class Options {
    public final Path inputDir;

    Options(ParameterTool params) {
      inputDir = new Path(
          Optional.ofNullable(params.get("input-dir")).orElse("./"));
    }

    public static Options fromArgs(String[] args) {
      return new Options(ParameterTool.fromArgs(args));
    }
  }
}
```

### Entry Point

This is the entry point of the Java application that initializes and executes
the Flink job.

Example (see [BasicBooksIngestion.java](./src/main/java/sample/basic/BasicBooksIngestion.java)):

```java
public final class BasicBooksIngestion {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();

    BooksIngestionStream
        .getStream(BooksIngestionStream.Options.fromArgs(args))
        .accept(env);

    env.execute("Sample Books Basic ETL Job");
  }
}
```

## References

- [Amazon Popular Books Dataset](https://github.com/luminati-io/Amazon-popular-books-dataset)