https://github.com/akornatskyy/sample-etl-flink-java
The sample ingests multiline gzipped files of popular books into postgres.
- Host: GitHub
- URL: https://github.com/akornatskyy/sample-etl-flink-java
- Owner: akornatskyy
- License: mit
- Created: 2023-10-08T13:18:37.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-26T11:15:38.000Z (3 months ago)
- Last Synced: 2025-01-26T12:20:19.717Z (3 months ago)
- Topics: batch-processing, etl, flink, ingestion, java, postgres, sample
- Language: Java
- Size: 61.5 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
# sample-etl-flink-java
[![tests](https://github.com/akornatskyy/sample-etl-flink-java/actions/workflows/tests.yaml/badge.svg)](https://github.com/akornatskyy/sample-etl-flink-java/actions/workflows/tests.yaml)
The sample ingests multiline gzipped files of popular books into postgres.
## Prerequisites
Ensure JDK 8 or 11 is installed on your system:
```sh
java -version
```

Flink [runs](https://nightlies.apache.org/flink/flink-docs-stable/docs/try-flink/local_installation/)
on UNIX-like environments; on Windows, install
[cygwin](https://www.cygwin.com/) (include the *mintty* and *netcat* packages)
to emulate Linux commands, or use WSL (note: *bash for windows* doesn't work).

Ensure the following is set (file `~/.bash_profile`):
```sh
# ignore windows line endings (skip \r)
export SHELLOPTS
set -o igncr
```

Update the number of task slots that the TaskManager offers and add a resource
id (file `conf/flink-conf.yaml`):

```yaml
taskmanager.numberOfTaskSlots: 4
taskmanager.resource-id: local
```

Start the cluster and navigate to the web UI at
[http://localhost:8081](http://localhost:8081):

```sh
start-cluster.sh
```

## Prepare
Download and prepare dataset (as a multiline JSON file):
```sh
curl -sL https://github.com/luminati-io/Amazon-popular-books-dataset/raw/main/Amazon_popular_books_dataset.json | \
jq -c '.[]' > dataset.json
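# Optional sanity check: the first line of dataset.json should now be a
# standalone JSON object
head -n1 dataset.json | jq -r type   # prints: object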
```

Split the input (multiline JSON) file into parts with 400 lines per output file
and compress with gzip:

```sh
cat dataset.json | split -e -l400 -d --additional-suffix .json \
--filter='gzip > $FILE.gz' - part_
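# Optional: confirm no lines were lost; the two counts below should match
gzip -cd part_*.json.gz | wc -l
wc -l < dataset.json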
```

## Postgres
There are a number of ways to run postgres; if you prefer to download a binary
distribution and run it locally without installation, use the following steps:

```sh
bin/initdb --pgdata=data/ -U postgres -E 'UTF-8' \
--lc-collate='en_US.UTF-8' --lc-ctype='en_US.UTF-8'
bin/postgres -D data/
```

Create *books* database and apply schema from *./misc/schema.sql*.
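One way to do that with the same standalone distribution (a sketch; `createdb`
and `psql` ship in the same *bin/* directory, adjust host and port to your
setup):

```sh
bin/createdb -h localhost -U postgres books
bin/psql -h localhost -U postgres -d books -f misc/schema.sql
```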
## Run
Optionally specify *--input-dir*, a directory to scan for input files, and/or
the postgres connection URL (*--db-url*):

```sh
flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar
flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar \
--input-dir ./ --db-url jdbc:postgresql://localhost:5432/books
```

Use *--disable-operator-chaining true* to see the expanded execution graph.
```sh
flink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar \
--disable-operator-chaining true
```

Running from IntelliJ IDEA requires editing the run configuration to add
dependencies of *provided* scope to the classpath.

## Design
The design aims for simplicity, reuse and maintainability, where components
(be it an operator, stream, source or sink) are *self-sufficient* and
*composable*. This can be achieved with Java 8 functional interfaces, such as
`Function` and `Consumer`.

### Operator
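As a warm-up before the component descriptions, here is a minimal, Flink-free
sketch of that composition idea (a toy pipeline; all names below are
hypothetical and not part of this project):

```java
import java.util.function.Function;

// Each stage is a plain java.util.function.Function; andThen chains them
// into one pipeline, just like source -> operator -> sink below.
public final class ComposeSketch {

  static Function<String, String> pipeline() {
    Function<String, Integer> source = Integer::parseInt;  // plays the "source"
    Function<Integer, Integer> operator = n -> n * n;      // plays an "operator"
    Function<Integer, String> sink = n -> "result=" + n;   // plays the "sink"
    return source.andThen(operator).andThen(sink);
  }

  public static void main(String[] args) {
    System.out.println(pipeline().apply("7")); // result=49
  }
}
```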
An operator, also known as a Flink function or transformation:
- Implements `Function<DataStream<T>, SingleOutputStreamOperator<R>>` from
  `java.util.function`. Used to add itself into a data stream.
- Implements a functional interface that extends `Function` from
  `org.apache.flink.api.common.functions`, e.g. `FlatMapFunction`, etc.,
  or extends a rich equivalent from `AbstractRichFunction`, e.g.
  `RichFlatMapFunction`, etc. Used to perform a transformation on a stream
  value.

Example (see [BookJsonDeserializerOperator.java](./src/main/java/sample/basic/operators/BookJsonDeserializerOperator.java)):
```java
public final class BookJsonDeserializerOperator
    implements
        Function<
            DataStream<String>,
            SingleOutputStreamOperator<Book>>,
        MapFunction<String, Book> {
  // ...

  @Override
  public SingleOutputStreamOperator<Book> apply(DataStream<String> in) {
    return in
        .map(this)
        .name("parse book from a json line");
  }

  @Override
  public Book map(String value) throws JsonProcessingException {
    return MAPPER.readValue(value, Book.class);
  }
}
```

### Source
A source (a data stream source):
- Implements `Function<StreamExecutionEnvironment, DataStreamSource<T>>`
  from `java.util.function`. Used to add itself into a
  `StreamExecutionEnvironment`; the return type `DataStreamSource` is used to
  chain stream transformations.

Example (see [BookDataStreamSource.java](./src/main/java/sample/basic/sources/BookDataStreamSource.java)):
```java
public final class BookDataStreamSource
    implements Function<StreamExecutionEnvironment, DataStreamSource<String>> {
  // ...

  @Override
  public DataStreamSource<String> apply(StreamExecutionEnvironment env) {
    Collection<Path> paths = scan(inputDir);
    return env.fromSource(
        FileSource.forRecordStreamFormat(
                new TextLineInputFormat(),
                paths.toArray(new Path[0]))
            .build(),
        WatermarkStrategy.noWatermarks(),
        "read source");
  }
}
```

### Sink
A sink (a final destination of stream transformations):
- Implements `Function<DataStream<T>, DataStreamSink<T>>` from
  `java.util.function`. Used to add itself into a `DataStream`.

Example (see [BookJdbcSink.java](./src/main/java/sample/basic/sinks/BookJdbcSink.java)):
```java
public final class BookJdbcSink
    implements Function<DataStream<Book>, DataStreamSink<Book>> {
  // ...

  @Override
  public DataStreamSink<Book> apply(DataStream<Book> in) {
    return in
        .addSink(sink(executionOptions, connectionOptions))
        .name("persist to storage");
  }
}
```

### Stream
A stream (a Flink application, or a streaming dataflow):
- Exposes the factory function `getStream(Options options)`. Used to pass
  configuration options, e.g. `input-dir`, `db-url`, etc.
- Implements `Consumer<StreamExecutionEnvironment>` from
  `java.util.function`. Used to add itself into a `StreamExecutionEnvironment`.
- Implements `Function<DataStreamSource<T>, SingleOutputStreamOperator<R>>`.
  Used to compose a streaming flow of operators.

Example (see [BooksIngestionStream.java](./src/main/java/sample/basic/streams/BooksIngestionStream.java)):
```java
public final class BooksIngestionStream
    implements
        Consumer<StreamExecutionEnvironment>,
        Function<
            DataStreamSource<String>,
            SingleOutputStreamOperator<Book>> {
  // ...

  public static BooksIngestionStream getStream(Options options) {
    return new BooksIngestionStream(options);
  }

  @Override
  public void accept(StreamExecutionEnvironment env) {
    new BookDataStreamSource(options.inputDir)
        .andThen(this)
        .andThen(new BookJdbcSink(
            options.jdbc.execution,
            options.jdbc.connection))
        .apply(env);
  }

  @Override
  public SingleOutputStreamOperator<Book> apply(
      DataStreamSource<String> source) {
    return new BookJsonDeserializerOperator()
        // .andThen(...)
        // .andThen(...)
        .apply(source);
  }
}
```

### Options
The Options class represents a stream dataflow configuration, which is usually
obtained from the application command line args or similar:
- Is a POJO.
- Exposes the factory function `fromArgs(String[] args)`. Used to parse
  configuration options and set sensible defaults.

Example (see [BooksIngestionStream.java](./src/main/java/sample/basic/streams/BooksIngestionStream.java)):
```java
public final class BooksIngestionStream {
  // ...

  public static class Options {
    public final Path inputDir;

    Options(ParameterTool params) {
      inputDir = new Path(
          Optional.ofNullable(params.get("input-dir")).orElse("./"));
    }

    public static Options fromArgs(String[] args) {
      return new Options(ParameterTool.fromArgs(args));
    }
  }
}
```

### Entry Point
This is the entry point of the Java application, which initializes and executes
the Flink job.
Example (see [BasicBooksIngestion.java](./src/main/java/sample/basic/BasicBooksIngestion.java)):
```java
public final class BasicBooksIngestion {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();

    BooksIngestionStream
        .getStream(BooksIngestionStream.Options.fromArgs(args))
        .accept(env);

    env.execute("Sample Books Basic ETL Job");
  }
}
```

## References
- [Amazon Popular Books Dataset](https://github.com/luminati-io/Amazon-popular-books-dataset)