{"id":21974080,"url":"https://github.com/akornatskyy/sample-etl-flink-java","last_synced_at":"2026-02-09T13:03:24.069Z","repository":{"id":199073165,"uuid":"702089877","full_name":"akornatskyy/sample-etl-flink-java","owner":"akornatskyy","description":"The sample ingests multiline gzipped files of popular books into postgres.","archived":false,"fork":false,"pushed_at":"2025-01-26T11:15:38.000Z","size":63,"stargazers_count":2,"open_issues_count":4,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-10T13:34:12.213Z","etag":null,"topics":["batch-processing","etl","flink","ingestion","java","postgres","sample"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/akornatskyy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-10-08T13:18:37.000Z","updated_at":"2025-01-26T11:15:41.000Z","dependencies_parsed_at":"2024-04-13T11:39:22.362Z","dependency_job_id":"a7f703be-14b6-40c0-b5b7-c0b306245c2a","html_url":"https://github.com/akornatskyy/sample-etl-flink-java","commit_stats":null,"previous_names":["akornatskyy/sample-etl-flink-java"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/akornatskyy/sample-etl-flink-java","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akornatskyy%2Fsample-etl-flink-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akornatskyy%2Fsample-etl-
flink-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akornatskyy%2Fsample-etl-flink-java/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akornatskyy%2Fsample-etl-flink-java/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/akornatskyy","download_url":"https://codeload.github.com/akornatskyy/sample-etl-flink-java/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akornatskyy%2Fsample-etl-flink-java/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29266118,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-09T12:53:16.161Z","status":"ssl_error","status_checked_at":"2026-02-09T12:52:30.244Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["batch-processing","etl","flink","ingestion","java","postgres","sample"],"created_at":"2024-11-29T15:37:36.245Z","updated_at":"2026-02-09T13:03:24.026Z","avatar_url":"https://github.com/akornatskyy.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sample-etl-flink-java\n\n[![tests](https://github.com/akornatskyy/sample-etl-flink-java/actions/workflows/tests.yaml/badge.svg)](https://github.com/akornatskyy/sample-etl-flink-java/actions/workflows/tests.yaml)\n\nThe sample ingests multiline gzipped files of 
popular books into postgres.\n\n## Prerequisites\n\nEnsure JDK 8 or 11 is installed in your system:\n\n```sh\njava -version\n```\n\nFlink [runs](https://nightlies.apache.org/flink/flink-docs-stable/docs/try-flink/local_installation/)\non UNIX-like environments, for Windows install\n[cygwin](https://www.cygwin.com/) (include *mintty* and *netcat* packages)\nto emulate linux commands or use WSL (note, *bash for windows* doesn't work).\n\nEnsure the following (file `~/.bash_profile`):\n\n```sh\n# ignore windows line endings (skip \\r)\nexport SHELLOPTS\nset -o igncr\n```\n\nUpdate the number of task slots that TaskManager offers and add id\n(file `conf/flink-conf.yaml`):\n\n```yaml\ntaskmanager.numberOfTaskSlots: 4\ntaskmanager.resource-id: local\n```\n\nStart cluster and navigate to the web UI at\n[http://localhost:8081](http://localhost:8081):\n\n```sh\nstart-cluster.sh\n```\n\n## Prepare\n\nDownload and prepare dataset (as a multiline JSON file):\n\n```sh\ncurl -sL https://github.com/luminati-io/Amazon-popular-books-dataset/raw/main/Amazon_popular_books_dataset.json | \\\n  jq -c '.[]' \u003e dataset.json\n```\n\nSplit the input (multiline JSON) file into parts with 400 lines per output file\nand compress with gzip:\n\n```sh\ncat dataset.json | split -e -l400 -d --additional-suffix .json \\\n  --filter='gzip \u003e $FILE.gz' - part_\n```\n\n## Postgres\n\nThere are a number of ways to run postgres, if you prefer to download binary and\nrun locally without installation, use the following steps:\n\n```sh\nbin/initdb --pgdata=data/ -U postgres -E 'UTF-8' \\\n  --lc-collate='en_US.UTF-8' --lc-ctype='en_US.UTF-8'\nbin/postgres -D data/\n```\n\nCreate *books* database and apply schema from *./misc/schema.sql*.\n\n## Run\n\nOptionally specify *--input-dir* for a directory to scan for input and/or a\nconnection to postgres (*--db-url*).\n\n```sh\nflink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar\n\nflink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar 
\\\n  --input-dir ./ --db-url jdbc:postgresql://localhost:5432/books\n```\n\nUse *--disable-operator-chaining true* to see the expanded execution graph.\n\n```sh\nflink run -p 4 target/sample-etl-flink-java-1.0-SNAPSHOT.jar \\\n  --disable-operator-chaining true\n```\n\nRunning from IntelliJ IDEA requires editing the run configuration to add\ndependencies of *provided* scope to the classpath.\n\n## Design\n\nThe design aims for simplicity, reuse and maintainability, where components\n(be that an operator, stream, source or sink) are *self-sufficient* and\n*composable*.\n\nThis can be achieved with Java 8 functional interfaces, like\n`Function\u003cIN, OUT\u003e` and `Consumer\u003cT\u003e`.\n\n### Operator\n\nOperator, also known as a Flink function or transformation:\n\n- Implements `Function\u003cDataStream\u003cIN\u003e, SingleOutputStreamOperator\u003cOUT\u003e\u003e` from\n`java.util.function`. Used to add itself into a data stream.\n- Implements a functional interface that extends `Function` from\n`org.apache.flink.api.common.functions`, e.g. `FlatMapFunction\u003cIN, OUT\u003e`, etc.,\nor extends a rich equivalent from `AbstractRichFunction`, e.g.\n`RichFlatMapFunction\u003cIN, OUT\u003e`, etc. 
Used to perform a transformation on a stream\nvalue.\n\nExample (see [BookJsonDeserializerOperator.java](./src/main/java/sample/basic/operators/BookJsonDeserializerOperator.java)):\n\n```java\npublic final class BookJsonDeserializerOperator\n    implements\n    Function\u003c\n        DataStream\u003cString\u003e,\n        SingleOutputStreamOperator\u003cBook\u003e\u003e,\n    MapFunction\u003cString, Book\u003e {\n\n  // ...\n\n  @Override\n  public SingleOutputStreamOperator\u003cBook\u003e apply(DataStream\u003cString\u003e in) {\n    return in\n        .map(this)\n        .name(\"parse book from a json line\");\n  }\n\n  @Override\n  public Book map(String value) throws JsonProcessingException {\n    return MAPPER.readValue(value, Book.class);\n  }\n}\n```\n\n### Source\n\nSource (a data stream source):\n\n- Implements `Function\u003cStreamExecutionEnvironment, DataStreamSource\u003cT\u003e\u003e`\nfrom `java.util.function`. Used to add itself into `StreamExecutionEnvironment`;\nthe return type `DataStreamSource\u003cT\u003e` is used to chain stream transformations.\n\nExample (see [BookDataStreamSource.java](./src/main/java/sample/basic/sources/BookDataStreamSource.java)):\n\n```java\npublic final class BookDataStreamSource\n    implements Function\u003cStreamExecutionEnvironment, DataStreamSource\u003cString\u003e\u003e {\n\n  // ...\n\n  @Override\n  public DataStreamSource\u003cString\u003e apply(StreamExecutionEnvironment env) {\n    Collection\u003cPath\u003e paths = scan(inputDir);\n    return env.fromSource(\n        FileSource.forRecordStreamFormat(\n                new TextLineInputFormat(),\n                paths.toArray(new Path[0]))\n            .build(),\n        WatermarkStrategy.noWatermarks(),\n        \"read source\");\n  }\n}\n```\n\n### Sink\n\nSink (a final destination of stream transformations):\n\n- Implements `Function\u003cDataStream\u003cT\u003e, DataStreamSink\u003cT\u003e\u003e` from\n`java.util.function`. 
Used to add itself into `DataStream\u003cT\u003e`.\n\nExample (see [BookJdbcSink.java](./src/main/java/sample/basic/sinks/BookJdbcSink.java)):\n\n```java\npublic final class BookJdbcSink\n    implements\n    Function\u003cDataStream\u003cBook\u003e, DataStreamSink\u003cBook\u003e\u003e {\n\n  // ...\n\n  @Override\n  public DataStreamSink\u003cBook\u003e apply(DataStream\u003cBook\u003e in) {\n    return in\n        .addSink(sink(executionOptions, connectionOptions))\n        .name(\"persist to storage\");\n  }\n}\n```\n\n### Stream\n\nStream (a Flink application, or a streaming dataflow):\n\n- Exposes factory function `getStream(Options options)`. Used to pass\nconfiguration options, e.g. `input-dir`, `db-url`, etc.\n- Implements `Consumer\u003cStreamExecutionEnvironment\u003e` from\n`java.util.function`. Used to add itself into `StreamExecutionEnvironment`.\n- Implements `Function\u003cDataStreamSource\u003cIN\u003e, SingleOutputStreamOperator\u003cT\u003e\u003e`.\nUsed to compose a streaming flow of operators.\n\nExample (see [BooksIngestionStream.java](./src/main/java/sample/basic/streams/BooksIngestionStream.java)):\n\n```java\npublic final class BooksIngestionStream\n    implements\n    Consumer\u003cStreamExecutionEnvironment\u003e,\n    Function\u003c\n        DataStreamSource\u003cString\u003e,\n        SingleOutputStreamOperator\u003cBook\u003e\u003e {\n\n  // ...\n\n  public static BooksIngestionStream getStream(Options options) {\n    return new BooksIngestionStream(options);\n  }\n\n  @Override\n  public void accept(StreamExecutionEnvironment env) {\n    new BookDataStreamSource(options.inputDir)\n        .andThen(this)\n        .andThen(new BookJdbcSink(\n            options.jdbc.execution,\n            options.jdbc.connection))\n        .apply(env);\n  }\n\n  @Override\n  public SingleOutputStreamOperator\u003cBook\u003e apply(\n      DataStreamSource\u003cString\u003e source) {\n    return new BookJsonDeserializerOperator()\n        
//.andThen(...)\n        //.andThen(...)\n        .apply(source);\n  }\n}\n```\n\n### Options\n\nThe Options class represents a stream dataflow configuration, which is usually\nobtained from the application command line args or similar:\n\n- Is a POJO.\n- Exposes a factory function `fromArgs(String[] args)`. Used to parse\nconfiguration options and set sensible defaults.\n\nExample (see [BooksIngestionStream.java](./src/main/java/sample/basic/streams/BooksIngestionStream.java)):\n\n```java\npublic final class BooksIngestionStream {\n\n  // ...\n\n  public static class Options {\n    public final Path inputDir;\n\n    Options(ParameterTool params) {\n      inputDir = new Path(\n          Optional.ofNullable(params.get(\"input-dir\")).orElse(\"./\"));\n    }\n\n    public static Options fromArgs(String[] args) {\n      return new Options(ParameterTool.fromArgs(args));\n    }\n  }\n}\n```\n\n### Entry Point\n\nThis is the entry point of the Java application; it initializes and executes the Flink job.\n\nExample (see [BasicBooksIngestion.java](./src/main/java/sample/basic/BasicBooksIngestion.java)):\n\n```java\npublic final class BasicBooksIngestion {\n\n  public static void main(String[] args) throws Exception {\n    StreamExecutionEnvironment env = StreamExecutionEnvironment\n        .getExecutionEnvironment();\n\n    BooksIngestionStream\n        .getStream(BooksIngestionStream.Options.fromArgs(args))\n        .accept(env);\n\n    env.execute(\"Sample Books Basic ETL Job\");\n  }\n}\n```\n\n## References\n\n- [Amazon Popular Books Dataset](https://github.com/luminati-io/Amazon-popular-books-dataset)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakornatskyy%2Fsample-etl-flink-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fakornatskyy%2Fsample-etl-flink-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakornatskyy%2Fsample-etl-flink-java/lists"}