{"id":21839095,"url":"https://github.com/illuin-tech/data-pipeline","last_synced_at":"2026-04-01T17:25:37.508Z","repository":{"id":225869026,"uuid":"744559077","full_name":"illuin-tech/data-pipeline","owner":"illuin-tech","description":"Library for describing data transformation pipelines by compositing simple reusable components.","archived":false,"fork":false,"pushed_at":"2026-03-15T08:38:17.000Z","size":1464,"stargazers_count":6,"open_issues_count":25,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2026-03-15T21:39:55.585Z","etag":null,"topics":["data-pipeline","etl","java"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/illuin-tech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-01-17T14:48:13.000Z","updated_at":"2026-03-15T08:37:04.000Z","dependencies_parsed_at":"2024-12-09T10:22:54.237Z","dependency_job_id":"5d4ed8d5-37f2-4f4e-962a-33b813a33432","html_url":"https://github.com/illuin-tech/data-pipeline","commit_stats":null,"previous_names":["illuin-tech/data-pipeline"],"tags_count":68,"template":false,"template_full_name":null,"purl":"pkg:github/illuin-tech/data-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/illuin-tech%2Fdata-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/illuin-tech%2Fdata-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/illuin-tech%2Fdata-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/illuin-tech%2Fdata-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/illuin-tech","download_url":"https://codeload.github.com/illuin-tech/data-pipeline/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/illuin-tech%2Fdata-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31290537,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-pipeline","etl","java"],"created_at":"2024-11-27T21:15:55.131Z","updated_at":"2026-04-01T17:25:37.500Z","avatar_url":"https://github.com/illuin-tech.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# data-pipeline\n\n[![Maven Build](https://github.com/illuin-tech/data-pipeline/actions/workflows/maven-build.yml/badge.svg?branch=master)](https://github.com/illuin/data-pipeline/actions/workflows/maven-build.yml)\n[![Maven Central 
Version](https://img.shields.io/maven-central/v/tech.illuin/data-pipeline)](https://central.sonatype.com/artifact/tech.illuin/data-pipeline)\n[![javadoc](https://javadoc.io/badge2/tech.illuin/data-pipeline/javadoc.svg)](https://javadoc.io/doc/tech.illuin/data-pipeline)\n[![codecov](https://codecov.io/gh/illuin-tech/data-pipeline/graph/badge.svg?token=T141JE2VMY)](https://codecov.io/gh/illuin-tech/data-pipeline)\n![GitHub](https://img.shields.io/github/license/illuin-tech/data-pipeline)\n\nThis library is a toolkit for describing data transformation pipelines by compositing simple reusable components.\n\nTypical `data-pipeline` use cases include:\n* a system aggregating results from several external services: pipelines are modular and easily rearranged, and each individual step can be padded with safety nets and error handling without affecting business logic\n* a system performing iterative analysis on an input: the `data-pipeline` data model retains intermediate results from all steps, and each result is tagged with lineage metadata\n\nOn top of its core feature set, complying with the `data-pipeline` model comes with some nice benefits:\n* out-of-the-box support for [micrometer](https://micrometer.io) based [metrics](doc/integrations.md#metrics) (success/failure rates, error tracking, etc.) and [tracing](doc/integrations.md#tracing)\n* out-of-the-box support for [slf4j](https://www.slf4j.org) log markers (pipeline id, component id, etc.)\n* easily pluggable [resilience4j](https://resilience4j.readme.io) based [resilience features](doc/integrations.md#resilience4j) (retries, time-limiter, etc.)\n\n## I. 
Installation\n\nThe library requires Java 17+. To use it, add the following to your `pom.xml`:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003etech.illuin\u003c/groupId\u003e\n    \u003cartifactId\u003edata-pipeline\u003c/artifactId\u003e\n    \u003cversion\u003e0.30\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nAdditionally, some optional extension libraries can be added; at the time of writing, these include:\n* `data-pipeline-resilience4j` (for the [`resilience4j` integration](doc/integrations.md#resilience4j))\n\n## II. Core Principles\n\nThe main goals behind its design were:\n1. having a fairly straightforward API and overall design: pipelines are made by compositing user-defined functions executed in a linear fashion\n2. leveraging this design to introduce useful features: data-model with lineage features, parallel execution, resilience features (retry, time-limiter, etc.), systematic performance metrics, error tracking, etc.\n\nA simplified high-level view of a `data-pipeline` pipeline looks like the following diagram:\n* the main phase is composed of `Step` functions: they return a `Result` and are expected to have no side effects (think of them as almost-pure functions)\n* the end phase is composed of `Sink` functions: they return `void` and are expected to induce side effects (e.g. 
database persistence, message queue push, etc.)\n\n```mermaid\nflowchart LR\n\nINPUT((input))\nOUTPUT((output))\n\nsubgraph PIPELINE[Pipeline]\n    direction LR\n\n    subgraph STEP_PHASE[Steps]\n        direction LR\n        STEP_1(Step 1):::step\n        STEP_2(Step 2):::step\n        STEP_N(...):::step\n    end\n\n    subgraph SINK_PHASE[Sinks]\n        direction LR\n        SINK_1(Sink 1):::sink\n        SINK_N(...):::sink\n    end\nend\n\nINPUT --\u003e PIPELINE --\u003e OUTPUT\nSTEP_PHASE --\u003e SINK_PHASE \nSTEP_1 --\u003e STEP_2 --\u003e STEP_N\n\nclassDef optional stroke-dasharray: 6 6;\nclassDef step stroke:#0f0;\nclassDef sink stroke:#f00;\n```\n\n## III. Basic Usage\n\nWe'll go through a quick example in order to demonstrate what `data-pipeline` looks like in action.\n\nThe goal of this example is to have a simple pipeline for:\n* performing a basic tokenization of a sentence\n* performing a basic analysis of said tokens\n* recovering results and logging them out\n\n```mermaid\nflowchart LR\n\nINPUT((sentence))\n\nsubgraph PIPELINE[Pipeline]\n    direction LR\n\n    subgraph STEP_PHASE[Steps]\n        direction LR\n        TOKENIZER(Tokenizer):::step\n        MATCHER(Matcher):::step\n    end\n\n    subgraph SINK_PHASE[Sinks]\n        direction LR\n        MATCH_LOGGER(Match Logger):::sink\n    end\nend\n\nINPUT --\u003e PIPELINE\nSTEP_PHASE --\u003e SINK_PHASE\nTOKENIZER --\u003e MATCHER\n\nclassDef optional stroke-dasharray: 6 6;\nclassDef step stroke:#0f0;\nclassDef sink stroke:#f00;\n```\n\n### Defining steps and sinks\n\nFirst, we'll design the `Tokenizer` step, with a basic regex split.\n\nThree things to note here, which will remain true for the following pieces:\n* The step's entrypoint is annotated with `@StepConfig`; it will be identified at pipeline build time\n* Some component inputs have to be annotated in order to narrow down their identity; the pipeline input can be requested with the `@Input` annotation\n* The step's output is 
expected to be a `Result` subtype; here we chose to go with a dedicated `record`\n\n```java\npublic class Tokenizer\n{\n    @StepConfig(id = \"tokenizer\")\n    public TokenizedSentence tokenize(@Input String sentence)\n    {\n        return new TokenizedSentence(Stream.of(sentence.split(\"[^\\\\p{L}]+\"))\n            .map(String::toLowerCase)\n            .toList()\n        );\n    }\n\n    public record TokenizedSentence(\n        List\u003cString\u003e tokens\n    ) implements Result {}\n}\n```\n\nNext up, the `Matcher` step, with a blacklist specified upon instantiation.\n\nIt will recover the tokenizer's output, and produce a `Matches` record of its findings.\n\nNote the `@Current` annotation for requesting the currently known value for the tokenized sentence.\nThere is more to be said about the semantics of this annotation, which we'll cover in detail in the [documentation](doc/result_data_model.md) section.\n\n```java\npublic class Matcher\n{\n    private final Set\u003cString\u003e blacklist;\n  \n    public Matcher(String... 
blacklist)\n    {\n        this.blacklist = Set.of(blacklist);\n    }\n  \n    @StepConfig(id = \"matcher\")\n    public Matches match(@Current TokenizedSentence tokenized)\n    {\n        long wordCount = tokenized.tokens().stream().distinct().count();\n        Set\u003cString\u003e matches = tokenized.tokens().stream()\n            .filter(this.blacklist::contains)\n            .collect(Collectors.toSet())\n        ;\n    \n        return new Matches(wordCount, matches);\n    }\n    \n    public record Matches(\n        long wordCount,\n        Set\u003cString\u003e blacklistMatches\n    ) implements Result {}\n}\n```\n\nFinally, our `MatchLogger` sink works very similarly, except we need the `@SinkConfig` annotation.\n\n```java\npublic class MatchLogger\n{\n    private static final Logger logger = LoggerFactory.getLogger(MatchLogger.class);\n\n    @SinkConfig(id = \"logger\")\n    public void log(@Current TokenizedSentence tokenized, @Current Matches matches)\n    {\n        logger.info(\"Found {} unique tokens in {}, with {} blacklisted {}\", matches.wordCount(), tokenized.tokens(), matches.blacklistMatches().size(), matches.blacklistMatches());\n    }\n}\n```\n\n### Setting-up the pipeline\n\nNow that we have all our building blocks, creating a `Pipeline` is simply a matter of combining them.\n\nThe `Pipeline` interface offers a builder initialization method, we'll start from there.\n\n```java\nPipeline\u003cString\u003e pipeline = Pipeline.\u003cString\u003eof(\"string-processor\")\n    .registerStep(new Tokenizer())\n    .registerStep(new Matcher(\"mostly\", \"relatively\"))\n    .registerSink(new MatchLogger())\n    .build()\n;\n```\n\nNow, calling the pipeline with some sentences:\n\n```java\npipeline.run(\"This is a relatively short and mostly meaningless sentence.\");\npipeline.run(\"This is a much longer sentence that should go through the blacklist unscathed.\");\npipeline.run(\"Relatively cool objects (temperatures less than several thousand degrees) 
emit their radiation primarily in the infrared, as described by Planck's law.\");\npipeline.run(\"The principles were deliberately non dogmatic, since the brotherhood wished to emphasise the personal responsibility of individual artists to determine their own ideas and methods of depiction.\");\npipeline.run(\"The Mystical Nativity, a relatively small and very personal painting, perhaps for his own use, appears to be dated to the end of 1500.\");\n```\n\nWe should get the following output (given a properly configured `simplelogger` or similar):\n\n```\n[main] INFO MatchLogger - Found 9 unique tokens in [this, is, a, relatively, short, and, mostly, meaningless, sentence], with 2 blacklisted [mostly, relatively]\n[main] INFO MatchLogger - Found 13 unique tokens in [this, is, a, much, longer, sentence, that, should, go, through, the, blacklist, unscathed], with 0 blacklisted []\n[main] INFO MatchLogger - Found 22 unique tokens in [relatively, cool, objects, temperatures, less, than, several, thousand, degrees, emit, their, radiation, primarily, in, the, infrared, as, described, by, planck, s, law], with 1 blacklisted [relatively]\n[main] INFO MatchLogger - Found 23 unique tokens in [the, principles, were, deliberately, non, dogmatic, since, the, brotherhood, wished, to, emphasise, the, personal, responsibility, of, individual, artists, to, determine, their, own, ideas, and, methods, of, depiction], with 0 blacklisted []\n[main] INFO MatchLogger - Found 21 unique tokens in [the, mystical, nativity, a, relatively, small, and, very, personal, painting, perhaps, for, his, own, use, appears, to, be, dated, to, the, end, of], with 1 blacklisted [relatively]\n```\n\nAs pipelines may hold resources (notably a `ServiceExecutor` for the optional async sink execution), it is best to close a pipeline when you are done using it (or consider using a `try-with-resources` block):\n\n```java\npipeline.close();\n```\n\n## IV. 
Documentation\n\n* [Pipelines](doc/pipelines.md)\n  * [Configuration](doc/pipelines.md#configuration)\n  * [Execution](doc/pipelines.md#execution)\n  * [Shutting Down](doc/pipelines.md#shutting-down)\n* [Result Data Model](doc/result_data_model.md)\n* [Initializers](doc/initializers.md)\n* [Steps](doc/steps.md)\n* [Sinks](doc/sinks.md)\n* [Function Modifiers \u0026 Hooks](doc/modifiers_and_hooks.md)\n  * [Error Handlers](doc/modifiers_and_hooks.md#error-handlers)\n  * [Wrappers](doc/modifiers_and_hooks.md#wrappers)\n  * [UID Generators](doc/modifiers_and_hooks.md#uid-generators)\n  * [Author Resolvers](doc/modifiers_and_hooks.md#author-resolvers)\n  * [Tag Resolvers](doc/modifiers_and_hooks.md#tag-resolvers)\n* [Integrations](doc/integrations.md)\n  * [Micrometer](doc/integrations.md#micrometer)\n  * [Prometheus](doc/integrations.md#prometheus)\n  * [Grafana](doc/integrations.md#grafana)\n  * [Logback Loki](doc/integrations.md#logback-loki)\n  * [Resilience4j](doc/integrations.md#resilience4j)\n\n## V. Dev Installation\n\nThis project will require you to have the following:\n\n* Java 17+\n* Git (versioning)\n* Maven (dependency resolving, publishing and packaging) \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filluin-tech%2Fdata-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Filluin-tech%2Fdata-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filluin-tech%2Fdata-pipeline/lists"}