{"id":26594280,"url":"https://github.com/q3769/chunk4j","last_synced_at":"2026-04-18T02:31:50.164Z","repository":{"id":44136635,"uuid":"431690020","full_name":"q3769/chunk4j","owner":"q3769","description":"A Java API to chop up larger data blobs into smaller \"chunks\" of a pre-defined size, and stitch the chunks back together to restore the original data when needed.","archived":false,"fork":false,"pushed_at":"2025-11-20T19:06:06.000Z","size":427,"stargazers_count":4,"open_issues_count":8,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-01-11T18:31:06.898Z","etag":null,"topics":["amazon-sqs","cache-size-limit","data-serialization","data-size-convert","data-transmission","distributed-systems","event-driven","java","java-message-service","java-serialization","kafka","memcached","message-oriented-middleware","message-size-limit","messaging","messaging-library","middleware","middleware-framework","redis"],"latest_commit_sha":null,"homepage":"https://q3769.github.io/chunk4j/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/q3769.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"q3769","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"lfx_crowdfunding":null,"custom":null}},"created_at":"2021-11-25T02:28:52.000Z","updated_at":"2025-10-05T14:20:36.000Z","dependencies_parsed_a
t":"2024-04-19T20:31:26.926Z","dependency_job_id":null,"html_url":"https://github.com/q3769/chunk4j","commit_stats":null,"previous_names":["q3769/chunks"],"tags_count":41,"template":false,"template_full_name":null,"purl":"pkg:github/q3769/chunk4j","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/q3769%2Fchunk4j","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/q3769%2Fchunk4j/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/q3769%2Fchunk4j/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/q3769%2Fchunk4j/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/q3769","download_url":"https://codeload.github.com/q3769/chunk4j/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/q3769%2Fchunk4j/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31953769,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T00:39:45.007Z","status":"online","status_checked_at":"2026-04-18T02:00:07.018Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amazon-sqs","cache-size-limit","data-serialization","data-size-convert","data-transmission","distributed-systems","event-driven","java","java-message-service","java-serialization","kafka","memcached","message-oriented-middleware","message-size-limit","mess
aging","messaging-library","middleware","middleware-framework","redis"],"created_at":"2025-03-23T15:52:26.624Z","updated_at":"2026-04-18T02:31:50.158Z","avatar_url":"https://github.com/q3769.png","language":"Java","funding_links":["https://github.com/sponsors/q3769"],"categories":[],"sub_categories":[],"readme":"[![Maven Central](https://img.shields.io/maven-central/v/io.github.q3769/chunk4j.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:%22io.github.q3769%22%20AND%20a:%22chunk4j%22)\n\n# chunk4j\n\nA Java API to chop up larger data blobs into smaller \"chunks\" of a pre-defined size, and stitch the chunks back together\nto restore the original data when needed.\n\n## User story\n\nAs a user of the chunk4j API, I want to chop a data blob (bytes) into smaller pieces of a pre-defined size and, when\nneeded, restore the original data by stitching the pieces back together.\n\nNotes:\n\n- The separate processes of \"chop\" and \"stitch\" often need to happen on different network compute nodes; and the\n  chunks are transported between the nodes in a possibly random order.\n- The chunk4j API comes in handy when, at run-time, you may have to ship larger sized data entries than what is allowed\n  by the underlying transport in a distributed system. E.g. at the time of writing, the default message size limit is\n  256KB with [Amazon Simple Queue Service (SQS)](https://aws.amazon.com/sqs/) , and 1MB\n  with [Apache Kafka](https://kafka.apache.org/); the default cache entry size limit is 1MB\n  for [Memcached](https://memcached.org/), and 512MB for [Redis](https://redis.io/). Although the default limits can be\n  customized, often times, the default is there for a sensible reason. 
Meanwhile, it may be difficult to predict or\n  ensure that the size of a data entry being transported at run-time will, by its business nature, never go beyond the\n  default or customized limit.\n\n## Prerequisite\n\n* Java 8+ for versions earlier than 20250321.0.0 (exclusive)\n* Java 21+ for versions later than 20250321.0.0 (inclusive)\n\n## Get it...\n\n[![Maven Central](https://img.shields.io/maven-central/v/io.github.q3769/chunk4j.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:%22io.github.q3769%22%20AND%20a:%22chunk4j%22)\n\nInstall as a compile-scope dependency in Maven or other build tools alike.\n\n## Use it...\n\n- The implementation of chunk4j API is thread-safe.\n\n### The Chopper\n\n#### API:\n\n```java\n\n@FunctionalInterface\npublic interface Chopper {\n\n  /**\n   * Chops a byte array into a list of Chunk objects. Each Chunk object represents a portion of the original data\n   * blob. The size of each portion is determined by a pre-configured maximum size (the Chunk's capacity). If the size\n   * of the original data blob is smaller or equal to the Chunk's capacity, the returned list will contain only one\n   * Chunk.\n   *\n   * @param bytes the original data blob to be chopped into chunks\n   * @return the group of chunks which the original data blob is chopped into.\n   */\n  List\u003cChunk\u003e chop(byte[] bytes);\n}\n```\n\nA larger blob of data can be chopped up into smaller \"chunks\" to form a \"group\". 
When needed, often on a different\nnetwork node, the group of chunks can be collectively stitched back together to restore the original data.\n\n#### Usage example:\n\n```java\nimport chunk4j.Chunk;\n\npublic class MessageProducer {\n\n  private final Chopper chopper = ChunkChopper.ofByteSize(1024); // each chopped off chunk holds up to 1024 bytes\n\n  @Autowired\n  private MessagingTransport transport;\n\n  /**\n   * Sender method of business data\n   */\n  public void sendBusinessDomainData(String domainDataText) {\n    chopper.chop(domainDataText.getBytes()).forEach((chunk) -\u003e transport.send(chunkToMessage(chunk)));\n  }\n\n  /**\n   * pack/serialize/marshal the chunk POJO into a transport-specific message\n   */\n  private Message chunkToMessage(Chunk chunk) {\n    //...\n  }\n}\n```\n\nOn the `Chopper` side, you only have to say how big you want the chunks chopped up to be. The chopper will internally\ndivide up the original data bytes based on the chunk size you specified, and assign a unique group ID to all the chunks\nin the same group representing the original data unit.\n\n### The Chunk\n\n#### API:\n\n```java\n/**\n * The Chunk record represents a chunk of data that is part of a larger data blob. The data blob can\n * be chopped up into smaller chunks to form a group. 
When needed, often on a different network node\n * than the one where the data was chopped, the group of chunks can be collectively stitched back\n * together to restore the original data unit.\n *\n * @param groupId The unique identifier for the entire group of chunks representing the original\n *     data unit.\n * @param orderIndex The order index of this chunk within the chunk group.\n * @param groupSize The total number of chunks in the group.\n * @param bytes The byte array representing the data of this chunk.\n */\n@Builder\npublic record Chunk(@NonNull UUID groupId, int orderIndex, int groupSize, byte @NonNull [] bytes)\n    implements Serializable {\n\n  @Serial\n  private static final long serialVersionUID = -1879320933982945956L;\n\n  @Override\n  public boolean equals(Object o) {\n    if (this == o) {\n      return true;\n    }\n    if (o instanceof Chunk chunk) {\n      return orderIndex == chunk.orderIndex \u0026\u0026 Objects.equals(groupId, chunk.groupId);\n    }\n    return false;\n  }\n\n  @Override\n  public int hashCode() {\n    return Objects.hash(groupId, orderIndex);\n  }\n}\n```\n\n#### Usage example:\n\n`Chunk` is a simple POJO data holder, carrying a portion of the original data bytes from the `Chopper` to\nthe `Stitcher`. It is marked as a JDK `java.io.Serializable`. To transport a Chunk over the network, the API client is\nexpected to package the serialized byte array of the entire Chunk instance into a transport-specific (JSON, Kafka,\nJMS, ...) message on the Chopper's end (as in `MessageProducer#chunkToMessage` above), and unpack the bytes back to a\nChunk instance on the Stitcher's end (as in `MessageConsumer#messageToChunk` below). chunk4j will handle the rest of the\ndata assembly details.\n\n### The Stitcher\n\n#### API:\n\n```java\n\n@FunctionalInterface\npublic interface Stitcher {\n\n  /**\n   * Adds a chunk to its corresponding chunk group. 
If the chunk is the last one expected by the group, the original\n   * data bytes are restored and returned. Otherwise, the chunk group is kept around, waiting for the missing chunk(s)\n   * to arrive.\n   *\n   * @param chunk The chunk to be added to its corresponding chunk group.\n   * @return An Optional containing the original data bytes if the chunk is the last one expected by the group, or an\n   *     empty Optional otherwise.\n   */\n  Optional\u003cbyte[]\u003e stitch(Chunk chunk);\n}\n\n```\n\nOn the stitcher side, a group must gather all the previously chopped chunks before the original data blob represented by\nthis group can be stitched back together and restored.\n\n#### Usage example:\n\n```java\npublic class MessageConsumer {\n\n  private final Stitcher stitcher = new ChunkStitcher.Builder().build();\n\n  @Autowired\n  private DomainDataProcessor domainDataProcessor;\n\n  /**\n   * Suppose the run-time invocation of this method is managed by messaging provider/transport\n   */\n  public void onReceiving(Message message) {\n    stitcher.stitch(messageToChunk(message))\n        .ifPresent(originalDomainDataBytes -\u003e domainDataProcessor.process(new String(originalDomainDataBytes)));\n  }\n\n  /**\n   * unpack/deserialize/unmarshal the chunk POJO from the transport-specific message\n   */\n  private Chunk messageToChunk(Message message) {\n    //...\n  }\n}\n```\n\nIt is imperative that all received chunks be stitched by the same Stitcher instance. The instance's `stitch` method\nshould be repeatedly called on every chunk. With each call, if the input chunk is the last expected piece chopped from\nan original data unit, then the Stitcher returns a non-empty `Optional` containing the completely restored bytes of the\noriginal data unit. 
Otherwise, if the input chunk is not the last one expected of its original data unit, then the\nStitcher keeps the chunk aside and returns an empty `Optional`, indicating no original data unit can yet be restored.\n\nThe `stitch` method will only return each restored data unit once. The API client should process or retain each returned\ndata unit as the Stitcher will not keep around any already-returned data unit.\n\nThe same Stitcher instance keeps/caches all the \"pending\" chunks received via the `stitch` method in different groups;\neach group represents one original data unit. When an incoming chunk renders its own corresponding group \"complete\" -\nthat is, with that chunk, the group has gathered all the chunks of the original data unit, then\n\n- The entire group of chunks is stitched to restore the original data bytes;\n- The complete group of chunks is evicted from the Stitcher's cache;\n- The restored bytes from the evicted group are returned in an `Optional` that is non-empty, indicating the data\n  contained inside is a complete restore of the original.\n\nBy default, a stitcher caches unlimited groups of pending chunks, and a pending group of chunks will never be discarded\nno matter how much time has passed while awaiting all the chunks of the original data unit to arrive:\n\n```jshelllanguage\nnew ChunkStitcher.Builder().build()\n```\n\nBoth of those aspects, though, can be customized.\n\nThe following stitcher will discard a group of chunks if 5 seconds of\ntime have passed since the stitcher was asked to stitch the very first chunk of the group, but hasn't received all the\nchunks needed to restore the whole group back to the original data unit:\n\n```jshelllanguage\nnew ChunkStitcher.Builder().maxStitchTime(Duration.ofSeconds(5)).build()\n```\n\nThe following stitcher will discard some group(s) of chunks when there are more than 100 groups of original data pending\nrestoration:\n\n```jshelllanguage\nnew 
ChunkStitcher.Builder().maxStitchingGroups(100).build()\n```\n\nThe following stitcher is customized by a combination of both aspects:\n\n```jshelllanguage\nnew ChunkStitcher.Builder().maxStitchTime(Duration.ofSeconds(5)).maxStitchingGroups(100).build()\n```\n\n### Hints on using chunk4j API in messaging\n\n#### Chunk size/capacity\n\nchunk4j works on the application layer of the network (Layer 7). There is a small fixed-size overhead in addition to\na chunk's byte size to serialize the entire Chunk object. Take all possible overheads into account when designing\nto keep the **overall** message size under the transport limit.\n\n#### Message acknowledgment/commit\n\nWhen working with a messaging provider, you want to (at least semantically) acknowledge/commit all the messages of an\nentire group of chunks in an all-or-nothing fashion. Otherwise, in case of a consumer node crash, it is possible that\nthe group loses chunks. The loss of any chunk in a group semantically equates to the loss of the entire group, thus the\nwhole original data unit.\n\nUsually, data integrity is not an issue with a messaging provider that supports the at-least-once delivery guarantee\nto the message consumer. The Stitcher of chunk4j is tolerant and will seamlessly discard repeatedly delivered chunks. As\nlong as there is no loss of chunks per the messaging provider guarantee, the overall data integrity is assured by the\nchunk4j consumer.\n\nHowever, because the data stored by the chunk4j Stitcher at runtime is in-memory only and not persistent, the Stitcher\nwill lose all the incomplete groups of chunks if the message consumer node crashes. To avoid or recover from such loss,\nthe options include:\n\n* **The Preventative Approach**: Use a third-party API or implement a custom mechanism that works with your specific\n  messaging provider to achieve the all-or-nothing acknowledgment/commit semantics for the group. For example, in case
For example, in case\n  of TIBCO EMS, the provider support of individual explicit message acknowledgment can make this trivial to implement.\n* **The Curative Approach**: Make the loss of original data units detectable and recoverable by the application logic.\n  For example, implement a mechanism to detect the loss of an original data unit (e.g. via timeout checks), and re-send\n  the data unit (missing chunks) from the producer.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fq3769%2Fchunk4j","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fq3769%2Fchunk4j","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fq3769%2Fchunk4j/lists"}