{"id":19982647,"url":"https://github.com/zabuzard/fastcdc4j","last_synced_at":"2026-03-05T21:09:59.158Z","repository":{"id":57735429,"uuid":"283296931","full_name":"Zabuzard/FastCDC4J","owner":"Zabuzard","description":"Fast and efficient content-defined chunking for data deduplication. Java implementation of FastCDC as library.","archived":false,"fork":false,"pushed_at":"2023-09-21T17:15:09.000Z","size":555,"stargazers_count":23,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-05-04T05:33:02.662Z","etag":null,"topics":["cdc","chunking","content-defined-chunking","data-deduplication","fastcdc","java","library"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Zabuzard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-07-28T18:35:55.000Z","updated_at":"2025-04-12T08:35:39.000Z","dependencies_parsed_at":"2025-05-04T05:42:36.241Z","dependency_job_id":null,"html_url":"https://github.com/Zabuzard/FastCDC4J","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/Zabuzard/FastCDC4J","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zabuzard%2FFastCDC4J","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zabuzard%2FFastCDC4J/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zabuzard%2FFastCDC4J/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/Zabuzard%2FFastCDC4J/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Zabuzard","download_url":"https://codeload.github.com/Zabuzard/FastCDC4J/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zabuzard%2FFastCDC4J/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265887548,"owners_count":23844425,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cdc","chunking","content-defined-chunking","data-deduplication","fastcdc","java","library"],"created_at":"2024-11-13T04:12:26.227Z","updated_at":"2026-03-05T21:09:59.118Z","avatar_url":"https://github.com/Zabuzard.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FastCDC4J\n\n[![codefactor](https://img.shields.io/codefactor/grade/github/Zabuzard/FastCDC4J)](https://www.codefactor.io/repository/github/zabuzard/fastcdc4j)\n[![maven-central](https://img.shields.io/maven-central/v/io.github.zabuzard.fastcdc4j/fastcdc4j)](https://search.maven.org/search?q=g:io.github.zabuzard.fastcdc4j)\n[![javadoc](https://javadoc.io/badge2/io.github.zabuzard.fastcdc4j/fastcdc4j/javadoc.svg?style=flat\u0026color=AA82FF)](https://javadoc.io/doc/io.github.zabuzard.fastcdc4j/fastcdc4j)\n![Java](https://img.shields.io/badge/Java-14%2B-ff696c)\n[![license](https://img.shields.io/github/license/Zabuzard/FastCDC4J)](https://github.com/Zabuzard/FastCDC4J/blob/master/LICENSE)\n\nFastCDC4J is a fast and efficient content-defined chunking solution for\ndata deduplication implementing the FastCDC 
algorithm and offering the\nfunctionality as a simple library.\n\nIt splits files into chunks based on their content.\nChunks are created deterministically and will likely be preserved even if the\nfile is modified or data is moved, so it can be used for data deduplication.\nIt offers chunking of:\n\n* `InputStream`\n* `byte[]`\n* `Path`, including directory traversal\n* `Stream\u003cPath\u003e`\n\nBy utilizing the following built-in chunkers:\n\n* FastCDC - Wen Xia et al. ([publication](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf))\n* modified FastCDC - Nathan Fiedler ([source](https://github.com/nlfiedler/fastcdc-rs))\n* Fixed-Size-Chunking\n\nand providing a high degree of customizability by offering\nways to manipulate the algorithm.\n\nThe main interface of the chunkers provides the following methods:\n\n* `Iterable\u003cChunk\u003e chunk(InputStream stream, long size)`\n* `Iterable\u003cChunk\u003e chunk(final byte[] data)`\n* `Iterable\u003cChunk\u003e chunk(final Path path)`\n* `Iterable\u003cChunk\u003e chunk(final Stream\u003c? extends Path\u003e paths)`\n\n# Requirements\n\n* Requires at least **Java 14**\n\n# Download\n\nMaven:\n\n```xml\n\u003cdependency\u003e\n   \u003cgroupId\u003eio.github.zabuzard.fastcdc4j\u003c/groupId\u003e\n   \u003cartifactId\u003efastcdc4j\u003c/artifactId\u003e\n   \u003cversion\u003e1.3\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nJar downloads are available from the [release section](https://github.com/ZabuzaW/FastCDC4J/releases).\n\n# Documentation\n\n* [API Javadoc](https://javadoc.io/doc/io.github.zabuzard.fastcdc4j/fastcdc4j)\n  or alternatively from the [release section](https://github.com/ZabuzaW/FastCDC4J/releases)\n\n# Getting started\n\n1. Integrate **FastCDC4J** into your project.\n   The API is contained in the module `io.github.zabuzard.fastcdc4j`.\n2. Create a chunker using `ChunkerBuilder`\n3. 
Chunk files using the methods offered by `Chunker`\n\n# Examples\nSuppose you have a directory with many files that is frequently\nmodified, and the results have to be uploaded to a server.\nHowever, you want to skip the upload for data that was already uploaded in the past.\n\nHence, you want to chunk your files and set up a local chunk file cache.\nIf a chunk is already in the cache from a previous upload, its upload can be skipped.\n\n![Build example](https://i.imgur.com/kieqJtM.png)\n![Cache example](https://i.imgur.com/o41I3n3.png)\n\n```java\nvar buildPath = ...\nvar cachePath = ...\n\nvar chunker = new ChunkerBuilder().build();\nvar chunks = chunker.chunk(buildPath);\n\nfor (Chunk chunk : chunks) {\n    var chunkPath = cachePath.resolve(chunk.getHexHash());\n    if (!Files.exists(chunkPath)) {\n        Files.write(chunkPath, chunk.getData());\n        // Upload chunk ...\n    }\n}\n```\n\nEven if files in the build are modified or data is shifted around,\nchunks will likely be preserved, resulting in efficient data deduplication.\n\n***\n\nDirectory traversal is single-threaded; however, multi-threaded\nchunking on a per-file basis can be implemented easily:\n\n```java\nConsumer\u003c? 
super Iterable\u003cChunk\u003e\u003e chunkAction = ...\n\n// Files.walk has poor multi-threading characteristics, use a List instead\nvar files = Files.walk(buildPath)\n    .filter(Files::isRegularFile)\n    .collect(Collectors.toList());\n\nfiles.parallelStream()\n    .map(chunker::chunk)\n    .forEach(chunkAction);\n```\n\n# Builder\n\nThe chunker builder `ChunkerBuilder` offers highly customizable algorithms.\nThe built-in chunkers are:\n\n* `FastCDC`\n* `Nlfiedler Rust` - a modified variant of `FastCDC`\n* `Fixed Size Chunking`\n\nIt is also possible to add custom chunkers either by implementing\nthe interface `Chunker` or by implementing the simplified\ninterface `IterativeStreamChunkerCore`.\nA chunker can be set by using `setChunkerOption(ChunkerOption)`,\n`setChunkerCore(IterativeStreamChunkerCore)` or `setChunker(Chunker)`.\n\n***\n\nThe chunkers strive for an expected chunk size\nsettable by `setExpectedChunkSize(int)`, with a minimal size given\nby `setMinimalChunkSizeFactor(double)` and a maximal size given\nby `setMaximalChunkSizeFactor(double)`.\n\n***\n\nMost of the chunkers internally use a hash table as a source of\npredictable noise to steer the algorithm. A custom table can be\nprovided by `setHashTable(long[])`.\nAlternatively, `setHashTableOption(HashTableOption)` can be used\nto choose from predefined tables:\n\n* `RTPal`\n* `Nlfiedler Rust`\n\n***\n\nThe algorithms are heavily steered by masks which define the cut-points.\nBy default, they are generated randomly using a fixed seed that can\nbe changed by using `setMaskGenerationSeed(long)`.\n\nDifferent techniques are available to generate masks;\nthey can be set using `setMaskOption(MaskOption)`:\n* `FastCDC`\n* `Nlfiedler Rust`\n\nTo achieve a distribution of chunk sizes as close as possible to\nthe expected size, normalization levels are used during mask generation.\n`setNormalizationLevel(int)` is used to change the level.\nThe higher the level, the closer the sizes are to 
the expected size,\nat the cost of a worse deduplication rate.\n\nAlternatively, masks can be set manually using `setMaskSmall(long)`\nfor the mask used when the chunk is still smaller than the expected\nsize and `setMaskLarge(long)` for larger chunks.\n\n***\n\nAfter a chunk has been read, a hash is generated based on its content.\nThe algorithm used for this process can be set by `setHashMethod(String)`;\nit has to be supported and accepted by `java.security.MessageDigest`.\n\n***\n\nFinally, a chunker using the selected properties can be created using `build()`.\n\nThe **default configuration** of the builder is:\n* Chunker option: `ChunkerOption#FAST_CDC`\n* Expected size: `8 * 1024`\n* Minimal size factor: `0.25`\n* Maximal size factor: `8`\n* Hash table option: `HashTableOption#RTPAL`\n* Mask generation seed: `941568351`\n* Mask option: `MaskOption#FAST_CDC`\n* Normalization level: `2`\n* Hash method: `SHA-1`\n\nThe methods `fastCdc()`, `nlFiedlerRust()` and `fsc()` can be used to\nget a configuration that uses the given algorithms as originally proposed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzabuzard%2Ffastcdc4j","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzabuzard%2Ffastcdc4j","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzabuzard%2Ffastcdc4j/lists"}