{"id":44905242,"url":"https://github.com/iscc/fastcdc-py","last_synced_at":"2026-02-17T22:28:58.268Z","repository":{"id":56093582,"uuid":"261983134","full_name":"iscc/fastcdc-py","owner":"iscc","description":"FastCDC implementation in Python https://pypi.org/project/fastcdc/","archived":false,"fork":false,"pushed_at":"2024-06-27T13:19:33.000Z","size":347,"stargazers_count":60,"open_issues_count":6,"forks_count":17,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-08-23T05:54:45.165Z","etag":null,"topics":["chunking","chunking-algorithm","content-dependent","deduplication","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iscc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"titusz"}},"created_at":"2020-05-07T07:41:25.000Z","updated_at":"2025-08-19T12:06:27.000Z","dependencies_parsed_at":"2024-05-09T17:48:34.309Z","dependency_job_id":"63c2ce41-2a6d-4956-a125-fa271787e3d8","html_url":"https://github.com/iscc/fastcdc-py","commit_stats":{"total_commits":79,"total_committers":4,"mean_commits":19.75,"dds":0.06329113924050633,"last_synced_commit":"be08441c2a431822f9ce801b8c6093c33e836a68"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/iscc/fastcdc-py","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iscc%2Ffastcdc-py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iscc%2Ffastcdc-py/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/iscc%2Ffastcdc-py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iscc%2Ffastcdc-py/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iscc","download_url":"https://codeload.github.com/iscc/fastcdc-py/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iscc%2Ffastcdc-py/sbom","scorecard":{"id":495316,"data":{"date":"2025-08-11","repo":{"name":"github.com/iscc/fastcdc-py","commit":"081290186e5ffde720709bc038f837ee4307f809"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.3,"checks":[{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/test.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- 
score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Code-Review","score":0,"reason":"Found 1/29 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/test.yml:13: update your workflow using https://app.stepsecurity.io/secureworkflow/iscc/fastcdc-py/test.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/test.yml:14: update your workflow using https://app.stepsecurity.io/secureworkflow/iscc/fastcdc-py/test.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/test.yml:19: update your workflow using https://app.stepsecurity.io/secureworkflow/iscc/fastcdc-py/test.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/test.yml:33: update your workflow using https://app.stepsecurity.io/secureworkflow/iscc/fastcdc-py/test.yml/master?enable=pin","Info:   0 out of   3 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 third-party GitHubAction dependencies 
pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":9,"reason":"1 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-jfmj-5v4g-7637"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 2 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-19T20:18:48.758Z","repository_id":56093582,"created_at":"2025-08-19T20:18:48.758Z","updated_at":"2025-08-19T20:18:48.758Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29560562,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-17T21:50:49.831Z","status":"ssl_error","status_checked_at":"2026-02-17T21:46:15.313Z","response_time":100,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunking","chunking-algorithm","content-dependent","deduplication","python"],"created_at":"2026-02-17T22:28:57.492Z","updated_at":"2026-02-17T22:28:58.237Z","avatar_url":"https://github.com/iscc.png","language":"Python","readme":"# FastCDC\n\n[![Tests](https://github.com/titusz/fastcdc-py/workflows/Tests/badge.svg)](https://github.com/titusz/fastcdc-py/actions?query=workflow%3ATests)\n[![Version](https://img.shields.io/pypi/v/fastcdc.svg)](https://pypi.python.org/pypi/fastcdc/)\n[![Downloads](https://pepy.tech/badge/fastcdc)](https://pepy.tech/project/fastcdc)\n\nThis package implements the \"FastCDC\" content defined chunking algorithm in\nPython with optional cython support. To learn more about content\ndefined chunking and its applications, see the reference material linked below.\n\n\n## Requirements\n\n* [Python](https://www.python.org/) Version 3.7 and later. 
Tested on Linux, Mac and\nWindows\n\n## Installing\n\n```shell\n$ pip install fastcdc\n```\n\nTo enable additional support for the hash algorithms\n([xxhash](https://github.com/Cyan4973/xxHash) and\n[blake3](https://github.com/BLAKE3-team/BLAKE3/)) use\n\n```shell\n$ pip install fastcdc[hashes]\n```\n\n## Usage\n\n### Calculate chunks with default settings:\n```shell\n$ fastcdc tests/SekienAkashita.jpg\nhash=103159aa68bb1ea98f64248c647b8fe9a303365d80cb63974a73bba8bc3167d7 offset=0 size=22366\nhash=3f2b58dc77982e763e75db76c4205aaab4e18ff8929e298ca5c58500fee5530d offset=22366 size=10491\nhash=fcfb2f49ccb2640887a74fad1fb8a32368b5461a9dccc28f29ddb896b489b913 offset=32857 size=14094\nhash=bd1198535cdb87c5571378db08b6e886daf810873f5d77000a54795409464138 offset=46951 size=18696\nhash=d6347a2e5bf586d42f2d80559d4f4a2bf160dce8f77eede023ad2314856f3086 offset=65647 size=43819\n```\n\n### Customize min-size, avg-size, max-size, and hash function\n```shell\n$ fastcdc -mi 16384 -s 32768 -ma 65536 -hf sha256 tests/SekienAkashita.jpg\nhash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857\nhash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408\nhash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201\n```\n\n### Scan files in directory and report duplication.\n```shell\n$ fastcdc scan ~/Downloads\n[####################################]  100%\nFiles:          1,332\nChunk Sizes:    min 4096 - avg 16384 - max 131072\nUnique Chunks:  506,077\nTotal Data:     9.3 GB\nDupe Data:      873.8 MB\nDeDupe Ratio:   9.36 %\nThroughput:     135.2 MB/s\n```\n\n### Show help\n\n```shell\n$ fastcdc\nUsage: fastcdc [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --version  Show the version and exit.\n  --help     Show this message and exit.\n\nCommands:\n  chunkify*  Find variable sized chunks for FILE and compute hashes.\n  benchmark  Benchmark chunking performance.\n  scan       Scan 
files in directory and report duplication.\n```\n\n### Use from your python code\nThe  tests also have some short examples of using the chunker, of which this\ncode snippet is an example:\n\n```python\nfrom fastcdc import fastcdc\n\nresults = list(fastcdc(\"tests/SekienAkashita.jpg\", 16384, 32768, 65536))\nassert len(results) == 3\nassert results[0].offset == 0\nassert results[0].length == 32857\nassert results[1].offset == 32857\nassert results[1].length == 16408\nassert results[2].offset == 49265\nassert results[2].length == 60201\n```\n\n## Reference Material\n\nThe algorithm is as described in \"FastCDC: a Fast and Efficient Content-Defined\nChunking Approach for Data Deduplication\"; see the\n[paper](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf),\nand\n[presentation](https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf)\nfor details. There are some minor differences, as described below.\n\n### Differences with the FastCDC paper\n\nThe explanation below is copied from\n[ronomon/deduplication](https://github.com/ronomon/deduplication) since this\ncodebase is little more than a translation of that implementation:\n\n\u003e The following optimizations and variations on FastCDC are involved in the chunking algorithm:\n\u003e * 31 bit integers to avoid 64-bit integers for the sake of the Javascript reference implementation.\n\u003e * A right shift instead of a left shift to remove the need for an additional modulus operator, which would otherwise have been necessary to prevent overflow.\n\u003e * Masks are no longer zero-padded since a right shift is used instead of a left shift.\n\u003e * A more adaptive threshold based on a combination of average and minimum chunk size (rather than just average chunk size) to decide the pivot point at which to switch masks. 
A larger minimum chunk size now switches from the strict mask to the eager mask earlier.\n\u003e * Masks use 1 bit of chunk size normalization instead of 2 bits of chunk size normalization.\n\nThe primary objective of this codebase was to have a Python implementation with a\npermissive license, which could be used for new projects, without concern for\ndata parity with existing implementations.\n\n## Prior Art\n\nThis package started as a Python port of the implementation by Nathan Fiedler (see the\nnlfiedler link below).\n\n* [nlfiedler/fastcdc-rs](https://github.com/nlfiedler/fastcdc-rs)\n    + Rust implementation on which this code is based.\n* [ronomon/deduplication](https://github.com/ronomon/deduplication)\n    + C++ and JavaScript implementation on which the Rust implementation is based.\n* [rdedup_cdc at docs.rs](https://docs.rs/crate/rdedup-cdc/0.1.0/source/src/fastcdc.rs)\n    + An alternative implementation of FastCDC to the one in this package.\n* [jrobhoward/quickcdc](https://github.com/jrobhoward/quickcdc)\n    + Similar but slightly earlier algorithm by some of the same researchers.\n\n## Change Log\n\n## [1.7.0] - 2024-06-27\n- Performance improvement [@dw](https://github.com/dw)\n- Fixed issue with inputs smaller than min_size [@grote](https://github.com/grote)\n\n## [1.6.0] - 2024-05-09\n- added python 3.12 support\n- removed python 3.7 support\n- updated dependencies\n\n## [1.5.0] - 2023-03-13\n- added python 3.10/3.11 support\n- removed python 3.6 support\n- update dependencies\n\n## [1.4.2] - 2020-11-25\n- add binary releases to PyPI (Xie Yanbo)\n- update dependencies\n\n## [1.4.1] - 2020-09-30\n- fix issue with fat option in cython version\n- updated dependencies\n\n## [1.4.0] - 2020-08-08\n- add support for multiple paths with scan command\n- fix issue with building cython extension\n- fix issue with fat option\n- fix zero-division error\n\n## [1.3.0] - 2020-06-26\n- add new `scan` command to calculate deduplication ratio for directories\n\n## 
[1.2.0] - 2020-05-23\n\n### Added\n- faster optional cython implementation\n- benchmark command\n\n## [1.1.0] - 2020-05-09\n\n### Added\n- high-level API\n- support for streams\n- support for custom hash functions\n\n\n## [1.0.0] - 2020-05-07\n\n### Added\n- Initial release (port of [nlfiedler/fastcdc-rs](https://github.com/nlfiedler/fastcdc-rs)).\n\n","funding_links":["https://github.com/sponsors/titusz"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiscc%2Ffastcdc-py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiscc%2Ffastcdc-py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiscc%2Ffastcdc-py/lists"}