{"id":34553674,"url":"https://github.com/ross-spencer/sumfolder1","last_synced_at":"2025-12-24T08:21:54.505Z","repository":{"id":144696698,"uuid":"585151971","full_name":"ross-spencer/sumfolder1","owner":"ross-spencer","description":"What is the checksum of a directory?","archived":false,"fork":false,"pushed_at":"2024-03-25T09:27:38.000Z","size":114,"stargazers_count":8,"open_issues_count":3,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-29T18:42:12.307Z","etag":null,"topics":["authenticity","checksum","code4lib","digipres","digital-preservation","merkle-tree","pronom"],"latest_commit_sha":null,"homepage":"https://openpreservation.org/blogs/what-is-the-checksum-of-a-directory/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ross-spencer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-01-04T12:59:59.000Z","updated_at":"2025-05-14T23:29:27.000Z","dependencies_parsed_at":"2024-03-25T10:52:15.552Z","dependency_job_id":null,"html_url":"https://github.com/ross-spencer/sumfolder1","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/ross-spencer/sumfolder1","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ross-spencer%2Fsumfolder1","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ross-spencer%2Fsumfolder1/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ross-spencer%2Fsumfolder1/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/ross-spencer%2Fsumfolder1/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ross-spencer","download_url":"https://codeload.github.com/ross-spencer/sumfolder1/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ross-spencer%2Fsumfolder1/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27998479,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-24T02:00:07.193Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["authenticity","checksum","code4lib","digipres","digital-preservation","merkle-tree","pronom"],"created_at":"2025-12-24T08:21:54.037Z","updated_at":"2025-12-24T08:21:54.495Z","avatar_url":"https://github.com/ross-spencer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- markdownlint-disable --\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg\n   width=\"786\"\n   height=\"204\"\n   alt=\"Logo for sumfolder1\"\n   src=\"https://raw.githubusercontent.com/ross-spencer/sumfolder1/main/logo/sumfolder1.png\"\u003e\n\u003c/p\u003e\n\u003c!-- markdownlint-enable --\u003e\n\nsumfolder1 is a utility for use within the archival and digital preservation\ncommunity to generate checksums for file system directories, and to generate\nan overall \"collection\" checksum for a given 
set of files.\n\n\u003c!-- TOC Generator: https://luciopaiva.com/markdown-toc/ --\u003e\n\n* [Why?](#why)\n  * [Archival questions](#archival-questions)\n  * [Structural questions](#structural-questions)\n  * [Forensics questions](#forensics-questions)\n* [How?](#how)\n  * [Reference set](#reference-set)\n  * [Reference implementation](#reference-implementation)\n  * [Merkle trees](#merkle-trees)\n  * [Terminology](#terminology)\n  * [New folder attributes](#new-folder-attributes)\n  * [Sensitivity](#sensitivity)\n* [DROID](#droid)\n  * [DROID in Siegfried](#droid-in-siegfried)\n  * [DROID as an inspiration](#droid-as-an-inspiration)\n  * [Writing about sumfolder1](#writing-about-sumfolder1)\n* [Installation](#installation)\n* [Usage](#usage)\n  * [Demo output](#demo-output)\n  * [Use with a DROID csv](#use-with-a-droid-csv)\n  * [Outputting the reference CSV](#outputting-the-reference-csv)\n* [Previous work](#previous-work)\n* [License](#license)\n\n## Why?\n\nConventionally, checksums exist for files but not for directories.\nDirectories have no payload that can be summed together to calculate a\ndigest/checksum.\n\nIf it were possible to create checksums for folders or a global checksum for a\ncollection of objects, it would become possible to ask the following:\n\n### Archival questions\n\n* What is the collection checksum for a given set of files and folders?\n* What is the checksum for a given folder?\n* Given a collection of objects online, am I looking at an authentic listing?\n* Have I downloaded a collection in its entirety?\n\n### Structural questions\n\n* Is file/folder hash(x) included in the collection set?\n* Is file/folder hash(y) (non-existent) part of the entire set?\n* Is file hash(x) part of folder(y) where the collection has arbitrary depth?\n* Where are duplicate checksums located within a collection?\n\n### Forensics questions\n\n* Has a digital object been removed from the collection?\n* Did the collection contain at least one empty 
directory?\n\n## How?\n\nGiven a set of file paths and existing checksums it is possible to compute a\nchecksum for a folder by creating a checksum of the given checksums.\n\nGiven checksum 1) `7c1f9f9a4d0ce9a72ee63f37a1b7f694` and checksum 2)\n`aececec0bc3f515039aec9e60c413cd3` an MD5 can be computed as:\n`82f9e9a4305714fffdd7932783980cbc`.\n\nWe can see this illustrated for a small collection as follows:\n\n```text\n📁 folder_1 82f9e9a4305714fffdd7932783980cbc\n    📄 checksum_1 7c1f9f9a4d0ce9a72ee63f37a1b7f694\n    📄 checksum_2 aececec0bc3f515039aec9e60c413cd3\n```\n\nIf we follow this approach through an entire directory structure we can create\nchecksums for all sub-directories and for the collection as a whole.\n\n### Reference set\n\nA reference set is provided with this repository: [reference set](reference/collection.7z).\n\nWe can iterate through the directory tree to create sets of directory checksums\nand a collection checksum: `52b94608dc70813aa88dae01176dc73b`.\n\nThe reference set then looks as follows:\n\n```text\n📁 collection 93778c524035d5d3e429a2fe43b7700a\n   📄 file_0001 14118ff9ad0344decb37960809b2f17a\n   📄 file_0000 8cfda2609b880a553759cd6200823f3b\n   📄 file_0002 a4501ee1a5c711ea9db78a800a24e830\n   📁 sub_dir_1 82301616d7e24f474dbe21de93af0a34\n      📄 file_empty d41d8cd98f00b204e9800998ecf8427e\n      📄 file_0003 dc7f828c5fe622925181d06edada350f\n      📄 file_0004 e3d90a4bf14a9b355f0e69ba08df522d\n      📁 sub_1_dir_1 1c7ba27edf1356d097a3f568032430c2\n         📄 file_0005 637a3fb7da1ab61d10e96336d9758416\n   📁 sub_dir_2 1ccb49edc4e873f1a8affd4bad5e9b90\n   📁 sub_dir_3 2a60541cede91a36e9dc5bab7a97dd6e\n      📁 sub_3_empty_1 db9d848b4f83ff3cb3faa4df0a59e3e1\n         📁 sub_3_empty_2 1ccb49edc4e873f1a8affd4bad5e9b90\n   📁 sub_dir_4 272d45767d534335163f220c1d40e559\n      📄 file_0006 2b43227486ec8744cd5d4c955d269743\n      📄 file_0007 c5a1973a70e08bf1eee13b8090f790ad\n      📄 file_0008 fdffe4dd2d39c7d9986dbf5c6ec5ad39\n   📁 sub_dir_5 
d818d29b75f89a9b5d8d1c5a4c70dbbb\n      📁 sub_5_dir_1 82f9e9a4305714fffdd7932783980cbc\n         📄 file_0009 7c1f9f9a4d0ce9a72ee63f37a1b7f694\n         📄 file_0010 aececec0bc3f515039aec9e60c413cd3\n   📁 sub_dir_6 74be16979710d4c4e7c6647856088456\n      📄 file_empty d41d8cd98f00b204e9800998ecf8427e\n\n```\n\n### Reference implementation\n\nThe reference implementation for sumfolder1 does the following:\n\nFrom the lowest sub-directory in the tree:\n\n1. Check for sub-directories and add the checksums for these to a hash digest in\nalphabetical order by checksum.\n1. For files in the directory, add these to the hash digest in alphabetical order\nby checksum.\n1. Create a digest for the list of checksums.\n\nRepeat, processing each folder backwards up to the top level.\n\n\u003e NB. If a folder is completely empty it is assigned a constant value\nchosen in the code: `2600_EMPTY_DIRECTORY`. This evaluates to an MD5 value of\n`1ccb49edc4e873f1a8affd4bad5e9b90`.\n\n### Merkle trees\n\nThe concept I have used here is based on Merkle trees and a loose understanding\nof techniques used in blockchains and in the source control system Git.\n\nA good video summary of Merkle trees can be found on YouTube:\n\n* [Gaurav Sen on Merkle Trees][merkle-1]\n\nAnd a Python tutorial I found useful in starting this work:\n\n* [Dan Nolan on Merkle Trees][merkle-2]\n\nThe technique required for a directory tree is a little more convoluted than\nthat of a Merkle tree, which uses binary nodes and evaluates checksums from left\nto right. I believe the implementation used for sumfolder1 is more closely\naligned to that of a \"Radix Tree\" or \"Patricia Tree\"; however, this is to be\nexplored more.\n\n\u003e NB. 
A Merkle tree can be used in its context for performance; sumfolder1 does\nnot yet have a performance use-case.\n\n### Terminology\n\nThe reference implementation introduces some terminology that helps with\nunderstanding the approach:\n\n* Active-tree: the side of a directory tree that we're querying about a given\nhash.\n* Non-active-tree: the trees at root node (Rn+1) that do not contain the digital\nobject that we're querying.\n* Root-node (Rn): the name of the top-level node, i.e. the collection folder. This\nis either artificially created for a set of directories all at the same level,\nor exists as a function of the given collection set.\n\n### New folder attributes\n\nFolder objects need to be given additional attributes to enable the algorithm\nto work.\n\n* Folder-depth, so directories can be grouped and distinguished from\none another by level in the hierarchy.\n* Hash, because the goal of this tool is to enable a hash to be calculated for\nan entire collection.\n\n### Sensitivity\n\nI am trying to make this code as portable as possible, i.e. while it works with\nDROID-style reports today, it might also work with other checksum-based outputs\ntomorrow. 
Additionally, to be able to compare folder structures, this utility\nmay also work with DROID-style reports later on in a transfer workflow, at which\npoint folders and files may have been renamed, but their content remains\nconsistent.\n\nTo calculate a single folder checksum we currently do the following:\n\n* If there are folders in the directory, order their hashes alphabetically\nand add them to a list.\n* File checksums are then ordered alphabetically and added to the end of the\nlist.\n* The checksums are then summed together to create a new folder-level checksum.\n\n## DROID\n\nsumfolder1 uses the DROID format identification report to generate folder-level\nchecksums.\n\nDROID can be found on The National Archives (UK) website:\n\n* [DROID @ The National Archives][droid-1]\n\n### DROID in Siegfried\n\nsumfolder1 can also be used with DROID-compatible reports created by Siegfried\nusing a command such as the following:\n\n```bash\nsf --hash=md5 --droid \u003ccollection_folder\u003e\n```\n\n### DROID as an inspiration\n\nFile format reports provide a means of statically analyzing collections of\ndigital objects. A DROID report satisfies the pre-conditions required to create\nreliable folder- and collection-level checksums for digital collections:\n\n* A collection is static, i.e. 
unlikely to change.\n* Digital objects within the collection have checksums.\n\n\u003e NB: A collection need not be static to be analyzed, but changing collections\nare not the primary use-case of this utility.\n\nMore information about the different uses for a file-format identification\nreport can be found in my paper in the Code4Lib journal.\n\n* [Fractal in detail: What information is in a file format identification report?][code4lib-1]\n\n### Writing about sumfolder1\n\nI wrote a blog post describing the utility on the OPF website.\n\n* [What is the checksum of a directory?][opf-1]\n\n[opf-1]: https://openpreservation.org/blogs/what-is-the-checksum-of-a-directory/?q=1\n\n## Installation\n\nsumfolder1 is available on PyPI and can be installed as follows:\n\n```bash\npip install -U sumfolder1\n```\n\n## Usage\n\nsumfolder1 has the following usage instructions:\n\n```text\nusage: sumfolder1.py [-h] [--csv CSV] [--demo] [--ref] [-v]\n\nCalculate checksums for folders in a collection of objects using a DROID format\nidentification report\n\noptions:\n  -h, --help          show this help message and exit\n  --csv CSV           Single DROID CSV to read.\n  --demo              Run demo queries and output a tree to demo.txt\n  --ref, --reference  Write reference set to stdout.\n  -v, --version       Return version information.\n```\n\n### Demo output\n\nsumfolder1's demo output can be invoked as follows:\n\n```bash\npython sumfolder1 --demo\n```\n\nJSON will be output to `stdout` describing a handful of queries generated using\nthe reference collection.\n\nA visualization of the collection tree will be output (for demo purposes) to\n`stderr`.\n\n### Use with a DROID csv\n\nGiven a DROID CSV, the tool can be invoked as follows:\n\n```bash\npython sumfolder1 --csv \u003cdroid_csv_file\u003e\n```\n\n### Outputting the reference CSV\n\nA reference CSV can be output to `stdout`. 
Ideally it is redirected to a\nfile using a command such as the following:\n\n```bash\npython sumfolder1 --ref \u003e \u003coutput_file\u003e\n```\n\n## Previous work\n\nPrevious work in this area:\n\n* Check out [direct-dedupe-1] from Stefana Breitwieser, which I was recently made\naware of via the BitCurator Forum 2024. It provides a shell script to calculate\nchecksums for sub-directories, a very pragmatic way to help users\ndedupe at the folder level.\n\n## License\n\nThis work is licensed under the GNU General Public License, Version 3.\n\n[droid-1]: https://www.nationalarchives.gov.uk/information-management/manage-information/preserving-digital-records/droid/\n[code4lib-1]: https://www.nationalarchives.gov.uk/information-management/manage-information/preserving-digital-records/droid/\n[merkle-1]: https://www.youtube.com/watch?v=qHMLy5JjbjQ\n[merkle-2]: https://medium.com/building-blocks-on-the-chain/learn-merkle-trees-by-programming-your-own-4f0438d40063\n[direct-dedupe-1]: https://github.com/stefanabreitwieser/direct-dedupe/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fross-spencer%2Fsumfolder1","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fross-spencer%2Fsumfolder1","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fross-spencer%2Fsumfolder1/lists"}