{"id":15646877,"url":"https://github.com/edsu/dedoop","last_synced_at":"2025-04-30T12:35:19.749Z","repository":{"id":62567350,"uuid":"128249595","full_name":"edsu/dedoop","owner":"edsu","description":"recursively deduplicate a directory and write its contents to a new directory while remembering the old paths","archived":false,"fork":false,"pushed_at":"2020-09-22T11:59:51.000Z","size":930,"stargazers_count":47,"open_issues_count":1,"forks_count":0,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-10-03T12:19:47.519Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edsu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-04-05T18:45:17.000Z","updated_at":"2024-05-14T07:47:24.000Z","dependencies_parsed_at":"2022-11-03T16:30:30.246Z","dependency_job_id":null,"html_url":"https://github.com/edsu/dedoop","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fdedoop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fdedoop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fdedoop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fdedoop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edsu","download_url":"https://codeload.github.com/edsu/dedoop/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221114147,"owners_
count":16758584,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-03T12:15:33.103Z","updated_at":"2024-10-22T16:20:26.306Z","avatar_url":"https://github.com/edsu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## dedoop\n\n[![Build Status](https://secure.travis-ci.org/edsu/dedoop.png)](http://travis-ci.org/edsu/dedoop)\n\nIn [digital preservation] work you sometimes may find yourself accepting a disk\nor random assortment of files, and want to examine all of them looking for\nduplicates and copy them to a new location in a uniform way, while preserving\nthe original paths as metadata to help you process the data. Ok, maybe this is a\nbit of a niche use case, but this is what *dedoop* was created for.\n\n*dedoop* will recursively read a source directory of files and write them out to\na new target directory or bucket in the cloud using the file's SHA256 checksum\nas the filename. If a given file occurs more than once in the source\ndirectory it will only be written once to the target location. File metadata\nsuch as the media type and original file name will be persisted in a JSON file\nthat is output at the end of the process. 
In the case of writing to the cloud,\nobject metadata will be used to store this information.\n\n## Install\n\nInstall Python 3 and:\n\n```\n% pip3 install dedoop\n```\n\n## Usage \n\n### Add to Storage\n\nTo add a directory of data to the storage location you can:\n\n    % dedoop add path/to/source path/to/target\n\nSo for example if the source directory looks like this:\n\n    source\n    ├── a.jpg\n    ├── a.png\n    ├── b.jpg\n    └── c\n        ├── a.jpg\n        └── b.jpg\n\nThe resulting target could look like this (assuming the files of the same name\nhad the same contents that hashed to these values):\n\n    target\n    ├── 1e89b90b5973baad2e6c3294ffe648ff53ab0b9d75188e9fbb8b38deb9ba3341.png\n    ├── 45d257c93e59ec35187c6a34c8e62e72c3e9cfbb548984d6f6e8deb84bac41f4.jpg\n    └── b6df8058fa818acfd91759edffa27e473f2308d5a6fca1e07a79189b95879953.jpg\n\n## Add to the Cloud\n\nYou can also write files to any cloud storage provider that is [supported] by [libcloud],\nsuch as Amazon S3, Google Cloud Storage, etc.\n\n## Limit by File Extension\n\nIf you like you can limit the types of files that are added by using the\n*--extensions* command line option and giving it a comma-separated list of file\nextensions to include. All non-matching files (case-insensitive) will be\nignored.\n\n    % dedoop add --extensions jpg,png path/to/source path/to/target\n\n## List Cloud Files\n\nIt's easy to list files on the file system. But it's more difficult to see what's\nin the cloud, especially with the metadata dedoop has attached to each object.\nThe *ls* command will do that for you.\n\n    % dedoop ls s3://my-storage-location/\n\n## Logging\n\nIf you use *--verbose* you will see log messages on the console about what is\nhappening. 
You can optionally send these messages to a log file of your choosing\nusing the *--log* option.\n\n[digital preservation]: https://en.wikipedia.org/wiki/Digital_preservation\n[libcloud]: https://libcloud.readthedocs.io\n[supported]: https://libcloud.readthedocs.io/en/stable/storage/supported_providers.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedsu%2Fdedoop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedsu%2Fdedoop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedsu%2Fdedoop/lists"}