https://github.com/edsu/dedoop
recursively deduplicate a directory and write its contents to a new directory while remembering the old paths
- Host: GitHub
- URL: https://github.com/edsu/dedoop
- Owner: edsu
- Created: 2018-04-05T18:45:17.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2020-09-22T11:59:51.000Z (over 5 years ago)
- Last Synced: 2024-10-03T12:19:47.519Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 908 KB
- Stars: 47
- Watchers: 7
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
## dedoop
In [digital preservation] work you may find yourself accepting a disk or a
random assortment of files, and want to examine all of them for duplicates and
copy them to a new location in a uniform way, while preserving the original
paths as metadata to help you process the data. OK, maybe this is a bit of a
niche use case, but it is what *dedoop* was created for.
*dedoop* will recursively read a source directory of files and write them out to
a new target directory, or to a bucket in the cloud, using each file's SHA256
checksum as the filename. If a given file occurs more than once in the source
directory, it will only be written once to the target location. File metadata
such as the media type and original file name will be persisted in a JSON file
that is output at the end of the process. When writing to the cloud, object
metadata will be used to store this information instead.
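The process described above can be sketched in a few lines of Python. This is a minimal illustration of checksum-based deduplication to a local directory, not dedoop's actual implementation: the function names, the `manifest.json` file name, and the layout of the JSON are all assumptions made for the example.

```python
import hashlib
import json
import mimetypes
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe(source, target):
    """Copy each unique file under source into target, named by checksum."""
    source, target = Path(source), Path(target)
    target.mkdir(parents=True, exist_ok=True)
    manifest = {}
    # Collect and sort the paths first so the walk order is deterministic.
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        checksum = sha256_of(path)
        entry = manifest.get(checksum)
        if entry is None:
            # First time this content is seen: copy it exactly once.
            shutil.copyfile(path, target / (checksum + path.suffix.lower()))
            entry = manifest[checksum] = {
                "media_type": mimetypes.guess_type(path.name)[0],
                "paths": [],
            }
        # Always remember where the content originally lived.
        entry["paths"].append(path.relative_to(source).as_posix())
    # "manifest.json" is a hypothetical name chosen for this sketch.
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Keying the manifest on the checksum means a duplicate file only costs one more entry in its `paths` list, which is how the original locations survive even though the content is stored once.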
## Install
Install Python 3 and:
```
% pip3 install dedoop
```
## Usage
### Add to Storage
To add a directory of data to the storage location, run:

    % dedoop add path/to/source path/to/target

So, for example, if the source directory looks like this:

    source
    ├── a.jpg
    ├── a.png
    ├── b.jpg
    └── c
        ├── a.jpg
        └── b.jpg

The resulting target could look like this (assuming that files with the same
name have the same contents and hash to these values):

    target
    ├── 1e89b90b5973baad2e6c3294ffe648ff53ab0b9d75188e9fbb8b38deb9ba3341.png
    ├── 45d257c93e59ec35187c6a34c8e62e72c3e9cfbb548984d6f6e8deb84bac41f4.jpg
    └── b6df8058fa818acfd91759edffa27e473f2308d5a6fca1e07a79189b95879953.jpg

### Add to the Cloud
You can also write files to any cloud storage provider that is [supported] by [libcloud],
such as Amazon S3, Google Cloud Storage, etc.
### Limit by File Extension
To limit the types of files that are added, pass the *--extensions* command
line option a comma-separated list of file extensions to include. Matching is
case insensitive, and all non-matching files are ignored.

    % dedoop add --extensions jpg,png path/to/source path/to/target

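The case-insensitive matching described above could be implemented along these lines. This is a sketch only: `parse_extensions` and `matches` are hypothetical helper names, not functions taken from dedoop.

```python
from pathlib import Path


def parse_extensions(value):
    """Turn a string like 'jpg,PNG' into the set {'.jpg', '.png'}."""
    return {
        "." + ext.strip().lstrip(".").lower()
        for ext in value.split(",")
        if ext.strip()
    }


def matches(path, extensions):
    """Return True if the path's extension is in the set, ignoring case."""
    return Path(path).suffix.lower() in extensions
```

Normalizing both the option value and each file's suffix to lower case means `A.JPG` and `a.jpg` are treated the same way, whichever form the user typed.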
### List Cloud Files
It's easy to list files on the file system, but it's more difficult to see
what's in the cloud, especially with the metadata dedoop has attached to each
object. The *ls* command will do that for you.

    % dedoop ls s3://my-storage-location/

### Logging
If you use *--verbose* you will see log messages on the console about what is
happening. You can optionally send these messages to a log file of your choosing
using the *--log* option.
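
For readers curious how two options like these are commonly wired up in a Python command line tool, here is a small sketch using `argparse` and the standard `logging` module. It is an assumption about the general pattern, not dedoop's own code: the logger name `dedoop-sketch` and both function names are made up for the example, and whether dedoop requires *--verbose* for *--log* to take effect is not something this sketch claims to know.

```python
import argparse
import logging


def build_parser():
    """Build a parser with the two logging-related options."""
    parser = argparse.ArgumentParser(prog="dedoop-sketch")
    parser.add_argument("--verbose", action="store_true",
                        help="print log messages to the console")
    parser.add_argument("--log", metavar="FILE",
                        help="write log messages to FILE")
    return parser


def configure_logging(args):
    """Attach console and/or file handlers according to the options."""
    logger = logging.getLogger("dedoop-sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    logger.handlers.clear()
    if args.verbose:
        logger.addHandler(logging.StreamHandler())
    if args.log:
        logger.addHandler(logging.FileHandler(args.log))
    if not logger.handlers:
        # Neither option given: swallow messages silently.
        logger.addHandler(logging.NullHandler())
    return logger
```
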
[digital preservation]: https://en.wikipedia.org/wiki/Digital_preservation
[libcloud]: https://libcloud.readthedocs.io
[supported]: https://libcloud.readthedocs.io/en/stable/storage/supported_providers.html