https://github.com/edsu/dedoop

recursively deduplicate a directory and write its contents to a new directory while remembering the old paths

Host: GitHub
URL: https://github.com/edsu/dedoop
Owner: edsu
Created: 2018-04-05T18:45:17.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2020-09-22T11:59:51.000Z (over 5 years ago)
Last Synced: 2024-10-03T12:19:47.519Z (over 1 year ago)
Language: Python
Homepage:
Size: 908 KB
Stars: 47
Watchers: 7
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

## dedoop

[![Build Status](https://secure.travis-ci.org/edsu/dedoop.png)](http://travis-ci.org/edsu/dedoop)

In [digital preservation] work you may find yourself accepting a disk or a
random assortment of files, and wanting to examine all of them for duplicates
and copy them to a new location in a uniform way, while preserving the original
paths as metadata to help you process the data. OK, maybe this is a bit of a
niche use case, but it is what *dedoop* was created for.

*dedoop* will recursively read a source directory of files and write them out to
a new target directory or bucket in the cloud using the file's SHA256 checksum
as the filename. If a given file occurs more than once in the source
directory it will only be written once to the target location. File metadata
such as the media type and original file name will be persisted in a JSON file
that is output at the end of the process. In the case of writing to the cloud,
object metadata will be used to store this information.
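
The process described above can be sketched in a few lines of Python. This is a minimal illustration of the idea for a local target directory, not dedoop's actual implementation: the function names `sha256_of` and `dedupe_copy`, the layout of the returned metadata, and the choice to keep the lowercased file extension are all assumptions made for the example.

```python
import hashlib
import mimetypes
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_copy(source, target):
    """Copy each unique file under source into target, named by checksum.

    Returns a dict mapping checksum -> metadata. The metadata layout is
    hypothetical and is not the JSON format dedoop itself writes.
    """
    source, target = Path(source), Path(target)
    target.mkdir(parents=True, exist_ok=True)
    metadata = {}
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        checksum = sha256_of(path)
        entry = metadata.get(checksum)
        if entry is None:
            # First time this content is seen: copy it exactly once.
            dest = target / (checksum + path.suffix.lower())
            shutil.copyfile(path, dest)
            entry = {
                "media_type": mimetypes.guess_type(path.name)[0],
                "paths": [],
            }
            metadata[checksum] = entry
        # Remember every original path, including those of duplicates.
        entry["paths"].append(str(path.relative_to(source)))
    return metadata
```

Calling `dedupe_copy("path/to/source", "path/to/target")` returns the metadata dictionary, which could then be written out with `json.dump` once the copy has finished.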

## Install

Install Python 3 and:

```
% pip3 install dedoop
```

## Usage

### Add to Storage

To add a directory of data to the storage location you can:

    % dedoop add path/to/source path/to/target

So for example if the source directory looks like this:

    source
    ├── a.jpg
    ├── a.png
    ├── b.jpg
    └── c
        ├── a.jpg
        └── b.jpg

The resulting target could look like this (assuming the files of the same name
had the same contents that hashed to these values):

    target
    ├── 1e89b90b5973baad2e6c3294ffe648ff53ab0b9d75188e9fbb8b38deb9ba3341.png
    ├── 45d257c93e59ec35187c6a34c8e62e72c3e9cfbb548984d6f6e8deb84bac41f4.jpg
    └── b6df8058fa818acfd91759edffa27e473f2308d5a6fca1e07a79189b95879953.jpg

### Add to the Cloud

You can also write files to any cloud storage provider that is [supported] by [libcloud],
such as Amazon S3, Google Cloud Storage, etc.

### Limit by File Extension

If you like, you can limit the types of files that are added by using the
*--extensions* command line option and giving it a comma-separated list of file
extensions to include. Matching is case insensitive, and all non-matching files
will be ignored.

    % dedoop add --extensions jpg,png path/to/source path/to/target
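
As a rough sketch of how case-insensitive extension matching like this can work, the following Python normalizes the comma-separated list and compares it against each file's suffix. The helper names `parse_extensions` and `matches` are hypothetical and are not taken from dedoop's source.

```python
from pathlib import Path


def parse_extensions(value):
    """Turn a string like "jpg,png" into a set of lowercase suffixes."""
    return {
        "." + ext.strip().lower().lstrip(".")
        for ext in value.split(",")
        if ext.strip()
    }


def matches(path, extensions):
    """Return True if the path's extension is in the set, ignoring case."""
    return Path(path).suffix.lower() in extensions
```

With `extensions = parse_extensions("jpg,png")`, a file named `a.JPG` matches while `notes.txt` is skipped.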

### List Cloud Files

It's easy to list files on the file system, but it's more difficult to see
what's in the cloud, especially with the metadata dedoop has attached to each
object. The *ls* command will do that for you.

    % dedoop ls s3://my-storage-location/

### Logging

If you use *--verbose* you will see log messages on the console about what is
happening. You can optionally send these messages to a log file of your choosing
using the *--log* option.
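
For readers who want similar behavior in their own scripts, here is a minimal sketch using Python's standard `logging` module. It is not dedoop's code: the function name `configure_logging`, the logger name, and the assumption that the console and file outputs can be enabled independently are all illustrative.

```python
import logging


def configure_logging(verbose=False, log_file=None):
    """Attach console and/or file handlers to a logger and return it."""
    logger = logging.getLogger("dedoop-sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.propagate = False
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    if verbose:
        # Console output, analogous to a --verbose flag.
        console = logging.StreamHandler()
        console.setFormatter(formatter)
        logger.addHandler(console)
    if log_file:
        # File output, analogous to a --log option.
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    if not logger.handlers:
        # Stay silent when neither output was requested.
        logger.addHandler(logging.NullHandler())
    return logger
```

A call such as `configure_logging(verbose=True, log_file="dedoop.log")` would send each message to both the console and the named file.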

[digital preservation]: https://en.wikipedia.org/wiki/Digital_preservation
[libcloud]: https://libcloud.readthedocs.io
[supported]: https://libcloud.readthedocs.io/en/stable/storage/supported_providers.html
