https://github.com/hekmon/deduper
Analyse 2 paths to find identical files and hard link them to save space
- Host: GitHub
- URL: https://github.com/hekmon/deduper
- Owner: hekmon
- License: mit
- Created: 2023-11-06T08:00:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-19T11:52:28.000Z (about 1 year ago)
- Last Synced: 2025-01-11T12:37:41.160Z (5 months ago)
- Topics: dedup, deduplicate, deduplication, deduplicator, duplicate-files, filesystem, hardlink, hardlinking, hardlinks, linux, saving, scan, scanner, space, tool, utility
- Language: Go
- Homepage:
- Size: 151 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# deduper
Analyse 2 paths on the same file system to find identical files and hard link them to save space.
## How it works
* Indexing: both paths will be analyzed and the structure of the directory trees, together with their corresponding inodes, mapped in memory (files & directories)
* Then the tree of path A will be walked and, for each regular file, the in-memory structure of path B will be searched for potential candidates
    * First, a list of all files in B having the exact same size as the A file being analyzed will be compiled (empty files will be ignored)
    * Then this list will be pruned based on several criteria
        * Candidates in B that are already hardlinks of the reference A file will be removed from the list
        * Files that do not have the same inode metadata (ownership [uid, gid] and file mode) will be removed from the candidates list, to avoid breaking existing access to these files (as hardlinks share the same metadata by design)
            * Unless the `-force` flag is set, in which case such candidates are kept (but will have their metadata changed once hardlinking is done)
    * For candidates still on the list, a SHA256 checksum will be computed to ensure they indeed have the same content as the reference file in A currently being processed
    * For candidates that have passed all the tests and are still on the candidates list:
        * if the `-apply` flag has been set
            * They will be removed (in order to free their path)
            * The reference file in A will be hard linked to the path the B candidate had, making it available once again but deduplicated with A this time
        * if the `-apply` flag has not been set
            * A report of what would have been done (and how much space would have been saved) will be printed; the sketch after this list shows the overall per-candidate flow
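To make the flow above concrete, here is a minimal Go sketch of the per-candidate decision. It is not deduper's actual code: the names (`dedupCandidate`, `sameMetadata`, `fileSHA256`, the example paths) are hypothetical, and it assumes Linux, where `os.FileInfo.Sys()` returns a `*syscall.Stat_t`.

```go
// Illustrative sketch only: names like dedupCandidate are hypothetical,
// not deduper's real API. Assumes Linux, where FileInfo.Sys() is *syscall.Stat_t.
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"syscall"
)

// sameMetadata mirrors the pruning rule applied unless -force is set:
// candidates must share ownership (uid, gid) and file mode with the reference.
func sameMetadata(a, b os.FileInfo) bool {
	sa, okA := a.Sys().(*syscall.Stat_t)
	sb, okB := b.Sys().(*syscall.Stat_t)
	if !okA || !okB {
		return false
	}
	return sa.Uid == sb.Uid && sa.Gid == sb.Gid && a.Mode() == b.Mode()
}

// alreadyLinked reports whether both paths already point to the same inode
// on the same device, i.e. they are already hardlinks of each other.
func alreadyLinked(a, b os.FileInfo) bool {
	sa, okA := a.Sys().(*syscall.Stat_t)
	sb, okB := b.Sys().(*syscall.Stat_t)
	return okA && okB && sa.Dev == sb.Dev && sa.Ino == sb.Ino
}

// fileSHA256 hashes a file's whole content.
func fileSHA256(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// dedupCandidate runs the checks in the order described above and, when
// apply is true, replaces candB by a hardlink to refA.
func dedupCandidate(refA, candB string, force, apply bool) error {
	infoA, err := os.Stat(refA)
	if err != nil {
		return err
	}
	infoB, err := os.Stat(candB)
	if err != nil {
		return err
	}
	// Cheap filters first: identical (non-zero) size, not already linked.
	if infoA.Size() == 0 || infoA.Size() != infoB.Size() || alreadyLinked(infoA, infoB) {
		return nil
	}
	// Skip candidates whose ownership/mode would visibly change, unless forced.
	if !force && !sameMetadata(infoA, infoB) {
		return nil
	}
	// Expensive check last: identical content.
	sumA, err := fileSHA256(refA)
	if err != nil {
		return err
	}
	sumB, err := fileSHA256(candB)
	if err != nil {
		return err
	}
	if !bytes.Equal(sumA, sumB) {
		return nil
	}
	if !apply {
		fmt.Printf("would hardlink %s -> %s (saving %d bytes)\n", candB, refA, infoB.Size())
		return nil
	}
	// Free the candidate's path, then recreate it as a hardlink to the reference.
	if err := os.Remove(candB); err != nil {
		return err
	}
	return os.Link(refA, candB)
}

func main() {
	// Dry-run check of a single hypothetical pair.
	if err := dedupCandidate("example/dirA/file", "example/dirB/file", false, false); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

As in the steps described above, the candidate is removed first so its path is free, then `os.Link` recreates that path pointing at the reference file's inode.
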
## Usage

```
Usage of ./deduper:
  -apply
        By default deduper runs in dry run mode: set this flag to actually apply changes
  -debug
        Show debug logs during the analysis phase
  -dirA string
        Referential directory
  -dirB string
        Second directory to compare dirA against
  -force
        Dedup files that have the same content even if their inode metadata (ownership and mode) are not the same
  -minSize string
        Set the minimum size a file must have to be kept for analysis (ex: 100MiB)
  -workers int
        Set the maximum number of workers that will perform IO tasks (default 6)
```

### Example
```bash
./deduper -minSize 10MiB -workers 8 -dirA "$(pwd)/example/dirA" -dirB "$(pwd)/example/dirB" -apply
```
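
After a real `-apply` run, you can verify that a pair of files was deduplicated by checking that they share a device and inode number, for instance with `stat`, or with the small standalone Go check below (a hypothetical helper, not part of deduper, and Linux-specific).

```go
// Standalone check, not part of deduper: do two paths share the same inode?
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: samelink <fileA> <fileB>")
		os.Exit(2)
	}
	var st [2]*syscall.Stat_t
	for i, path := range os.Args[1:3] {
		info, err := os.Stat(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		st[i] = info.Sys().(*syscall.Stat_t) // Linux-specific assumption
	}
	linked := st[0].Dev == st[1].Dev && st[0].Ino == st[1].Ino
	fmt.Printf("hardlinked: %v (inodes %d and %d)\n", linked, st[0].Ino, st[1].Ino)
}
```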