https://github.com/tijn/dupfinder
find duplicates of a file by its content
- Host: GitHub
- URL: https://github.com/tijn/dupfinder
- Owner: tijn
- License: gpl-3.0
- Created: 2016-10-11T08:47:09.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-12-18T20:10:06.000Z (over 9 years ago)
- Last Synced: 2025-01-03T15:12:12.440Z (over 1 year ago)
- Language: Ruby
- Size: 18.6 KB
- Stars: 1
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# Dupfinder
A tool for finding duplicate files.
# Usage
Give it a directory to search in (called the haystack) and one or more files to search for (called the needles).
    find-dups [options] haystack needle...
        -h, --help
        -v, --[no-]verbose
        -p, --print    Print 'needle', 'hay' or 'separator'; should be a comma-separated list (no spaces!)
# Why?
I found some old backups and wondered which files I had already transferred to the new computer. There were many files, so I needed a program to help me with this. Some existing solutions I found would start off by calculating a hash of every single file on my hard drive, presumably to create an index that can later be searched using the contents of the "needles". I decided that I could postpone the heavy calculation until I found two files with exactly the same size. (If the sizes differ, the contents must differ too.) This resulted in a huge speed increase, since fetching the size of a file is a really cheap operation compared to calculating a hash of its contents.
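The size-before-hash idea can be illustrated with a minimal Ruby sketch. This is not taken from the dupfinder source: the helper name `same_content?` is hypothetical, and SHA-256 is an assumed choice, since the text above does not say which hash is used.

```ruby
require "digest"

# Hypothetical helper, not dupfinder's actual code.
# Returns true when both paths hold identical contents.
def same_content?(path_a, path_b)
  # Cheap check first: reading a file's size does not touch its contents.
  return false unless File.size(path_a) == File.size(path_b)

  # Only files of equal size reach the expensive hashing step.
  # SHA-256 is an assumption; any content hash would serve here.
  Digest::SHA256.file(path_a).hexdigest == Digest::SHA256.file(path_b).hexdigest
end
```

When most files in the haystack differ in size from every needle, nearly all comparisons end at the first line of the method and no file contents are read.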
The program keeps a big list in memory of all the files you tell it to look for. We have gigabytes of memory these days and it proved not to be a problem at all for me.
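One way such an in-memory list could be organised is as a hash keyed by file size, so each file in the haystack needs only one lookup. The sketch below is an assumption about a possible structure, not a description of dupfinder's internals. The method name `find_duplicates`, the SHA-256 digest, and the choice to skip a needle that itself lives in the haystack are all hypothetical.

```ruby
require "digest"
require "find"

# Hypothetical sketch; names and structure are assumptions, not dupfinder's code.
# Returns a hash mapping each needle to the haystack files with the same contents.
def find_duplicates(haystack, needles)
  # The in-memory list: needles grouped by byte size for quick lookup.
  needles_by_size = needles.group_by { |path| File.size(path) }
  # Cache digests so no file is hashed more than once.
  digests = Hash.new { |cache, path| cache[path] = Digest::SHA256.file(path).hexdigest }
  matches = Hash.new { |hash, needle| hash[needle] = [] }

  Find.find(haystack) do |hay|
    next unless File.file?(hay)

    candidates = needles_by_size[File.size(hay)]
    next unless candidates # no needle has this size, so nothing is hashed

    candidates.each do |needle|
      next if File.identical?(needle, hay) # skip the needle itself
      matches[needle] << hay if digests[needle] == digests[hay]
    end
  end
  matches
end
```

Memory use here grows with the number of needles and digests of size-matched files, not with the size of the haystack.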