https://github.com/tijn/dupfinder

find duplicates of a file by its content

Host: GitHub
URL: https://github.com/tijn/dupfinder
Owner: tijn
License: gpl-3.0
Created: 2016-10-11T08:47:09.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2016-12-18T20:10:06.000Z (over 9 years ago)
Last Synced: 2025-01-03T15:12:12.440Z (over 1 year ago)
Language: Ruby
Size: 18.6 KB
Stars: 1
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

README

# Dupfinder

A tool for finding duplicate files.

# Usage

Give it a directory to search in (called the haystack) and one or more files to search for (called the needles).

    find-dups [options] haystack needle...

    -h, --help
    -v, --[no-]verbose
    -p, --print          Print 'needle', 'hay' or 'separator'; should be a comma-separated list (no spaces!)

# Why?

I found some old backups and was wondering which things I had already transferred to the new computer. There were many files, so I needed a program to help me with this. The existing solutions I found would start off by calculating a hash of every single file on my hard drive, undoubtedly to create an index that can later be searched using the contents of the "needles". I decided that I could postpone the heavy calculation until I found two files with exactly the same size. (If the sizes differ, the file contents must logically differ too.) This resulted in a huge speed increase, since fetching the size of a file is a really cheap operation compared to calculating a hash of its contents.

The program keeps a big list in memory of all the files you tell it to look for. We have gigabytes of memory these days and it proved not to be a problem at all for me.
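
The size-first idea can be illustrated with a minimal sketch. This is not the actual Dupfinder source: the helper names `index_needles` and `find_duplicates`, the choice of SHA-256, and the returned hash layout are all assumptions made for illustration. It keeps the needles in memory grouped by size, walks the haystack, and only hashes a file when a needle of the same size exists.

```ruby
require 'digest'
require 'find'

# Hypothetical helper: keep all needles in memory, grouped by file size.
def index_needles(needles)
  needles.group_by { |path| File.size(path) }
end

# Hypothetical helper: walk the haystack and return { needle => [duplicates] }.
# SHA-256 is an assumption; any content hash would serve the same purpose.
def find_duplicates(haystack, needles)
  by_size = index_needles(needles)
  digests = {} # cache so that each file is hashed at most once
  digest_of = ->(path) { digests[path] ||= Digest::SHA256.file(path).hexdigest }
  result = Hash.new { |hash, needle| hash[needle] = [] }

  Find.find(haystack) do |hay|
    next unless File.file?(hay)

    # Cheap check first: a file whose size matches no needle is never hashed.
    candidates = by_size[File.size(hay)]
    next if candidates.nil?

    candidates.each do |needle|
      next if File.identical?(hay, needle) # the needle itself is not a duplicate
      result[needle] << hay if digest_of.call(hay) == digest_of.call(needle)
    end
  end
  result
end
```

Calling `find_duplicates('/path/to/haystack', ['/path/to/needle'])` (placeholder paths) would return a hash that maps each needle with at least one match to the haystack files sharing its content. Needles without a match are simply absent from the result.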
