https://github.com/umstek/dupkiller
Slow, but more reliable duplicate files cleaner.
- Host: GitHub
- URL: https://github.com/umstek/dupkiller
- Owner: umstek
- License: mit
- Created: 2017-01-04T06:35:13.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2021-03-14T10:33:46.000Z (over 4 years ago)
- Last Synced: 2024-11-17T15:21:11.257Z (11 months ago)
- Topics: cleaner, duplicate-files, storage
- Language: C#
- Size: 372 KB
- Stars: 1
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DupKiller
Slow, but more reliable duplicate files cleaner.

A perfect duplicate file finder would have to compare the content of each file with the
content of every other file. That is O(n^2) in the number of files, and since files can be
large, it is not practical and would take a very long time to complete. If a dictionary,
i.e. a hash map, could be built with file contents as keys and paths as values, duplicates
could be identified quickly, but this is not a viable option either: all file contents
would have to be held in memory, and a dictionary does not work well with keys that large.
Most current software uses file size and extension to find duplicate files, but this can
be inaccurate in various cases, e.g. uncompressed images that happen to be the same size.
So, the best option is to consider multiple factors and allow the user to select which of
them to use. These factors should include a hash function. If the user suspects a hash
collision, there should be a way to compare the files byte by byte; since possible
duplicates have already been filtered by several other means, such an incident should
rarely occur.

Files are grouped by extension (optional, default yes), file name (optional, default no),
file size, a shorter hash (MD5), and a longer hash (SHA512). A sketch of this staged
grouping is shown below.
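
The following C# sketch illustrates the staged grouping idea under the assumptions above.
The type and method names are illustrative only and are not taken from the DupKiller
codebase; the optional extension/name grouping and the final byte-by-byte comparison are
omitted for brevity. Files are first grouped by size, then by MD5, then by SHA512, and
only groups that still contain more than one file are reported as probable duplicates.

```csharp
// Illustrative sketch only: these names are not from the DupKiller codebase.
// Demonstrates staged grouping (size -> MD5 -> SHA512) of candidate duplicates.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateSketch
{
    // Hash a file's content with the given algorithm and return a hex string key.
    static string HashFile(string path, HashAlgorithm algorithm)
    {
        using var stream = File.OpenRead(path);
        return Convert.ToHexString(algorithm.ComputeHash(stream));
    }

    // Yield groups of paths whose size, MD5, and SHA512 all match.
    static IEnumerable<List<string>> FindDuplicates(IEnumerable<string> paths)
    {
        // Stage 1: group by file size (cheap, no file content is read).
        var bySize = paths.GroupBy(p => new FileInfo(p).Length)
                          .Where(g => g.Count() > 1);

        foreach (var sizeGroup in bySize)
        {
            // Stage 2: within each size group, group by a shorter hash (MD5).
            using var md5 = MD5.Create();
            var byMd5 = sizeGroup.GroupBy(p => HashFile(p, md5))
                                 .Where(g => g.Count() > 1);

            foreach (var md5Group in byMd5)
            {
                // Stage 3: confirm with a longer hash (SHA512).
                using var sha512 = SHA512.Create();
                var bySha = md5Group.GroupBy(p => HashFile(p, sha512))
                                    .Where(g => g.Count() > 1);
                foreach (var dupGroup in bySha)
                    yield return dupGroup.ToList();
            }
        }
    }

    static void Main(string[] args)
    {
        var files = Directory.EnumerateFiles(args[0], "*", SearchOption.AllDirectories);
        foreach (var group in FindDuplicates(files))
            Console.WriteLine(string.Join(Environment.NewLine, group) + Environment.NewLine);
    }
}
```

Because each stage only runs inside groups that survived the previous one, hashes are
computed only for files that already share a size, which keeps the expensive work
proportional to the number of plausible duplicates rather than to all files.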