https://github.com/PJDude/dude

Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation.

# DUDE (DUplicates DEtector)

A cross-platform GUI utility for finding duplicate files and deleting or linking them to save space.

## Features:
- Scanning for duplicate files in **multiple designated folders** (up to 8). Optional **"Cross paths"** display mode
- Optional **command line parameters** to start scanning immediately or integrate **Dude** with your favorite file manager
- Two **synchronized** panels:
    - groups of duplicates
    - directory of the selected file
- Two-stage processing:
    - interactive marking of files with multiple criteria
    - taking action on marked files (move to Trash/Recycle Bin, delete, hard-link, soft-link, create a Windows .lnk shortcut file)
- Support for **regular expressions** or **glob** expressions syntax
- Searching for duplicates based on the **hash** of the file content. Different filenames or extensions do not affect the search results
- Works on **Linux** and **Windows**
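
The difference between the two pattern syntaxes can be illustrated with Python's standard library. This is only a sketch of glob versus regular-expression matching in general; it does not show how **Dude** applies patterns internally, and the helper `matches` and the sample paths are hypothetical.

```python
import fnmatch
import re


def matches(path: str, pattern: str, use_regex: bool = False) -> bool:
    """Return True if `path` matches `pattern` (glob by default, regex on request)."""
    if use_regex:
        # re.search finds the pattern anywhere in the path
        return re.search(pattern, path) is not None
    # fnmatchcase: '*' matches any run of characters, including '/'
    return fnmatch.fnmatchcase(path, pattern)


# Hypothetical sample paths, for illustration only
paths = [
    "/home/user/photo.jpg",
    "/home/user/photo_copy.jpg",
    "/home/user/notes.txt",
]
print([p for p in paths if matches(p, "*.jpg")])
print([p for p in paths if matches(p, r"_copy\.\w+$", use_regex=True)])
```
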

### 💥 What's new in version 2.x:
- **Image similarity mode**, with caching, sensitivity parameters and detection of rotated images
- **Preview window** for images and text files
- **"Same directory"** display mode

## Why another anti-duplicate application?
- Because you need to see the context of the files being removed, and to use such an application clearly, safely and easily.

## Screenshots:

#### GUI usage example:
![image info](./info/dude.gif)
#### Starting a scan immediately with CLI parameters:
![image info](./info/cmd.gif)
#### Settings dialog:
![image info](./info/settings.png)

## Download:
Portable executable packages created with [PyInstaller](https://pyinstaller.org/en/stable) for **Linux** and **Windows** can be downloaded from the [Releases](https://github.com/PJDude/dude/releases) page. Separate builds are also created with the [Nuitka](https://github.com/Nuitka/Nuitka) compiler.

## [Review on SOFTPEDIA](https://www.softpedia.com/get/System/File-Management/Dude-DUplicates-DEtector.shtml)

## [Review on MAJORGEEKS](https://www.majorgeeks.com/files/details/dude_(duplicates_detector).html)

## General usage:
- Scan for duplicate files
- Mark files for processing
- Take action on marked files (delete, softlink, hardlink)

## Usage tips:
- Use keyboard shortcuts. All of them are described in the context menus of the main panels. As a general rule, actions with the **Ctrl** key apply to all files/groups, while actions without **Ctrl** apply locally. Start with just **Tab**, the **arrow keys**, **F5**, **Space**, and **Delete**.
- Sometimes it is more efficient to operate on entire folders than on CRC groups, so don't ignore the lower panel: use **Tab** and **F5** (**F6**).
- Narrow down the scanning area by excluding unnecessary folders (e.g. the Windows system folder, which is full of duplicates). You can always add multiple (up to 8) independent scan paths.
- Add the location of the **Dude** binary to your **PATH** environment variable for more convenient and faster command line operation.
- The performance of the scanning process (as of any other software that requires frequent or extensive file access) may be degraded by updates of the **atime** attribute on the scanned file system. On Linux file systems this can usually be disabled with the **noatime** mount option in **fstab**; on Windows/NTFS use the **disablelastaccess** option of the **fsutil** command or modify the registry.
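
To see whether a Linux mount already uses **noatime**, the mount table can be inspected. The sketch below assumes the usual `/proc/mounts` line layout (device, mount point, file system type, options, two numeric fields); the function name `mount_options` and the sample text are hypothetical.

```python
def mount_options(mounts_text: str) -> dict:
    """Map each mount point to its set of options, given /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        result[mount_point] = set(options.split(","))
    return result


# Hypothetical sample; on a real Linux system read the text from /proc/mounts
sample = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /data ext4 rw,noatime 0 0\n"
)
for mount_point, opts in mount_options(sample).items():
    print(mount_point, "noatime" in opts)
```
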

## Supported platforms:
- Linux
- Windows (10,11)

## Command line examples:
* Start scanning for duplicates in current directory:
```
dude .
```
* Start scanning in specified directories:
```
dude c:\order d:\mess
```
* Generate a CSV report, excluding some paths:
```
dude ~ --exclude "*.git/*" --csv result.csv   # note the quotation marks around the pattern containing asterisks
dude.exe c:\ --exclude *windows* --csv result.csv
```
* Check the full set of available parameters:
```
dude --help
```
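
When **Dude** is launched from another program (for example a file manager script), the same command lines can be assembled as an argument list. The sketch below uses only the options shown in the examples above; the helper names `build_dude_command` and `run_scan` are hypothetical, and it assumes the `dude` executable is on **PATH**. Because the arguments are passed as a list without a shell, the asterisks reach **Dude** unexpanded and need no extra quoting.

```python
import subprocess


def build_dude_command(paths, exclude=None, csv_path=None, executable="dude"):
    """Build an argument list for Dude using only the options shown above."""
    command = [executable, *paths]
    if exclude is not None:
        command += ["--exclude", exclude]
    if csv_path is not None:
        command += ["--csv", csv_path]
    return command


def run_scan(command):
    """Launch the scan; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(command, check=True)


command = build_dude_command(["."], exclude="*.git/*", csv_path="result.csv")
print(command)
# prints ['dude', '.', '--exclude', '*.git/*', '--csv', 'result.csv']
```
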
## False positives issue
[Reference to potential problems with Windows Defender and other antivirus programs](https://github.com/PJDude/dude/discussions/9).

## Portability
**Dude** writes log, configuration and cache files at runtime. The default location for these files is the **dude.data** folder, created next to the **dude executable** file. If that folder is not writable, platform-specific folders (provided by the **appdirs** module) are used for cache, settings and logs. You can use the --appdirs command line switch to force that behavior even when **dude.data** is accessible.
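
The selection logic described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the function `choose_data_dir` and its parameters are hypothetical, and the fallback directory is passed in by the caller (in **Dude** it would come from the **appdirs** module).

```python
import os
import tempfile


def choose_data_dir(executable_dir, fallback_dir, force_fallback=False):
    """Prefer `dude.data` next to the executable; fall back if it is not writable."""
    portable_dir = os.path.join(executable_dir, "dude.data")
    if not force_fallback:
        try:
            os.makedirs(portable_dir, exist_ok=True)
            # Probe write access by creating and removing a temporary file
            with tempfile.TemporaryFile(dir=portable_dir):
                pass
            return portable_dir
        except OSError:
            pass  # no write access: use the platform-specific location instead
    os.makedirs(fallback_dir, exist_ok=True)
    return fallback_dir


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as root:
    exe_dir = os.path.join(root, "app")
    os.makedirs(exe_dir)
    print(choose_data_dir(exe_dir, os.path.join(root, "fallback")))
```

Setting `force_fallback=True` mirrors the effect the text attributes to the --appdirs switch.
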

## Dude is BIG 💥
Well, unfortunately, the 2.x version has a much larger distribution package than v1. This is mainly due to the need to import the [NumPy](https://numpy.org/) and [SciPy](https://scipy.org/) packages for image hashing and clustering. I apologize for the inconvenience.

## Technical information
- The scanning process analyzes the selected paths and groups files of the same size. **Dude** then compares files by the **SHA1** hash of their content (referred to as "CRC" in this document). CRC calculation is done in separate threads for every identified device (drive). The number of active threads is limited by the number of available CPU cores. Aborting the CRC calculation gives only partial results: not all duplicates may be identified. A restarted scan will use cached data. The CRC is always calculated from the entire contents of the file.
- Scanning (CRC calculation, to be precise) is done in a **specific order** that tries to process first the folders with the greatest potential for duplicates. On huge filesystems, this makes partial results more useful when a scan is aborted.
- The calculated CRC is stored in an **internal cache**, which allows it to be reused in future operations and speeds up the search for duplicates (e.g. with a different set of search paths). The key of the cache database is the pair of file inode and file modification time, stored separately for every device ID, so any file modification or relocation invalidates the obsolete data and triggers recalculation of the CRC.
- Scanning or marking files does not cause any filesystem change. Any file deletion or linking needs confirmation and is logged.
- Just before files are processed, their state (ctime) is compared with the stored data. If an inconsistency is found (the state of the files changed between scanning/CRC calculation and processing), the action is aborted and the data invalidated.
- **Dude** is written in **Python 3** with **Tkinter** and packed with [PyInstaller](https://pyinstaller.org/en/stable) into a portable distribution. The GitHub release build for the Linux platform is done in an **ubuntu-20.04** container. In case of a **glibc** incompatibility, it is always possible to build your own binary (**pyinstaller.run.sh**) or run the Python script (**dude.py**).
- **Dude** for **Windows** is built as two binary executables: **dude.exe** and **dudecmd.exe**. They should be saved in the same directory. **dudecmd.exe** serves only to respond on the console to the --help parameter or to pass command line parameters (if correct) to **dude.exe**. **dude.exe** also accepts parameters but does not respond on the console. **dudecmd.exe** keeps the Windows command line window open for the duration of the operation.
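
The size grouping, SHA1 comparison and cache key described above can be sketched in a few lines. This is a simplified, single-threaded illustration under assumptions, not the actual **Dude** implementation: the function names and the in-memory dictionary cache are hypothetical, and the per-device threads and scan ordering are omitted.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash the entire file content in chunks to keep memory use bounded."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths, cache=None):
    """Return {sha1: [paths]} for files that share both size and content hash."""
    cache = {} if cache is None else cache
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        for path in same_size:
            info = os.stat(path)
            # Assumed cache layout: one dict per device id, keyed by (inode, mtime)
            device_cache = cache.setdefault(info.st_dev, {})
            key = (info.st_ino, info.st_mtime_ns)
            if key not in device_cache:
                device_cache[key] = sha1_of_file(path)
            by_hash[device_cache[key]].append(path)
    return {h: group for h, group in by_hash.items() if len(group) > 1}
```

Passing the same `cache` dictionary to a second call skips hashing for files whose inode and modification time are unchanged, which is the effect the cache description above aims for.
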

- ***Soft links*** to **directories** are skipped during the scanning process. ***Soft links*** to **files** are ignored during scanning. Both appear in the bottom "folders" pane.
- ***Hard links*** (files with stat.st_nlink>1) are currently ignored during the scanning process and will not be identified as duplicates (neither of files sharing the same inode, obviously, nor of files with other inodes). No action can be performed on them. They only appear in the bottom "folders" pane. This may change in future versions.
- The "delete" action moves files to the **Recycle Bin / Trash** or deletes them permanently, depending on the option settings.
- 💥 Image similarity mode is based on the libraries: [PIL](https://python-pillow.org/), [ImageHash](https://pypi.org/project/ImageHash/), and the [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) data clustering algorithm from [scikit-learn](https://scikit-learn.org/stable/index.html). For maximum performance, image hashing utilizes all available CPU cores with multiple threads and the DBSCAN algorithm implementation is multi-threaded internally. Key clustering parameters can be set in the scan dialog.
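
Two of the file-state checks mentioned in this list (link detection and the ctime comparison before processing) rely on ordinary `stat` data. The sketch below shows how such checks could look with the standard library; the function names `classify` and `unchanged_since` are hypothetical and this is not the project's actual code.

```python
import os
import stat


def classify(path):
    """Classify a path as 'symlink', 'hardlink', 'regular' or 'other' without following links."""
    info = os.lstat(path)
    if stat.S_ISLNK(info.st_mode):
        return "symlink"
    if not stat.S_ISREG(info.st_mode):
        return "other"  # directories, devices, sockets, ...
    if info.st_nlink > 1:
        return "hardlink"  # more than one directory entry points at this inode
    return "regular"


def unchanged_since(path, recorded_ctime_ns):
    """True if the file's ctime still equals the value recorded at scan time."""
    return os.stat(path).st_ctime_ns == recorded_ctime_ns
```

A scanner would record `os.stat(path).st_ctime_ns` for each file during the scan and call `unchanged_since` just before deleting or linking it, aborting if the value differs.
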

###### Manual build (Linux):
```
pip install -r requirements.txt
./scripts/icons.convert.sh
./scripts/version.gen.sh
./scripts/pyinstaller.run.sh
```
###### Manual build (Windows):
```
pip install -r requirements.txt
.\scripts\icons.convert.bat
.\scripts\version.gen.bat
.\scripts\pyinstaller.run.bat
```
###### Running the Python script manually:
```
pip install -r requirements.txt
./scripts/icons.convert.sh
./scripts/version.gen.sh

python ./src/dude.py
```

## Licensing
- **Dude** is licensed under the **[MIT license](./LICENSE)**

### Check out my [homepage](https://github.com/PJDude) for other projects.