https://github.com/fr3fou/decker

check for duplicate images and find the best one in a folder with an easy to use cli
https://github.com/fr3fou/decker

go golang hacktoberfest perceptual-hashing phash

Last synced: 9 months ago
JSON representation

check for duplicate images and find the best one in a folder with an easy to use cli

Host: GitHub
URL: https://github.com/fr3fou/decker
Owner: fr3fou
License: mit
Created: 2019-11-15T11:34:25.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2021-10-29T16:18:48.000Z (over 4 years ago)
Last Synced: 2024-07-30T19:50:02.794Z (almost 2 years ago)
Topics: go, golang, hacktoberfest, perceptual-hashing, phash
Language: Go
Homepage:
Size: 89.8 KB
Stars: 10
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# decker

👯 Check for duplicate images and find the best one in a folder with an easy to use CLI and GUI!

## Building

```sh
$ make # Builds and installs
$ make build # Builds app with go
$ make install # Installs the already built app (only use after building first)
$ make uninstall # Removes app from path
```
## Usage

Console usage
```console
-h
shows this help page
-d string
path to the directory which contains the images
-dir string
path to the directory which contains the images
-t int
threshold amount (default 5)
-threshold int
threshold amount (default 5)
```

## References

- https://cotin.tech/Algorithm/ImageSimilarityComparison/
- http://bertolami.com/index.php?engine=blog&content=posts&detail=perceptual-hashing
- https://content-blockchain.org/research/testing-different-image-hash-functions/
- https://iq.opengenus.org/burkhard-keller-tree/
- https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078
- https://signal-to-noise.xyz/post/bk-tree/
- https://www.geeksforgeeks.org/bk-tree-introduction-implementation
- https://medium.com/@wolfgarbe/fast-approximate-string-matching-with-large-edit-distances-in-big-data-2015-9174a0968c0b
- https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078

## TODO:

- [x] Implement sequential first, then concurrent - on average, the concurrent version is ~3.5x faster
- [x] ~~Think of data structure that can hold the best quality image and the respective duplicates as children~~
- [x] Reimplement it
- [ ] Tests
- [x] Find `IsBest` field based on resolution
- [x] CLI
- [ ] Handle rotated images
- [ ] Implement pHash here instead of relying on thirdparty library
- [ ] GUI in ~~[zserge/lorca](https://github.com/zserge/lorca)~~ https://github.com/AllenDang/giu
- [ ] Delete dupes / prompt / preview

## Data structure

A data structure will have to be created that has the following properties:

- needs to store:
- path to each image
- the hash of each image (?) - maybe we can just store the hamming distance? (we will only need the %)
- the BEST image (in terms of resolution)
- the duplicate images
- needs to have an array of it

### Current solution

```go
type Node struct {
Image image.Image
Path string
Hash *goimagehash.ImageHash
Children []Node
}

type Graph struct {
Threshold int
Nodes []Node
}
// where Graph holds all of the unique images
// and Node holds all the duplciates of the images
```

### Old Solution

This is the old solution that was rewritten

How about this?
The idea is that in the first array, we are going to hold
ALL of `decker.Image` - wrapping the normal `image.Image`, while adding

- the path
- the hash
- the ID of a bucket (originally set to 0)
- IsBest field

After the map has been created, we can lazily go over each entry and find the correct `IsBest` image.
We'll use the resolution of the images to accomplish this.

```go
// first step is to generate an array of all images but adding their hash and path as well
[]decker.Image{
decker.Image{
Hash: 0xaf0912bf, // the hash isn't directly stored like this, it's stored in the goimagehash struct, which has a field `.hash`
Path: "~/Pictures/Wallpapers/Foo",
IsBest: true,
ID: -1,
},
decker.Image{
Hash: 0x98adf2bf,
Path: "~/Pictures/Wallpapers/Foo1",
IsBest: false,
ID: -1,
},
decker.Image{
Hash: 0x1003001,
Path: "~/Pictures/Wallpapers/Wow",
IsBest: false,
ID: -1,
},
}

// second step is to create a map of all duplicate images combined into an array

// 0xaf0912bf and 0x98adf2bf are duplicates of one another, they also have the same ID
// hence why they are added on the `1` key of the map

// ID -> siblings array
// The key is an ID
// The value is a bucket of duplicate images
map[uint64][]decker.Image

1 -> []decker.Image{
decker.Image{
Hash: 0xaf0912bf,
IsBest: false,
ID: 1,
},
decker.Image{
Hash: 0x98adf2bf,
IsBest: false,
ID: 1,
},
// ... any other duplicates of `1`
}
2 -> []decker.Image{
decker.Image{
Hash: 0x1003001,
IsBest: false,
ID: 2,
},
// ... any duplicates of `2`
// if there aren't any, this entry gets deleted
}

// third step is to go over every element in the map and then every image
// and find the best image based on resolution

// TODO
```

Maybe there is some way to optimize this to do more operations at the same time? Right now this involves going over the images 3 times

## Shoutouts

[mlvzk](http://github.com/mlvzk/) - helping out with concurrency and general lib structure

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/fr3fou/decker

Awesome Lists containing this project

README