Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/justinshenk/simages
Find duplicates and similar images in a folder
https://github.com/justinshenk/simages
autoencoder duplicate-detection images preprocessing similarity-detection
Last synced: 3 days ago
JSON representation
Find duplicates and similar images in a folder
- Host: GitHub
- URL: https://github.com/justinshenk/simages
- Owner: JustinShenk
- License: mit
- Created: 2019-05-22T14:09:39.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-06-28T21:47:12.000Z (over 1 year ago)
- Last Synced: 2024-12-30T14:09:09.388Z (9 days ago)
- Topics: autoencoder, duplicate-detection, images, preprocessing, similarity-detection
- Language: Python
- Homepage: https://simages.readthedocs.io
- Size: 22.6 MB
- Stars: 23
- Watchers: 4
- Forks: 3
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# :monkey: simages:monkey:
[![PyPI version](https://badge.fury.io/py/simages.svg)](https://badge.fury.io/py/simages) [![Build Status](https://travis-ci.com/justinshenk/simages.svg?branch=master)](https://travis-ci.com/justinshenk/simages) [![Documentation Status](https://readthedocs.org/projects/simages/badge/?version=latest)](https://simages.readthedocs.io/en/latest/?badge=latest) [![DOI](https://zenodo.org/badge/188052094.svg)](https://zenodo.org/badge/latestdoi/188052094) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justinshenk/simages/master?filepath=demo.ipynb)Find similar images within a dataset.
Useful for removing duplicate images from a dataset after scraping images with [google-images-download](https://github.com/hardikvasa/google-images-download).
The Python API returns `pairs, duplicates`, where pairs are the (ordered) closest pairs and distances is the
corresponding embedding distance.### Install
See the [installation docs](https://simages.readthedocs.io/en/latest/install.html) for all details.
```bash
pip install simages
```or install from source:
```bash
git clone https://github.com/justinshenk/simages
cd simages
pip install .
```To install the interactive interface, [install mongodb](https://docs.mongodb.com/manual/installation/) and use rather `pip install "simages[all]"`.
### Demo
1. Minimal command-line interface with ```simages-show```:
![simages_demo](images/simages_demo.gif)
2. Interactive image deletion with ```simages add/find```:
![simages_web_demo](images/screenshot_server.png)### Usage
Two interfaces exist:
1. minimal interface which plots the duplicates for visual inspection
2. mongodb + flask interface which allows interactive deletion [optional]
#### Minimal InterfaceIn your console, enter the directory with images and use `simages-show`:
```bash
$ simages-show --data-dir .
``````
usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
[--epochs EPOCHS] [--num-channels NUM_CHANNELS]
[--pairs PAIRS] [--zdim ZDIM] [-s]-h, --help show this help message and exit
--data-dir DATA_DIR, -d DATA_DIR
Folder containing image data
--show-train, -t Show training of embedding extractor every epoch
--epochs EPOCHS, -e EPOCHS
Number of passes of dataset through model for
training. More is better but takes more time.
--num-channels NUM_CHANNELS, -c NUM_CHANNELS
Number of channels for data (1 for grayscale, 3 for
color)
--pairs PAIRS, -p PAIRS
Number of pairs of images to show
--zdim ZDIM, -z ZDIM Compression bits (bigger generally performs better but
takes more time)
-s, --show Show closest pairs```
#### Web Interface [Optional]
Note: To install the web interface API, [install and run mongodb](https://docs.mongodb.com/manual/installation/) and use `pip install "simages[all]"` to install optional dependencies.
Add your pictures to the database (this will take some time depending on the number of pictures)
```
simages add
```A webpage will come up with all of the similar or duplicate pictures:
```
simages find
``````
Usage:
simages add ... [--db=] [--parallel=]
simages remove ... [--db=]
simages clear [--db=]
simages show [--db=]
simages find [--print] [--delete] [--match-time] [--trash=] [--db=] [--epochs=]
simages -h | --help
Options:
-h, --help Show this screen
--db= The location of the database or a MongoDB URI. (default: ./db)
--parallel= The number of parallel processes to run to hash the image
files (default: number of CPUs).
find:
--print Only print duplicate files rather than displaying HTML file
--delete Move all found duplicate pictures to the trash. This option takes priority over --print.
--match-time Adds the extra constraint that duplicate images must have the
same capture times in order to be considered.
--trash= Where files will be put when they are deleted (default: ./Trash)
--epochs= Epochs for training [default: 2]
```### Python APIs
#### Numpy array
```python
from simages import find_duplicates
import numpy as nparray_data = np.random.random(100, 3, 48, 48)# N x C x H x W
pairs, distances = find_duplicates(array_data)
```#### Folder
```python
from simages import find_duplicatesdata_dir = "my_images_folder"
pairs, distances = find_duplicates(data_dir)
```Default options for `find_duplicates` are:
```python
def find_duplicates(
input: Union[str or np.ndarray],
n: int = 5,
num_epochs: int = 2,
num_channels: int = 3,
show: bool = False,
show_train: bool = False,
**kwargs
):
"""Find duplicates in dataset. Either `array` or `data_dir` must be specified.Args:
input (str or np.ndarray): folder directory or N x C x H x W array
n (int): number of closest pairs to identify
num_epochs (int): how long to train the autoencoder (more is generally better)
show (bool): display the closest pairs
show_train (bool): show output every
z_dim (int): size of compression (more is generally better, but slower)
kwargs (dict): etc, passed to `EmbeddingExtractor`Returns:
pairs (np.ndarray): indices for closest pairs of images, n x 2 array
distances (np.ndarray): distances of each pair to each other
```#### `Embeddings` API
```python
from simages import Embeddings
import numpy as npN = 1000
data = np.random.random((N, 28, 28))
embeddings = Embeddings(data)# Access the array
array = embeddings.array # N x z (compression size)# Get 10 closest pairs of images
pairs, distances = embeddings.duplicates(n=5)```
```python
In [0]: pairs
Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])In [1]: distances
Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])
```#### `EmbeddingExtractor` API
```python
from simages import EmbeddingExtractor
import numpy as npN = 1000
data = np.random.random((N, 28, 28))
extractor = EmbeddingExtractor(data, num_channels=1) # grayscale# Show 10 closest pairs of images
pairs, distances = extractor.show_duplicates(n=10)```
Class attributes and parameters:
```python
class EmbeddingExtractor:
"""Extract embeddings from data with models and allow visualization.Attributes:
trainloader (torch loader)
evalloader (torch loader)
model (torch.nn.Module)
embeddings (np.ndarray)"""
def __init__(
self,
input:Union[str, np.ndarray],
num_channels=None,
num_epochs=2,
batch_size=32,
show_train=True,
show=False,
z_dim=8,
**kwargs,
):
"""Inits EmbeddingExtractor with input, either `str` or `np.nd.array`, performs training and validation.
Args:
input (np.ndarray or str): data
num_channels (int): grayscale = 1, color = 3
num_epochs (int): more is better (generally)
batch_size (int): number of images per batch
show_train (bool): show intermediate training results
show (bool): show closest pairs
z_dim (int): compression size
kwargs (dict)
"""```
Specify tne number of pairs to identify with the parameter `n`.
### How it works*simages* uses a convolutional autoencoder with PyTorch and compares the latent representations with [closely](https://github.com/justinshenk/closely) :triangular_ruler:.
#### Dependencies
*simages* depends on
the following packages:- [closely](https://github.com/justinshenk/closely)
- [torch](https://pytorch.org)
- [torchvision](https://pytorch.org)
- scikit-learn
- matplotlibThe following dependencies are required for the interactive deleting interface:
- pymongodb
- fastcluster
- flask
- jinja2
- dnspython
- python-magic
- termcolor### Cite
If you use simages, please cite it:
```
@misc{justin_shenk_2019_3237830,
author = {Justin Shenk},
title = {justinshenk/simages: v19.0.1},
month = jun,
year = 2019,
doi = {10.5281/zenodo.3237830},
url = {https://doi.org/10.5281/zenodo.3237830}
}
```