{"id":23920500,"url":"https://github.com/justinshenk/simages","last_synced_at":"2025-10-11T13:01:41.509Z","repository":{"id":48833983,"uuid":"188052094","full_name":"JustinShenk/simages","owner":"JustinShenk","description":"Find duplicates and similar images in a folder","archived":false,"fork":false,"pushed_at":"2023-06-28T21:47:12.000Z","size":23682,"stargazers_count":23,"open_issues_count":2,"forks_count":3,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-12-30T14:09:09.388Z","etag":null,"topics":["autoencoder","duplicate-detection","images","preprocessing","similarity-detection"],"latest_commit_sha":null,"homepage":"https://simages.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JustinShenk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-05-22T14:09:39.000Z","updated_at":"2024-10-27T03:56:36.000Z","dependencies_parsed_at":"2023-11-28T17:18:02.245Z","dependency_job_id":null,"html_url":"https://github.com/JustinShenk/simages","commit_stats":{"total_commits":127,"total_committers":4,"mean_commits":31.75,"dds":"0.20472440944881887","last_synced_commit":"b5b9dbbf16333b037b8335e58e4c65e5bc91e27f"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinShenk%2Fsimages","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinShenk%2Fsimages/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinShenk%2Fsimages/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/JustinShenk%2Fsimages/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JustinShenk","download_url":"https://codeload.github.com/JustinShenk/simages/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232608315,"owners_count":18549524,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["autoencoder","duplicate-detection","images","preprocessing","similarity-detection"],"created_at":"2025-01-05T15:49:41.648Z","updated_at":"2025-10-11T13:01:41.216Z","avatar_url":"https://github.com/JustinShenk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :monkey: simages:monkey:\n[![PyPI version](https://badge.fury.io/py/simages.svg)](https://badge.fury.io/py/simages) [![Build Status](https://travis-ci.com/justinshenk/simages.svg?branch=master)](https://travis-ci.com/justinshenk/simages)  [![Documentation Status](https://readthedocs.org/projects/simages/badge/?version=latest)](https://simages.readthedocs.io/en/latest/?badge=latest) [![DOI](https://zenodo.org/badge/188052094.svg)](https://zenodo.org/badge/latestdoi/188052094) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justinshenk/simages/master?filepath=demo.ipynb)\n\n\nFind similar images within a dataset. 
\n\nUseful for removing duplicate images from a dataset after scraping images with [google-images-download](https://github.com/hardikvasa/google-images-download).\n\nThe Python API returns `pairs, distances`, where `pairs` are the (ordered) closest pairs and `distances` are the\ncorresponding embedding distances.\n\n### Install\n\nSee the [installation docs](https://simages.readthedocs.io/en/latest/install.html) for all details. \n\n```bash\npip install simages\n```\n\nor install from source:\n\n```bash\ngit clone https://github.com/justinshenk/simages\ncd simages\npip install .\n```\n\nTo install the interactive interface, [install mongodb](https://docs.mongodb.com/manual/installation/) and run `pip install \"simages[all]\"` instead.\n\n### Demo\n\n1. Minimal command-line interface with ```simages-show```:\n\n![simages_demo](images/simages_demo.gif)\n\n2. Interactive image deletion with ```simages add/find```:\n![simages_web_demo](images/screenshot_server.png)\n\n### Usage\n\nTwo interfaces exist:\n\n1. minimal interface which plots the duplicates for visual inspection\n2. mongodb + flask interface which allows interactive deletion [optional]\n \n#### Minimal Interface\n\nIn your console, enter the directory with images and use `simages-show`:\n\n```bash\n$ simages-show --data-dir .\n```\n\n```\nusage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]\n                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]\n                    [--pairs PAIRS] [--zdim ZDIM] [-s]\n\n  -h, --help            show this help message and exit\n  --data-dir DATA_DIR, -d DATA_DIR\n                        Folder containing image data\n  --show-train, -t      Show training of embedding extractor every epoch\n  --epochs EPOCHS, -e EPOCHS\n                        Number of passes of dataset through model for\n                        training. 
More is better but takes more time.\n  --num-channels NUM_CHANNELS, -c NUM_CHANNELS\n                        Number of channels for data (1 for grayscale, 3 for\n                        color)\n  --pairs PAIRS, -p PAIRS\n                        Number of pairs of images to show\n  --zdim ZDIM, -z ZDIM  Compression bits (bigger generally performs better but\n                        takes more time)\n  -s, --show            Show closest pairs\n\n```\n\n#### Web Interface [Optional]\n\nNote: To install the web interface API, [install and run mongodb](https://docs.mongodb.com/manual/installation/) and use `pip install \"simages[all]\"` to install optional dependencies.\n\nAdd your pictures to the database (this may take some time, depending on the number of pictures):\n\n```\nsimages add \u003cimages_folder_path\u003e\n```\n\nA webpage will open showing all of the similar or duplicate pictures:\n```\nsimages find \u003cimages_folder_path\u003e\n```\n\n```\nUsage:\n    simages add \u003cpath\u003e ... [--db=\u003cdb_path\u003e] [--parallel=\u003cnum_processes\u003e]\n    simages remove \u003cpath\u003e ... [--db=\u003cdb_path\u003e]\n    simages clear [--db=\u003cdb_path\u003e]\n    simages show [--db=\u003cdb_path\u003e]\n    simages find \u003cpath\u003e [--print] [--delete] [--match-time] [--trash=\u003ctrash_path\u003e] [--db=\u003cdb_path\u003e] [--epochs=\u003cepochs\u003e]\n    simages -h | --help\nOptions:\n    -h, --help                Show this screen\n    --db=\u003cdb_path\u003e            The location of the database or a MongoDB URI. (default: ./db)\n    --parallel=\u003cnum_processes\u003e The number of parallel processes to run to hash the image\n                               files (default: number of CPUs).\n    find:\n        --print               Only print duplicate files rather than displaying HTML file\n        --delete              Move all found duplicate pictures to the trash. 
This option takes priority over --print.\n        --match-time          Adds the extra constraint that duplicate images must have the\n                              same capture times in order to be considered.\n        --trash=\u003ctrash_path\u003e  Where files will be put when they are deleted (default: ./Trash)\n        --epochs=\u003cepochs\u003e     Epochs for training [default: 2]\n```\n\n\n### Python APIs\n\n#### Numpy array\n\n```python\nfrom simages import find_duplicates\nimport numpy as np\n\n# np.random.random takes the shape as a single tuple\narray_data = np.random.random((100, 3, 48, 48))  # N x C x H x W\npairs, distances = find_duplicates(array_data)\n \n```\n\n#### Folder\n\n```python\nfrom simages import find_duplicates\n\ndata_dir = \"my_images_folder\"\npairs, distances = find_duplicates(data_dir)\n \n```\n\nDefault options for `find_duplicates` are:\n\n```python\ndef find_duplicates(\n    input: Union[str, np.ndarray],\n    n: int = 5,\n    num_epochs: int = 2,\n    num_channels: int = 3,\n    show: bool = False,\n    show_train: bool = False,\n    **kwargs\n):\n    \"\"\"Find duplicates in dataset. 
`input` must be either a folder path or an array.\n\n    Args:\n        input (str or np.ndarray): folder directory or N x C x H x W array\n        n (int): number of closest pairs to identify\n        num_epochs (int): how long to train the autoencoder (more is generally better)\n        show (bool): display the closest pairs\n        show_train (bool): show output every epoch\n        z_dim (int): size of compression (more is generally better, but slower)\n        kwargs (dict): additional keyword arguments, passed to `EmbeddingExtractor`\n\n    Returns:\n        pairs (np.ndarray): indices for closest pairs of images, n x 2 array\n        distances (np.ndarray): distance between the two images of each pair\n    \"\"\"\n```\n\n#### `Embeddings` API\n\n```python\nfrom simages import Embeddings\nimport numpy as np\n\nN = 1000\ndata = np.random.random((N, 28, 28))\nembeddings = Embeddings(data)\n\n# Access the array\narray = embeddings.array # N x z (compression size)\n\n# Get 5 closest pairs of images\npairs, distances = embeddings.duplicates(n=5)\n\n```\n\n```python\nIn [0]: pairs\nOut[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])\n\nIn [1]: distances\nOut[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])\n```\n\n#### `EmbeddingExtractor` API\n\n```python\nfrom simages import EmbeddingExtractor\nimport numpy as np\n\nN = 1000\ndata = np.random.random((N, 28, 28))\nextractor = EmbeddingExtractor(data, num_channels=1) # grayscale\n\n# Show 10 closest pairs of images\npairs, distances = extractor.show_duplicates(n=10)\n\n```\n\nClass attributes and parameters:\n\n```python\nclass EmbeddingExtractor:\n    \"\"\"Extract embeddings from data with models and allow visualization.\n\n    Attributes:\n        trainloader (torch loader)\n        evalloader (torch loader)\n        model (torch.nn.Module)\n        embeddings (np.ndarray)\n\n    \"\"\"\n    def __init__(\n        self,\n        input:Union[str, np.ndarray],\n        num_channels=None,\n        num_epochs=2,\n       
 batch_size=32,\n        show_train=True,\n        show=False,\n        z_dim=8,\n        **kwargs,\n    ):\n        \"\"\"Inits EmbeddingExtractor with input, either `str` or `np.ndarray`, and performs training and validation.\n\n        Args:\n            input (np.ndarray or str): data\n            num_channels (int): grayscale = 1, color = 3\n            num_epochs (int): more is better (generally)\n            batch_size (int): number of images per batch\n            show_train (bool): show intermediate training results\n            show (bool): show closest pairs\n            z_dim (int): compression size\n            kwargs (dict)\n\n        \"\"\"\n\n```\n\nSpecify the number of pairs to identify with the parameter `n`.\n \n### How it works\n\n*simages* uses a convolutional autoencoder with PyTorch and compares the latent representations with [closely](https://github.com/justinshenk/closely) :triangular_ruler:.\n\n#### Dependencies\n\n*simages* depends on the following packages:\n\n- [closely](https://github.com/justinshenk/closely)\n- [torch](https://pytorch.org)\n- [torchvision](https://pytorch.org)\n- scikit-learn\n- matplotlib\n\nThe following dependencies are required for the interactive deletion interface:\n \n- pymongodb\n- fastcluster\n- flask\n- jinja2\n- dnspython\n- python-magic\n- termcolor\n\n### Cite\n\nIf you use simages, please cite it:\n```\n    @misc{justin_shenk_2019_3237830,\n      author       = {Justin Shenk},\n      title        = {justinshenk/simages: v19.0.1},\n      month        = jun,\n      year         = 2019,\n      doi          = {10.5281/zenodo.3237830},\n      url          = {https://doi.org/10.5281/zenodo.3237830}\n    }\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustinshenk%2Fsimages","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjustinshenk%2Fsimages","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustinshenk%2Fsimages/lists"}