{"id":16526151,"url":"https://github.com/markusressel/py-image-dedup","last_synced_at":"2025-04-07T13:08:12.978Z","repository":{"id":37884934,"uuid":"131913341","full_name":"markusressel/py-image-dedup","owner":"markusressel","description":"CLI utility to find near duplicate images and remove all but the best copy.","archived":false,"fork":false,"pushed_at":"2024-05-29T01:58:49.000Z","size":18645,"stargazers_count":150,"open_issues_count":7,"forks_count":15,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-05-29T15:45:10.060Z","etag":null,"topics":["dedup","deduplication","duplicate-detection","duplicate-images","find-duplicates","hacktoberfest","image-analysis","image-comparison","python","python-3","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/markusressel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["markusressel"]}},"created_at":"2018-05-02T22:42:23.000Z","updated_at":"2024-05-30T18:24:30.181Z","dependencies_parsed_at":"2023-10-11T01:53:38.526Z","dependency_job_id":"ccb677e1-7a31-4999-80a7-66999103907d","html_url":"https://github.com/markusressel/py-image-dedup","commit_stats":{"total_commits":451,"total_committers":7,"mean_commits":64.42857142857143,"dds":0.6674057649667406,"last_synced_commit":"1c1603785c9e28552de8d6bf15c8acc4687ff311"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markusressel%2F
py-image-dedup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markusressel%2Fpy-image-dedup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markusressel%2Fpy-image-dedup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markusressel%2Fpy-image-dedup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/markusressel","download_url":"https://codeload.github.com/markusressel/py-image-dedup/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247657281,"owners_count":20974345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dedup","deduplication","duplicate-detection","duplicate-images","find-duplicates","hacktoberfest","image-analysis","image-comparison","python","python-3","python3"],"created_at":"2024-10-11T17:26:19.106Z","updated_at":"2025-04-07T13:08:12.952Z","avatar_url":"https://github.com/markusressel.png","language":"Python","funding_links":["https://github.com/sponsors/markusressel"],"categories":["Python"],"sub_categories":[],"readme":"# py-image-dedup [![Build Status](https://img.shields.io/endpoint.svg?url=https%3A%2F%2Factions-badge.atrox.dev%2Fmarkusressel%2Fpy-image-dedup%2Fbadge%3Fref%3Dmaster\u0026style=flat)](https://actions-badge.atrox.dev/markusressel/py-image-dedup/goto?ref=master) [![Code Climate](https://codeclimate.com/github/markusressel/py-image-dedup.svg)](https://codeclimate.com/github/markusressel/py-image-dedup) [![PyPI 
version](https://badge.fury.io/py/py-image-dedup.svg)](https://badge.fury.io/py/py-image-dedup)\n\n**py-image-dedup** is a tool to sort out or remove duplicates within a photo library. \nUnlike most other solutions, **py-image-dedup** \nintentionally uses an approximate image comparison to also detect \nduplicates of images that slightly differ in resolution, color or other minor details.\n\nIt is built upon [Image-Match](https://github.com/ascribe/image-match), a very popular library to compute\na pHash for an image and store the result in an ElasticSearch backend for very high scalability.\n\n[![asciicast](https://asciinema.org/a/3WbBxMXnZyT1QnuTP9fm37wkS.svg)](https://asciinema.org/a/3WbBxMXnZyT1QnuTP9fm37wkS)\n\n# How it works\n\n### Phase 1 - Database cleanup\n\nIn the first phase the elasticsearch backend is checked against the \ncurrent filesystem state, cleaning up database entries of files that \nno longer exist. This will speed up queries made later on.\n\n### Phase 2 - Counting files\n\nAlthough not necessary for the deduplication process, it is very convenient\nto have some kind of progress indication while the deduplication process\nis at work. To be able to provide that, available files must be counted beforehand.\n\n### Phase 3 - Analysing files\n\nIn this phase every image file is analysed. This means generating a signature (pHash)\nto quickly compare it to other images and adding other metadata of the image\nto the elasticsearch backend that is used in the next phase.\n\nThis phase is quite CPU intensive and the first run can take quite\nsome time. Using as many threads as feasible (via the `-t` parameter) \nis advised to get the best performance.\n\nSince the database might already contain an entry for a given file from a previous run, \nthe file modification time is compared to the stored one before the file is analysed.\nIf the database entry still seems to be correct, the signature\nfor this file will **not** be recalculated. 
Because of this, subsequent\nruns will be much faster. Some file access still has to happen though,\nso these runs are probably limited by that.\n\n### Phase 4 - Finding duplicates\n\nEvery file is now processed again - but only by means of querying the\ndatabase backend for similar images (within the given `max_dist`).\nIf images are found that match the similarity criteria, they are considered\nduplicate candidates. All candidates are then ordered according to the `prioritization_rules`,\nwhich you can specify yourself in the configuration, see [Configuration](#Configuration).\n\nIf you do not specify `prioritization_rules` yourself, the following order will\nbe used:\n\n1. pixel count (more is better)\n1. EXIF data (more EXIF data is better)\n1. file size (bigger is better)\n1. file modification time (newer is better)\n1. distance (lower is better)\n1. filename contains \"copy\" (False is better)\n1. filename length (longer is better) - (for \"edited\" versions)\n1. parent folder path length (shorter is better)\n1. 
score (higher is better)\n\nThe first candidate in the resulting list is considered to be the best\navailable version of all candidates.\n \n### Phase 5 - Moving/Deleting duplicates\n\nAll but the best version of duplicate candidates identified in the previous\nphase are now deleted from the file system (unless you specified `--dry-run`, of course).\n\nIf `duplicates_target_directory` is set, the specified folder will be used as\na root directory to move duplicates to, instead of deleting them, replicating their original \nfolder structure.\n \n### Phase 6 - Removing empty folders (Optional)\n\nIn the last phase, folders that are empty due to the deduplication \nprocess are deleted, cleaning up the directory structure (if turned on in configuration).\n\n# How to use\n\n## Install\n\nInstall **py-image-dedup** using pip:\n\n```shell\npip3 install py-image-dedup\n```\n\n## Configuration\n\n**py-image-dedup** uses [container-app-conf](https://github.com/markusressel/container-app-conf)\nto provide configuration via a YAML file as well as ENV variables. It also\ngenerates a reference config on startup. Have a look at the\n[documentation about it](https://github.com/markusressel/container-app-conf#generate-reference-config).\n\nSee [py_image_dedup_reference.yaml](/py_image_dedup_reference.yaml)\nfor an example in this repo.\n\n## Setup elasticsearch backend\n\nSince this library is based on [Image-Match](https://github.com/ascribe/image-match), \nyou need a running elasticsearch instance for efficient storing and \nquerying of image signatures.\n\n### Elasticsearch version\n\nThis library requires elasticsearch version 5 or later. Sadly the\n[Image-Match](https://github.com/ascribe/image-match) library still \nspecifies version 2, so [a fork of the original project](https://github.com/markusressel/image-match)\n is used instead. 
This fork is maintained by me, and any contributions\n are very much appreciated.\n\n### Set up the index\n\n**py-image-dedup** uses a single index (called `images` by default).\nWhen configured, this index will be created automatically for you. \n\n## Command line usage\n\n**py-image-dedup** can be used from the command line like this:\n\n```shell\npy-image-dedup deduplicate --help\n```\n\nHave a look at the help output to see how you can customize it.\n\n### Daemon\n\n**CAUTION!** This feature is still very much a work in progress. \n**Always** have a backup of your data! \n\n**py-image-dedup** has a built-in daemon that allows you to continuously\nmonitor your source directories and deduplicate them on the fly.\n\nWhen running the daemon (and if enabled in the configuration), a prometheus reporter\nis used to allow you to gather some statistical insights.\n\n```shell\npy-image-dedup daemon\n```\n\n## Dry run\n\nTo analyze images and get an overview of which images would be deleted, \nbe sure to make a dry run first.\n\n```shell\npy-image-dedup deduplicate --dry-run\n```\n\n\n## FreeBSD\n\nIf you want to run this on a FreeBSD host, make sure you have an\nup-to-date release that is able to install ports.\n\nSince [Image-Match](https://github.com/ascribe/image-match) does a lot of\nmath, it relies on `numpy` and `scipy`. To get those working on FreeBSD\nyou have to install them as a port:\n\n```shell\npkg install pkgconf\npkg install py38-numpy\npkg install py27-scipy\n```\n\nFor `.png` support you also need to install:\n```shell\npkg install png\n```\n\nI still ran into issues after installing all of these, so I threw the following\ntwo into the mix and it finally worked:\n```shell\npkg install freetype\npkg install py27-matplotlib  # this has a LOT of dependencies\n```\n\n### Encoding issues\n\nWhen using the python library `click` on FreeBSD you might run into\nencoding issues. To mitigate this, change your locale from `ANSII` to `UTF-8`\nif possible.\n\nThis can be achieved, for example, 
by creating a file `~/.login_conf` with the following content:\n```text\nme:\\\n\t:charset=ISO-8859-1:\\\n\t:lang=de_DE.UTF-8:\n```\n\n## Docker\n\nTo run **py-image-dedup** using docker you can use the [markusressel/py-image-dedup](https://hub.docker.com/r/markusressel/py-image-dedup) \nimage from DockerHub:\n\n```shell\nsudo docker run -t \\\n    -p 8000:8000 \\\n    -v /where/the/original/photolibrary/is/located:/data/in \\\n    -v /where/duplicates/should/be/moved/to:/data/out \\\n    -e PY_IMAGE_DEDUP_DRY_RUN=False \\\n    -e PY_IMAGE_DEDUP_ANALYSIS_SOURCE_DIRECTORIES=/data/in/ \\\n    -e PY_IMAGE_DEDUP_ANALYSIS_RECURSIVE=True \\\n    -e PY_IMAGE_DEDUP_ANALYSIS_ACROSS_DIRS=True \\\n    -e PY_IMAGE_DEDUP_ANALYSIS_FILE_EXTENSIONS=.png,.jpg,.jpeg \\\n    -e PY_IMAGE_DEDUP_ANALYSIS_THREADS=8 \\\n    -e PY_IMAGE_DEDUP_ANALYSIS_USE_EXIF_DATA=True \\\n    -e PY_IMAGE_DEDUP_DEDUPLICATION_DUPLICATES_TARGET_DIRECTORY=/data/out/ \\\n    -e PY_IMAGE_DEDUP_ELASTICSEARCH_AUTO_CREATE_INDEX=True \\\n    -e PY_IMAGE_DEDUP_ELASTICSEARCH_HOST=elasticsearch \\\n    -e PY_IMAGE_DEDUP_ELASTICSEARCH_PORT=9200 \\\n    -e PY_IMAGE_DEDUP_ELASTICSEARCH_INDEX=images \\\n    -e PY_IMAGE_DEDUP_ELASTICSEARCH_MAX_DISTANCE=0.1 \\\n    -e PY_IMAGE_DEDUP_REMOVE_EMPTY_FOLDERS=False \\\n    -e PY_IMAGE_DEDUP_STATS_ENABLED=True \\\n    -e PY_IMAGE_DEDUP_STATS_PORT=8000 \\\n    markusressel/py-image-dedup:latest\n```\n\nSince an elasticsearch instance is required too, you can \ninstead use the `docker-compose.yml` file included in this repo, which\nalso sets up a single-node elasticsearch cluster:\n\n```shell\nsudo docker-compose up\n```\n\n### UID and GID\n\nTo run **py-image-dedup** inside the container using a specific user id \nand group id you can use the env variables `PUID=1000` and `PGID=1000`.\n\n# Contributing\n\nGitHub is for social coding: if you want to write code, I encourage contributions through pull requests from 
forks\nof this repository. Create GitHub tickets for bugs and new features and comment on the ones that you are interested in.\n\n# License\n\n```text\npy-image-dedup by Markus Ressel\nCopyright (C) 2018  Markus Ressel\n\nThis program is free software: you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or\n(at your option) any later version.\n\nThis program is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\nGNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program.  If not, see \u003chttp://www.gnu.org/licenses/\u003e.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarkusressel%2Fpy-image-dedup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarkusressel%2Fpy-image-dedup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarkusressel%2Fpy-image-dedup/lists"}