{"id":19045054,"url":"https://github.com/immobiliare/ufoid","last_synced_at":"2025-04-23T23:43:31.230Z","repository":{"id":259007117,"uuid":"832525727","full_name":"immobiliare/ufoid","owner":"immobiliare","description":"Ultra Fast Optimized Image Deduplication.","archived":false,"fork":false,"pushed_at":"2025-03-31T15:39:12.000Z","size":3857,"stargazers_count":23,"open_issues_count":9,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-18T08:39:33.627Z","etag":null,"topics":["automation","computer-vision","deduplication","images","immobiliare-labs","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/immobiliare.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-07-23T07:46:44.000Z","updated_at":"2025-02-05T11:55:07.000Z","dependencies_parsed_at":"2025-04-17T19:51:40.402Z","dependency_job_id":"ee8de10e-70d4-47be-9519-738271878c58","html_url":"https://github.com/immobiliare/ufoid","commit_stats":null,"previous_names":["immobiliare/ufoid"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/immobiliare%2Fufoid","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/immobiliare%2Fufoid/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/immobiliare%2Fufoid/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/immobiliare%2Fufoid/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/immobiliare","download_url":"https://codeload.github.com/immobiliare/ufoid/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250535057,"owners_count":21446503,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","computer-vision","deduplication","images","immobiliare-labs","python"],"created_at":"2024-11-08T22:48:38.371Z","updated_at":"2025-04-23T23:43:30.874Z","avatar_url":"https://github.com/immobiliare.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n![ImmoLogo](.github/example_images/ImmobiliareLabs_Logo_Negative.png)\n\n\u003c/div\u003e\n\n# UFOID\n\n\u003e Ultra Fast Optimized Image Deduplication.\n\n![Test](https://github.com/immobiliare/ufoid/actions/workflows/ci.yaml/badge.svg)\n![Python 3.9](https://img.shields.io/badge/Python-3.9|3.10|3.11-blue)\n\n\u003cdiv align=\"center\"\u003e\n\n![Example](.github/example_images/ufoid.gif)\n\u003c/div\u003e\n\n## Table of Contents\n\n- [Introduction](#introduction)\n- [Installation](#installation)\n- [Configuration](#configuration)\n- [Results](#results)\n- [Benchmarks](#benchmarks)\n- [Changelog](#changelog)\n- [Contributing](#contributing)\n- [Powered Apps](#powered-apps)\n- [Support](#support)\n\n## Introduction\n\nThe goal of this project is to efficiently detect and handle duplicate images within a dataset and across different\ndatasets. 
The code uses perceptual hashing (image hashing) to convert images into hash representations, allowing for\nquick comparison and identification of duplicate images based on a specified distance threshold.\n\nThe project provides two main functionalities:\n\n1. Duplicate detection within a single dataset using chunks: this method processes the dataset in smaller chunks to\n   optimize performance for large datasets.\n2. Duplicate detection between two datasets using chunks: this method allows for comparison between a reference dataset\n   and a new dataset to identify any overlapping duplicate images.\n\nThe competing [imagededup](https://github.com/idealo/imagededup) library, running with a hash size of 8 bits, loses\naccuracy quickly as the threshold increases, and its computation time degrades heavily when the number of detected\nduplicates grows because of false positives. In contrast, our library shows consistent results at higher thresholds (which can be\nimportant for detecting near-duplicates, as shown in the accuracy benchmark). 
Furthermore, despite running on 16-bit hashes,\nstrong computational optimization makes our library consistently faster (more than twice as fast on 100k datasets).\nMore details can be found in the benchmarks [README](./benchmarks/README.md).\n\n## Installation\n\n### Clone UFOID\n\n```shell\ngit clone https://github.com/immobiliare/ufoid\ncd ufoid\n```\n\n### Create virtualenv and install requirements\n\nTo create a clean environment for running the application, create a new virtualenv inside the current folder.\n\n#### If using `uv`\n\n```console\nuv venv\n```\n\n#### If using plain Python\n\n```console\npython -m venv venv\nsource venv/bin/activate\n```\n\n### Install project dependencies\n\nOnce the virtual environment is activated, install the project dependencies specified in the `.toml` file:\n\n```console\npip install .\n```\n\nYour environment is now ready to run the project.\n\n### Run tests\n\n```console\nuv run pytest\n```\n\nor\n\n```console\npython -m pytest\n```\n\n## Configuration\n\nCopy `ufoid/config/config.yaml.example` and rename the copy to `config.yaml`. This file lets you customize various aspects of the\nduplicate detection process.\nHere are some key parameters you can modify:\n\n- `num_processes`: Number of processes for parallel execution.\n- `chunk_length`: The length of each chunk for chunk-based processing. 
See below for more information.\n- `new_paths`: List of directory paths containing the new dataset for duplicate detection.\n- `old_paths`: List of directory paths containing the old dataset for comparison with the new dataset.\n- `check_with_itself`: Boolean flag to indicate whether to check for duplicates within the new dataset.\n- `check_with_old_data`: Boolean flag to indicate whether to check for duplicates between the new and old datasets.\n- `csv_output`: Boolean flag to indicate whether to save duplicate information to the output file.\n- `csv_output_file`: Path to the output file where duplicate information will be saved.\n- `delete_duplicates`: Boolean flag to indicate whether to delete duplicate images from the dataset.\n- `create_folder_with_no_duplicates`: Boolean flag to indicate whether to create a folder with non-duplicate images.\n- `new_folder`: Path to the folder where non-duplicate images will be stored.\n- `distance_threshold`: The distance threshold for considering images as duplicates. 10 is optimal for our use case,\n  since it catches all the exact duplicates (with some resilience to minor manipulations on images) while\n  avoiding collisions.\n\n#### Chunk length\n\n`chunk_length` is an important parameter used in chunk-based processing to divide a large number of images into smaller,\nmanageable chunks during the duplicate detection process. The size of each chunk is crucial as it directly affects\nmemory usage and processing efficiency. By breaking down the dataset into chunks, we can prevent running out of memory\nand optimize the performance of the duplicate detection algorithm.\n\nThe optimal value of `chunk_length` can vary depending on the hardware specifications of the machine running the\nprocess. For instance, on a machine with limited memory, such as an Apple M1 with 16 GB of RAM, a smaller `chunk_length`,\nsuch as 20,000, has been found to work well. 
However, on a machine with more memory, a larger `chunk_length` might be\nsuitable.\n\nIt is essential to find a balance between the size of the chunks and the available system resources. If\nthe `chunk_length` is too small, it may lead to a large number of iterations and increased overhead. On the other hand,\nif it is too large, it may cause the process to exhaust available memory, leading to performance issues or even crashes.\n\nTo determine the appropriate `chunk_length` for your specific system, it is recommended to experiment with different\nvalues and observe the memory usage and performance of the duplicate detection process. This way, you can fine-tune the\nparameter to best suit your hardware configuration and dataset size.\n\n#### Distance threshold\n\nThe `distance_threshold` is a fundamental parameter that controls the level of sensitivity in duplicate image detection.\nBy setting the distance threshold to 0, the algorithm will only identify exact duplicates, that is, images that are pixel-by-pixel\nidentical. As the threshold is increased, the detection becomes more permissive, allowing for the identification of\nnear-duplicates with minor alterations such as changes in brightness, compression artifacts, or minor edits.\n\nFor instance, when using a distance threshold of 10, the algorithm can identify quasi-duplicates, even if the images\nhave undergone slight modifications, making it suitable for scenarios where minor image manipulations may occur. Raising\nthe threshold to 30 enables the detection of images with more substantial changes, providing increased flexibility while\nstill avoiding unnecessary collisions with unrelated images.\n\nHowever, as the threshold reaches higher values, like 60, it becomes more inclusive and may lead to the detection of\nheavily manipulated images, potentially causing collisions with very similar images that are not actual duplicates. 
For\nexample, two images may share a common scene, but one of them might have a significant addition, like a large printed\nword, which can trigger false positives.\n\nIt's essential to choose an appropriate distance threshold based on the characteristics of the dataset and the specific\nuse case. A balance must be struck between catching meaningful duplicates and minimizing the chances of identifying\nfalse positives or unrelated images as duplicates. Experimenting with different threshold values and observing the\nresults will help determine the most suitable value for a particular scenario.\n\n## Results\n\nStart the script using the following command:\n\n```console\npython -m ufoid\n```\n\nWhen the script finishes, the results of the duplicate detection process are displayed in\nthe console. The detected duplicate image pairs will be saved to the specified output file (if `csv_output` is set\nto `true`).\n\nAdditionally, based on the configuration, a folder with non-duplicate images may be created, and duplicate images may be\ndeleted from the dataset.\n\n## Benchmarks\n\nThe `benchmarks/scripts` folder provides scripts for parameter optimization and performance testing of UFOID.\nFor more details, see the dedicated [README](./benchmarks/README.md).\n\n## Changelog\n\nSee [CHANGELOG](./CHANGELOG.md).\n\n## Contributing\n\nWe appreciate all contributions. 
If you are planning to contribute back bug-fixes, please do so without any further\ndiscussion.\n\nIf you plan to contribute new features, utility functions, or extensions to the core, please first open an issue and\ndiscuss the feature with us.\nSending a PR without discussion might end up resulting in a rejected PR because we might be taking the core in a\ndifferent direction than you might be aware of.\n\nTo learn more about making a contribution, please see our [Contribution page](./CONTRIBUTING.md).\n\n## Powered apps\n\nUFOID was created by ImmobiliareLabs, the technology department of [Immobiliare.it](https://www.immobiliare.it),\nthe #1 real estate company in Italy.\n\n**If you are using UFOID [drop us a message](mailto:opensource@immobiliare.it)**.\n\n## Support\n\nMade with ❤️ by [ImmobiliareLabs](https://github.com/immobiliare) and all the\n[contributors](./CONTRIBUTING.md#contributors).\n\nIf you have questions about using UFOID, bug reports, or enhancement requests, please feel free to reach out by opening a\n[GitHub Issue](https://github.com/immobiliare/ufoid/issues).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimmobiliare%2Fufoid","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fimmobiliare%2Fufoid","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimmobiliare%2Fufoid/lists"}