{"id":27942510,"url":"https://github.com/coincheung/image-dedup","last_synced_at":"2026-04-24T16:04:55.689Z","repository":{"id":196311383,"uuid":"695481760","full_name":"CoinCheung/image-dedup","owner":"CoinCheung","description":"Codebase I use for deduplication of image datasets","archived":false,"fork":false,"pushed_at":"2025-03-21T10:55:35.000Z","size":106,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-07T11:56:46.626Z","etag":null,"topics":["cpp","dhash","opencv"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CoinCheung.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-23T10:16:53.000Z","updated_at":"2025-03-21T10:55:38.000Z","dependencies_parsed_at":null,"dependency_job_id":"b7b24074-9e41-4dae-be02-7b7d414fe380","html_url":"https://github.com/CoinCheung/image-dedup","commit_stats":null,"previous_names":["coincheung/image-dedup"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoinCheung%2Fimage-dedup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoinCheung%2Fimage-dedup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoinCheung%2Fimage-dedup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoinCheung%2Fimage-dedup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CoinCheung
","download_url":"https://codeload.github.com/CoinCheung/image-dedup/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252873989,"owners_count":21817711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","dhash","opencv"],"created_at":"2025-05-07T11:56:48.688Z","updated_at":"2026-04-24T16:04:55.593Z","avatar_url":"https://github.com/CoinCheung.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# image-dedup\n\nThis is a codebase I use for deduplication of images in datasets. It is based on dhash.  \n\n\n## Install dependencies \n\nOpenCV and OpenSSL are required:  \n```\n    $ sudo apt install libopencv-dev pkg-config libssl-dev\n```\n\n\n## Build \n\nThere is only one source file:  \n```\n    $ g++ -O2 main.cpp -std=c++14 -o run_dedup -lpthread -lcrypto $(pkg-config --libs --cflags opencv)\n```\n\n\n## Usage  \n\n### filter by quality\n\nHere we can filter out images that are considered low quality. 
The rules are:\u003cbr\u003e\n* extension is jpg/jpeg, but file neither starts with `0xff 0xd8`, nor ends with `0xff 0xd9`.\n* extension is png, but file neither starts with `89 50 4e 47 0d 0a 1a 0a`, nor ends with `49 45 4e 44 ae 42 60 82`.\n* file size is less than 50k.\n* shorter side of the image is less than 64.\n* longer side of the image is greater than 2048.\n* image channel number is not 3.\n* ratio of longer side to shorter side is greater than 4.\n\u003cbr\u003e\nThough these rules do not always point to low-quality images, I find them helpful for picking out images from a huge collection.\u003cbr\u003e\n\nIf you need to carry out the filtering described above, prepare a file `annos/images.txt` that contains paths to images in this format:\n```\n    /path/to/image1\n    /path/to/image2\n    /path/to/image3\n    ...\n```\nand then run the command:\n```\n    $ export n_proc=64 # how many cpu cores to use \n    $ ./run_dedup filter $n_proc annos/images.txt\n```\nThis generates `annos/images.txt.filt`, which has the same format as `annos/images.txt`.\n\n\n### dedup by md5\n#### Step 1. generate all md5\nFirst, prepare a file that contains paths to images in this format:  \n```\n    /path/to/image1\n    /path/to/image2\n    /path/to/image3\n    ...\n```\nSuppose we store this file as `annos/images.txt`.\n\u003cbr /\u003e\u003cbr /\u003e\nThen we run the command:  \n```\n    ## NOTE: do not change the order of the args\n    $ export n_proc=64 # how many cpu cores to use \n    $ ./run_dedup gen_md5 $n_proc annos/images.txt\n```\nThis generates `annos/images.txt.md5`, which is in this format: \n```\n    /path/to/image1,md5value1\n    /path/to/image2,md5value2\n    /path/to/image3,md5value3\n    ...\n```\n\n#### Step 2. deduplicate via md5\nThis step is meant for filtering out identical files (with every bit identical). 
Just run the command:\n```\n    $ export n_proc=64 # how many cpu cores to use \n    $ ./run_dedup dedup_md5 annos/images.txt.md5\n```\nThis generates `annos/images.txt.md5.dedup`, in which each line is a plain file path, like this:\n```\n    /path/to/image1\n    /path/to/image2\n    /path/to/image3\n    ...\n```\n\n\n### dedup by dhash\n#### Step 1. generate dhash codes of images in one dataset  \nWe can use an anno file in this format:\n```\n    /path/to/image1\n    /path/to/image2\n    /path/to/image3\n    ...\n```\nHere we assume that this file is saved at `annos/images.txt`.\n\u003cbr /\u003e\u003cbr /\u003e\n\nWe run this command to generate dhash codes of the images listed in the anno file above:\n```\n    $ export n_proc=64 # how many cpu cores to use \n    $ ./run_dedup gen_dhash $n_proc annos/images.txt\n```\nThis generates `annos/images.txt.dhash`, which is in this format:\n```\n    /path/to/image1,hash1\n    /path/to/image2,hash2\n    /path/to/image3,hash3\n    ...\n```\n\n\n#### Step 2. deduplicate via dhash  \nThis step filters out sample pairs whose hash codes differ by a number of bits within a margin.\u003cbr\u003e\nAfter the step above, we can run this command:  \n```\n    ## NOTE: do not change the order of the args\n    $ export n_proc=64 # how many cpu cores to use \n    $ ./run_dedup dedup_dhash $n_proc annos/images.txt.dhash\n```\nThis generates `annos/images.txt.dhash.dedup`, which is in the same format as `annos/images.txt`: \n```\n    /path/to/image1\n    /path/to/image2\n    /path/to/image3\n    ...\n```\nHere I use a dhash as long as 2048 bits, rather than the 128 bits used in some blogs. 
I believe that a longer dhash code can preserve more details of images, which is helpful.\n\n\n#### merge multiple deduplicated image datasets\nWe can also merge multiple datasets that have already been deduplicated, like this:  \n```\n    ## must run dedup on each dataset first\n    $ export n_proc=64 # how many cpu cores to use \n\n    $ ./run_dedup merge $n_proc annos/images_1.txt.dhash annos/images_2.txt.dhash annos/images_3.txt.dhash annos/merged.txt\n```\nThe generated `annos/merged.txt` is the file with merged image paths.  \n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoincheung%2Fimage-dedup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcoincheung%2Fimage-dedup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoincheung%2Fimage-dedup/lists"}