https://github.com/18520339/finding-similar-images
Finding similar images from image URLs using ImageHash
data-preparation google-sheets-api gspread imagehash similar-images
- Host: GitHub
- URL: https://github.com/18520339/finding-similar-images
- Owner: 18520339
- Created: 2021-02-05T10:26:54.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2021-11-13T23:41:37.000Z (over 4 years ago)
- Last Synced: 2025-04-12T18:57:06.743Z (about 1 year ago)
- Topics: data-preparation, google-sheets-api, gspread, imagehash, similar-images
- Language: Python
- Homepage: https://www.youtube.com/watch?v=G3kVp-01nn8
- Size: 1.72 MB
- Stars: 8
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Finding similar images

## Introduction
- In my [Data Science project](https://github.com/18520339/vietnamese-foods), my team had to collect images from many **Search Engines** to build a dataset, and we chose **Google Sheets** to assign labeling tasks to each member because of its convenience.
- Crawling the Internet returns many similar images, which can bias the dataset. Here is my solution for filtering similar images during the **Data Preparation** step.
## Implementation
1. Get image URLs from **Search Engines**. I have a repo for that [here](https://github.com/18520339/image-search-downloader)
2. Copy and paste these URLs into **Google Sheets**. Here, we can see how similar images are arranged next to each other
3. Connect to **Google Sheets** using **Python**
4. With only 1 hash value, some images are reported as the same even though they are different. Therefore, we decided to calculate 3 hash values for each pair of images:
- Average hashing ([ahash](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html))
- Perceptual hashing ([phash](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html))
- Difference hashing ([dhash](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html))

5. If at least 2 of these 3 distances indicate that the 2 images are similar (distance < **different points**), then arrange these images next to each other
```python
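# Subtracting two ImageHash objects gives their Hamming distance
distances = [ahash0 - ahash1, phash0 - phash1, dhash0 - dhash1]
# Count how many of the 3 hashes place the pair under the threshold
diff_results = sum(dist < args['diff'] for dist in distances)
if diff_results >= 2:
    print(f'|--Similar with url {idx1 + 1}: {url1}')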
```
6. Decide which images to keep and begin labeling
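
To make step 4 more concrete, here is a minimal pure-Python sketch of how average hashing and difference hashing turn pixels into bits. It assumes the image has already been converted to grayscale and shrunk to a tiny grid, which the real algorithms do first; perceptual hashing (which adds a DCT) is left out. The function names are illustrative and are not ImageHash's API, and the comparison direction in `difference_hash` is just one possible convention.

```python
from typing import List

Grid = List[List[int]]  # rows of grayscale values, assumed already downscaled


def average_hash(pixels: Grid) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the grid mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [int(value > mean) for value in flat]


def difference_hash(pixels: Grid) -> List[int]:
    """One bit per horizontal neighbor pair: 1 if the right pixel is brighter."""
    return [int(right > left) for row in pixels for left, right in zip(row, row[1:])]


def hamming(bits_a: List[int], bits_b: List[int]) -> int:
    """Number of positions where two equal-length bit lists differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Toy 2x3 "images" (hypothetical data, far smaller than a real hash grid)
original = [[10, 20, 30], [40, 50, 60]]
brighter = [[15, 25, 35], [45, 55, 65]]   # same picture, uniformly brighter
mirrored = [[60, 50, 40], [30, 20, 10]]   # a different picture

print(hamming(average_hash(original), average_hash(brighter)))        # 0
print(hamming(average_hash(original), average_hash(mirrored)))        # 6
print(hamming(difference_hash(original), difference_hash(mirrored)))  # 4
```

A uniform brightness change leaves both hashes untouched, while a genuinely different picture flips many bits, which is why a small Hamming distance is a reasonable signal of similarity.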

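The 2-out-of-3 vote in step 5 can also be sketched without any image library. The version below is an illustration under assumptions, not the code in `sort_similar.py`: hashes are plain integers, `hamming`, `is_similar`, and `arrange_similar` are hypothetical helper names, and the greedy grouping is only one simple way to put similar URLs next to each other.

```python
from typing import List, Sequence, Tuple

HashTriple = Tuple[int, int, int]  # (ahash, phash, dhash) stored as integers


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def is_similar(h0: HashTriple, h1: HashTriple, diff: int, votes_needed: int = 2) -> bool:
    """True when at least `votes_needed` of the 3 distances are below `diff`."""
    distances = [hamming(x, y) for x, y in zip(h0, h1)]
    return sum(dist < diff for dist in distances) >= votes_needed


def arrange_similar(urls: Sequence[str], hashes: Sequence[HashTriple], diff: int) -> List[str]:
    """Greedy reordering: each unplaced URL is followed by the URLs similar to it."""
    ordered: List[str] = []
    placed = set()
    for i, url in enumerate(urls):
        if i in placed:
            continue
        placed.add(i)
        ordered.append(url)
        for j in range(i + 1, len(urls)):
            if j not in placed and is_similar(hashes[i], hashes[j], diff):
                placed.add(j)
                ordered.append(urls[j])
    return ordered


# Hypothetical hash values chosen so that a~c and b~d under diff=3
urls = ["a", "b", "c", "d"]
hashes = [
    (0x00, 0x00, 0x00),  # a
    (0xFF, 0xFF, 0xFF),  # b
    (0x01, 0x00, 0xFF),  # c: close to a on 2 of 3 hashes
    (0xFE, 0xFF, 0x00),  # d: close to b on 2 of 3 hashes
]
result = arrange_similar(urls, hashes, diff=3)
print(result)  # ['a', 'c', 'b', 'd']
```

Requiring two agreeing hashes means a single hash that happens to collide (such as the third hash of `a` and `d` above) is not enough to call two images similar.
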
## Usage
1. Install libraries: `pip install -r requirements.txt`
2. Sort similar images in **Google Sheets**:
- Example: `python sort_similar.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json`
```
usage: sort_similar.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH [-d DIFF]

optional arguments:
  -h, --help            show this help message and exit
  -s SPREADSHEET, --spreadsheet SPREADSHEET
                        spreadsheet name
  -w WORKSHEET, --worksheet WORKSHEET
                        worksheet name
  -r RANGE, --range RANGE
                        updated range
  -a AUTH, --auth AUTH  credentials file
  -d DIFF, --diff DIFF  different points
```
3. Download images from URLs in **Google Sheets**:
- Example: `python download_images.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json -o images/`
```
usage: download_images.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH -o OUT

optional arguments:
  -h, --help            show this help message and exit
  -s SPREADSHEET, --spreadsheet SPREADSHEET
                        spreadsheet name
  -w WORKSHEET, --worksheet WORKSHEET
                        worksheet name
  -r RANGE, --range RANGE
                        updated range
  -a AUTH, --auth AUTH  credentials file
  -o OUT, --out OUT     path to images directory
```
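
For reference, the options listed for `sort_similar.py` could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the help text above, not the script's actual source; in particular, the integer type and the default of `10` for `--diff` are assumptions, since the README does not state them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options shown in the sort_similar.py usage text."""
    parser = argparse.ArgumentParser(prog="sort_similar.py")
    parser.add_argument("-s", "--spreadsheet", required=True, help="spreadsheet name")
    parser.add_argument("-w", "--worksheet", required=True, help="worksheet name")
    parser.add_argument("-r", "--range", required=True, help="updated range")
    parser.add_argument("-a", "--auth", required=True, help="credentials file")
    # Assumption: integer threshold with a hypothetical default of 10
    parser.add_argument("-d", "--diff", type=int, default=10, help="different points")
    return parser


# Parse the example command line from the README into a dict,
# so values can be read as args['diff'] like in the snippet above
args = vars(build_parser().parse_args(
    ["-s", "example", "-w", "Sheet1", "-r", "B2:C", "-a", "credentials.json"]
))
print(args["spreadsheet"], args["worksheet"], args["range"], args["diff"])
```

Passing `-d` with a smaller number would make the similarity check stricter, because fewer differing bits are tolerated before two hashes count as different.
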
## Reference
- [How to determine whether 2 images are equal or not with the perceptual hash in Python](https://ourcodeworld.com/articles/read/1006/how-to-determine-whether-2-images-are-equal-or-not-with-the-perceptual-hash-in-python)
- [A Python Perceptual Image Hashing Module](https://github.com/JohannesBuchner/imagehash)