https://github.com/jina-ai/executor-image-hasher
An executor to encode images using comparable hashing techniques. Useful for duplicate detection
https://github.com/jina-ai/executor-image-hasher
Last synced: 3 months ago
JSON representation
An executor to encode images using comparable hashing techniques. Useful for duplicate detection
- Host: GitHub
- URL: https://github.com/jina-ai/executor-image-hasher
- Owner: jina-ai
- Created: 2021-11-08T09:31:24.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-09-12T02:38:35.000Z (about 3 years ago)
- Last Synced: 2025-03-07T03:46:28.966Z (7 months ago)
- Language: Python
- Homepage: https://github.com/jina-ai/executor-image-hasher
- Size: 925 KB
- Stars: 2
- Watchers: 25
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# ImageHasher
An executor to encode the images into embeddings using different comparable hashing technique, such as `perceptual`,
`average`, `wavelet`, and `difference` techniques. Features in the image are used to generate a distinct
(but not unique) fingerprint, and these fingerprints are comparable.Comparable hashes are a different concept compared to cryptographic hash functions like `MD5` and `SHA1`. With
cryptographic hashes, the hash values are random. The data used to generate the hash acts like a random seed, so the
same data will generate the same result, but different data will create different results. Comparing two SHA1 hash
values really only tells you two things. If the hashes are different, then the data is different. And if the hashes are
the same, then the data is likely the same. In contrast, comparable hashes such as `perceptual` can be compared --
giving you a sense of similarity between the two data sets. All comparable hashes have the same basic properties:
images can be scaled larger or smaller, have different aspect ratios, and even minor coloring differences (contrast,
brightness, etc.) and they will still match similar images. A good usage of this is the detection of duplicate images.ImageHasher receives the `Documents` with `tensor` attributes. The executor will encode the images into a vector of either
`dtype=np.uint8` or `dtype=np.bool` and store them in the `embedding` attribute of the `Document`.