https://github.com/adulau/photodna
PhotoDNA
https://github.com/adulau/photodna
image-correlation image-search perceptual-hashing photodna
Last synced: 4 days ago
JSON representation
PhotoDNA
- Host: GitHub
- URL: https://github.com/adulau/photodna
- Owner: adulau
- License: 0bsd
- Created: 2026-03-16T09:49:56.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-18T05:39:12.000Z (3 months ago)
- Last Synced: 2026-04-20T23:44:07.022Z (about 1 month ago)
- Topics: image-correlation, image-search, perceptual-hashing, photodna
- Language: Python
- Homepage:
- Size: 3.36 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# oaphotodna.py
`oaphotodna.py` computes PhotoDNA-like hashes (based on the reversed-engineered version available at [https://github.com/ArcaneNibble/open-alleged-photodna](https://github.com/ArcaneNibble/open-alleged-photodna)) for images, compares two images with normalized similarity scoring, and supports a local FAISS-backed nearest-neighbor index for fast lookup of visually similar images.
This version adds:
- a FAISS local vector index
- persistent on-disk metadata in `meta.json`
- exact L2 nearest-neighbor search
- similarity scores normalized to the same `0..1` scale as direct image comparison
- query-time filtering by minimum similarity or maximum Euclidean distance
## Requirements
- Python 3.8+
- [Pillow](https://pypi.org/project/Pillow/)
- [NumPy](https://pypi.org/project/numpy/) (recommended)
- [faiss-cpu](https://pypi.org/project/faiss-cpu/) for FAISS features
Install dependencies:
```bash
pip install pillow numpy faiss-cpu
```
## What the script does
The script supports four main workflows:
1. Compute the hash of a single image.
2. Compute hashes for every file in a directory and emit JSON.
3. Compare two images using either Euclidean or Manhattan distance.
4. Build and query a local FAISS index of previously hashed images.
The PhotoDNA-like hash is represented internally as a flat vector of 144 values. FAISS stores these vectors and searches for nearest neighbors using L2 distance.
## Help
Print the top-level help:
```bash
python bin/oaphotodna.py --help
```
The CLI uses traditional, flag-prefixed arguments (for example `--hash`, `--compare`, `--faiss-query`) rather than positional subcommands.
```bash
adulau@blakley:~/git/photodna/bin$ python3 oaphotodna.py
usage: oaphotodna.py [-h] (--hash IMAGE | --hash-dir DIRECTORY | --compare IMAGE1 IMAGE2 | --faiss-build ARG [ARG ...] | --faiss-add ARG [ARG ...] | --faiss-query ARG [ARG ...]) [--metric {euclidean,manhattan}]
[--min-similarity MIN_SIMILARITY] [--max-distance MAX_DISTANCE]
Compute and compare PhotoDNA-like hashes, with optional FAISS local indexing.
options:
-h, --help show this help message and exit
--hash IMAGE Compute the hash of one image
--hash-dir DIRECTORY Compute hashes for every file in a directory and output JSON
--compare IMAGE1 IMAGE2
Compare two images
--faiss-build ARG [ARG ...]
Create a new FAISS index: INDEX META IMAGE [IMAGE ...]
--faiss-add ARG [ARG ...]
Append images to an existing FAISS index: INDEX META IMAGE [IMAGE ...]
--faiss-query ARG [ARG ...]
Find closest indexed matches: INDEX META QUERY_IMAGE [TOP_K]
--metric {euclidean,manhattan}
Distance metric for --compare
--min-similarity MIN_SIMILARITY
With --faiss-query, filter results below this similarity threshold [0,1]
--max-distance MAX_DISTANCE
With --faiss-query, filter results above this Euclidean distance
```
## Basic usage
### 1) Hash a single image
```bash
python bin/oaphotodna.py --hash image.jpg
```
Output:
```text
73,71,74,32,...
```
### 2) Hash every file in a directory as JSON
```bash
python bin/oaphotodna.py --hash-dir tests/monochrome
```
Example output:
```json
[
{
"filename": "55147310088_42a69416d3_5k.jpg",
"path": "/full/path/to/tests/monochrome/55147310088_42a69416d3_5k.jpg",
"photodna": [73, 71, 74, 32]
}
]
```
Each JSON object includes the base filename, the absolute file path, and the 144-byte PhotoDNA-like vector. Files are processed in sorted filename order, and non-file directory entries are skipped.
### 3) Compare two images
Default metric is Euclidean:
```bash
python bin/oaphotodna.py --compare image1.jpg image2.jpg
```
Use Manhattan distance instead:
```bash
python bin/oaphotodna.py --compare image1.jpg image2.jpg --metric manhattan
```
Example output:
```text
Distance (euclidean): 3.7417
Similarity: 0.998779
```
## Similarity scale
The script reports a normalized similarity value between `0` and `1`.
- `1.0` means identical hashes
- values close to `1.0` mean very similar hashes
- values closer to `0.0` mean more distant hashes
For Euclidean distance, similarity is derived from the maximum possible distance for a 144-dimensional hash with values in the range `0..255`:
```text
similarity = 1 - (euclidean_distance / max_possible_distance)
```
The FAISS query path uses the same normalization so that the `similarity` reported by `--faiss-query` is directly comparable to the `Similarity:` line from `--compare`.
## FAISS local database
### Files used
The local database consists of two files:
- `index.faiss` — the FAISS vector index
- `meta.json` — sidecar metadata used to map FAISS IDs back to files and hashes
### What `meta.json` contains
`meta.json` stores information that FAISS does not store for you in an application-friendly way:
- `dimension` — vector length, normally `144`
- `metric` — stored metric type
- `next_id` — next numeric ID to assign
- `items` — indexed records
Each item in `items` contains:
- `id` — numeric FAISS ID
- `path` — canonicalized file path
- `hash` — stored 144-element hash
- `extra` — optional metadata placeholder
## Build an index
Create a new index from a set of images:
```bash
python bin/oaphotodna.py --faiss-build index.faiss meta.json img1.jpg img2.jpg img3.jpg
```
Expected output:
```text
Indexed 3 file(s) into index.faiss
```
## Add images to an existing index
Append more images later:
```bash
python bin/oaphotodna.py --faiss-add index.faiss meta.json img4.jpg img5.jpg
```
Expected output:
```text
Added 2 file(s) into index.faiss
```
## Query the index
Search for the closest matches to a query image:
```bash
python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg
```
Specify the number of results to return:
```bash
python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg 20
```
Example output:
```text
Query: query.jpg
Results: 3
[1] /data/images/img2.jpg
id=17
distance=3.7417
similarity=0.998779
distance_squared=14.0000
[2] /data/images/img7.jpg
id=42
distance=5.2915
similarity=0.998273
distance_squared=28.0000
```
### Filter query results by similarity
Only return matches at or above a similarity threshold:
```bash
python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg 20 --min-similarity 0.95
```
### Filter query results by Euclidean distance
Only return matches at or below a maximum Euclidean distance:
```bash
python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg 20 --max-distance 12
```
### Combine both filters
```bash
python bin/oaphotodna.py --faiss-query index.faiss meta.json query.jpg 20 --min-similarity 0.98 --max-distance 8
```
## FAISS distance notes
FAISS returns **squared L2 distance** internally.
The script converts that into:
- `distance_squared` — raw FAISS value
- `distance` — Euclidean distance (`sqrt(distance_squared)`)
- `similarity` — normalized `0..1` score derived from Euclidean distance