https://github.com/marcnuth/deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
algorithms cv deduplication google imagehash shingling simhash
- Host: GitHub
- URL: https://github.com/marcnuth/deduplication
- Owner: Marcnuth
- License: apache-2.0
- Created: 2019-08-19T04:30:19.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-08-28T09:43:09.000Z (about 2 years ago)
- Last Synced: 2024-11-15T09:52:19.745Z (11 months ago)
- Topics: algorithms, cv, deduplication, google, imagehash, shingling, simhash
- Language: Python
- Homepage:
- Size: 22.5 KB
- Stars: 16
- Watchers: 2
- Forks: 6
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# deduplication

Remove duplicate documents via popular algorithms such as SimHash, SpotSig, Shingling, etc.
## Install
Run the following commands:
```
# install current library
pip install deduplication

# install required pretrained NLP models
python -m spacy download xx_ent_wiki_sm
python -m spacy download en_core_web_sm
```

## Example
__SimHash__
```python
from deduplication import simhash

hashvalue1 = simhash('this is text')
hashvalue2 = simhash('this is another text', n_block=4)
```

__L-SimHash__
```python
from deduplication import lsimhash

hashvalue = lsimhash('this is very long article texts. maybe with a lot of sentences.')
```

## Citation
__SimHash__
```
Sadowski C, Levin G.
Simhash: Hash-based similarity detection[J].
Technical report, Google, 2007.
```
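
For readers unfamiliar with the technique cited above, here is a minimal, self-contained sketch of the general SimHash idea: each token votes on every bit of a fingerprint, and near-duplicate texts end up with fingerprints that differ in only a few bits. This is not this library's implementation. The names `simhash_sketch` and `hamming_distance`, the whitespace tokenization, the MD5 token hash, and the 64-bit width are all assumptions chosen for illustration.

```python
import hashlib


def simhash_sketch(text: str, bits: int = 64) -> int:
    """Illustrative SimHash over whitespace tokens (hypothetical helper).

    `bits` must be a multiple of 8 and at most 128, because the token
    hash is taken from the leading bytes of a 16-byte MD5 digest.
    """
    weights = [0] * bits
    for token in text.lower().split():
        # Hash each token to a stable integer that is `bits` wide.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 for every set bit and -1 for every clear bit.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1

    # Bit i of the fingerprint is set when its votes sum to a positive value.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = simhash_sketch("this is text")
    b = simhash_sketch("this is another text")
    print(hamming_distance(a, b))
```

A small Hamming distance between two fingerprints suggests the texts are near-duplicates; the threshold to use depends on the data and is not specified here.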