https://github.com/scrapinghub/python-simhash

An efficient simhash implementation for python
data-science

Host: GitHub
URL: https://github.com/scrapinghub/python-simhash
Owner: scrapinghub
License: bsd-3-clause
Created: 2014-08-05T12:23:46.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2019-10-25T15:27:04.000Z (over 5 years ago)
Last Synced: 2025-04-25T02:43:33.653Z (3 months ago)
Topics: data-science
Language: C
Size: 6.84 KB
Stars: 124
Watchers: 12
Forks: 31
Open Issues: 4
Metadata Files:
- Readme: README.rst
- License: LICENSE

=======
simhash
=======

This is an efficient implementation of some functions that are useful for implementing near-duplicate detection based on Charikar's simhash. It is a Python module, written in C with GCC extensions, and includes the following functions:

`fingerprint`
    Generates a fingerprint from a sequence of hashes.

`weighted_fingerprint`
    Generates a fingerprint from a sequence of (long long, weight) tuples.

`fnvhash`
    Generates a (FNV-1a) hash from a string.

`hamming_distance`
    Calculates the number of bits that differ between two long long integers.

`simpair_indices`
    Finds the indices of hashes in a sequence that differ by fewer than a certain number of bits. It includes arguments for rotating and grouping hashes. It can be used to help efficiently implement online or batch near-duplicate detection, for example as described in *Detecting Near-Duplicates for Web Crawling* by Gurmeet Manku, Arvind Jain, and Anish Das Sarma.
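The rotating and grouping idea can be illustrated in pure Python. The sketch below is not the module's implementation and does not reproduce the signature of `simpair_indices`; the name `near_duplicate_pairs` and its parameters are hypothetical. It assumes 64-bit hashes and relies on the pigeonhole argument: if two hashes differ in at most ``max_bits`` bits, then out of ``max_bits + 1`` disjoint blocks at least one block is identical in both, so only hashes that share a block need to be compared.

```python
from collections import defaultdict
from itertools import combinations

BITS = 64
MASK = (1 << BITS) - 1


def rotate_left(value, shift):
    """Rotate a 64-bit value left by `shift` bits."""
    shift %= BITS
    return ((value << shift) | (value >> (BITS - shift))) & MASK


def near_duplicate_pairs(hashes, max_bits=3):
    """Return index pairs (i, j), i < j, whose hashes differ in at most max_bits bits.

    Hypothetical helper for illustration only: it groups hashes by one block
    of bits at a time and compares only the members of each group.
    """
    blocks = max_bits + 1
    width = BITS // blocks  # 16 bits per block when max_bits is 3
    found = set()
    for block in range(blocks):
        groups = defaultdict(list)
        for index, value in enumerate(hashes):
            # Rotate the current block into the top bits and use it as the group key.
            key = rotate_left(value & MASK, block * width) >> (BITS - width)
            groups[key].append(index)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if bin((hashes[i] ^ hashes[j]) & MASK).count("1") <= max_bits:
                    found.add((i, j))
    return sorted(found)
```

Note that this sketch treats the threshold as inclusive ("at most ``max_bits``"), whereas the description above says "fewer than"; adjust the comparison if an exclusive bound is wanted.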

Example usage
-------------

Generate hashes::

    >>> from simhash import fingerprint
    >>> hash1 = fingerprint(map(hash, "some text we want to hash"))
    >>> hash2 = fingerprint(map(hash, "some more text we want to hash"))
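To make clear what a simhash fingerprint is, here is a minimal pure-Python sketch of the technique. It is not the module's C code: the names `fnv1a_64`, `simhash_fingerprint`, and `text_fingerprint` are hypothetical, hashes are treated as unsigned 64-bit values, ties resolve to a 0 bit, and features are whitespace-separated words rather than the characters hashed in the example above. All of these are assumptions made for illustration.

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime
MASK64 = (1 << 64) - 1


def fnv1a_64(data):
    """Return the 64-bit FNV-1a hash of a bytes object."""
    value = FNV64_OFFSET
    for byte in data:
        value ^= byte
        value = (value * FNV64_PRIME) & MASK64
    return value


def simhash_fingerprint(weighted_hashes, bits=64):
    """Combine (hash, weight) pairs into one fingerprint by per-bit voting."""
    votes = [0] * bits
    for value, weight in weighted_hashes:
        for position in range(bits):
            # A set bit votes for 1 at this position, a clear bit votes for 0.
            if (value >> position) & 1:
                votes[position] += weight
            else:
                votes[position] -= weight
    fingerprint = 0
    for position, vote in enumerate(votes):
        if vote > 0:  # ties resolve to 0 in this sketch
            fingerprint |= 1 << position
    return fingerprint


def text_fingerprint(text):
    """Fingerprint a text from its whitespace-separated words, each with weight 1."""
    return simhash_fingerprint(
        (fnv1a_64(word.encode("utf-8")), 1) for word in text.split()
    )
```

Because each output bit is a majority vote over the input hashes, texts that share most of their features tend to produce fingerprints that differ in only a few bits.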

Measure distance between hashes::

    >>> from simhash import hamming_distance
    >>> hamming_distance(hash1, hash2)
    2L
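For reference, the distance itself is simple to express in pure Python. The sketch below is an illustration, not the module's code: `hamming` and `is_near_duplicate` are hypothetical names, and the default threshold of 3 bits is an assumption borrowed from the setting that Manku, Jain, and Das Sarma report for 64-bit fingerprints. The right threshold depends on the data.

```python
def hamming(a, b, bits=64):
    """Count the bit positions in which two `bits`-wide integers differ."""
    mask = (1 << bits) - 1
    # Masking keeps the result correct for negative (signed) inputs as well.
    return bin((a ^ b) & mask).count("1")


def is_near_duplicate(a, b, max_bits=3):
    """Treat two fingerprints as near duplicates if they differ in at most max_bits bits."""
    return hamming(a, b) <= max_bits
```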

This code was used in MapReduce jobs over a large dataset of web pages as part of a prototype at Scrapinghub.
