An open API service indexing awesome lists of open source software.

https://github.com/suminb/winnowing

A Python implementation of the Winnowing (local algorithms for document fingerprinting)
https://github.com/suminb/winnowing

Last synced: over 1 year ago
JSON representation

A Python implementation of the Winnowing (local algorithms for document fingerprinting)

Awesome Lists containing this project

README

          

Winnowing
=========

A Python implementation of the Winnowing (local algorithms for document
fingerprinting)

Original Work
=============

The original research paper can be found at
http://dl.acm.org/citation.cfm?id=872770.

Installation
============

You may install ``winnowing`` package via ``pip`` as follows:

::

pip install winnowing

Alternatively, you may also install the package by cloning this repository.

::

git clone https://github.com/suminb/winnowing.git
cd winnowing && python setup.py install

Usage
=====

.. code:: python

>>> from winnowing import winnow

>>> winnow('A do run run run, a do run run')
set([(5, 23942), (14, 2887), (2, 1966), (9, 23942), (20, 1966)])

>>> winnow('run run')
set([(0, 23942)]) # match found!

Default Hash Function
~~~~~~~~~~~~~~~~~~~~~

Quite honestly, I did not know what hash function to use. The paper did
not talk about it. So I decided to use a part of SHA-1; more precisely,
the last 16 bits of the digest.

Custom Hash Function
~~~~~~~~~~~~~~~~~~~~

You may use your own hash function as demonstrated below.

.. code:: python

def hash_md5(text):
import hashlib

hs = hashlib.md5(text)
hs = hs.hexdigest()
hs = int(hs, 16)

return hs

# Override the hash function
winnow.hash_function = hash_md5

winnow('The cake was a lie')

Lower Bound of Fingerprint Density
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(TODO: Write this section)