https://github.com/suminb/winnowing
A Python implementation of the Winnowing (local algorithms for document fingerprinting)
- Host: GitHub
- URL: https://github.com/suminb/winnowing
- Owner: suminb
- License: other
- Created: 2013-09-26T20:21:26.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2019-10-29T17:29:03.000Z (over 6 years ago)
- Last Synced: 2025-03-18T02:44:39.575Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 53
- Watchers: 2
- Forks: 15
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- Changelog: changelog.rst
- License: LICENSE
Winnowing
=========
A Python implementation of Winnowing, a local algorithm for document
fingerprinting.
Original Work
=============
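The original research paper, *Winnowing: Local Algorithms for Document
Fingerprinting* (Schleimer, Wilkerson, and Aiken, SIGMOD 2003), can be found at
http://dl.acm.org/citation.cfm?id=872770.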
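As a rough orientation, the scheme described in the paper hashes every k-gram
of the text, slides a window of w consecutive hashes over the result, and keeps
the minimum hash of each window (the rightmost one on ties) together with its
position. The following is a minimal sketch under those assumptions, not this
package's actual code; ``winnow_sketch``, ``crc32_hash``, and the default
values ``k=5`` and ``w=4`` are illustrative names and choices.

.. code:: python

    import zlib


    def crc32_hash(gram):
        # Stand-in hash for illustration only (an assumption, not the package default)
        return zlib.crc32(gram.encode('utf-8'))


    def winnow_sketch(text, hash_function=crc32_hash, k=5, w=4):
        """Return a set of (position, hash) fingerprints for ``text``."""
        # Hash every k-gram, remembering the position where it starts
        hashes = [(i, hash_function(text[i:i + k]))
                  for i in range(len(text) - k + 1)]

        fingerprints = set()
        # Slide a window of w consecutive hashes over the sequence
        for start in range(len(hashes) - w + 1):
            window = hashes[start:start + w]
            # Keep the smallest hash; on ties prefer the rightmost position
            fingerprints.add(min(window, key=lambda ph: (ph[1], -ph[0])))
        return fingerprints


    print(winnow_sketch('A do run run run, a do run run'))

In this sketch, a text that yields fewer than ``w`` k-gram hashes produces an
empty set.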
Installation
============
You may install the ``winnowing`` package via ``pip`` as follows:

::

    pip install winnowing

Alternatively, you may install the package by cloning this repository:

::

    git clone https://github.com/suminb/winnowing.git
    cd winnowing && python setup.py install
Usage
=====
.. code:: python

    >>> from winnowing import winnow
    >>> winnow('A do run run run, a do run run')
    set([(5, 23942), (14, 2887), (2, 1966), (9, 23942), (20, 1966)])
    >>> winnow('run run')
    set([(0, 23942)])  # match found!
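Each fingerprint is a ``(position, hash)`` pair, so two texts share a passage
when a hash value appears in both sets. A minimal sketch of that comparison,
using the two results shown above as literal data; ``shared_hashes`` is a
hypothetical helper, not part of the package:

.. code:: python

    def shared_hashes(fingerprints_a, fingerprints_b):
        """Return the hash values present in both fingerprint sets."""
        # Positions differ between documents, so compare hash values only
        hashes_a = {h for _, h in fingerprints_a}
        hashes_b = {h for _, h in fingerprints_b}
        return hashes_a & hashes_b


    # Fingerprints copied from the usage example above
    document = {(5, 23942), (14, 2887), (2, 1966), (9, 23942), (20, 1966)}
    query = {(0, 23942)}

    common = shared_hashes(document, query)
    # Positions in the longer text where a shared hash occurs
    positions = sorted(pos for pos, h in document if h in common)
    print(common, positions)  # {23942} [5, 9]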
Default Hash Function
~~~~~~~~~~~~~~~~~~~~~
Quite honestly, I did not know which hash function to use, and the paper
does not prescribe one. I therefore decided to use part of SHA-1: more
precisely, the last 16 bits of the digest.
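As an illustration of what "the last 16 bits of the digest" means, here is a
minimal sketch built on the standard ``hashlib`` module. The function name
``sha1_last_16_bits`` and the UTF-8 encoding step are assumptions made for this
example; the package's own implementation may differ in detail.

.. code:: python

    import hashlib


    def sha1_last_16_bits(text):
        """Return the last 16 bits of the SHA-1 digest of ``text`` as an int."""
        # hashlib operates on bytes, so encode the text first (UTF-8 assumed)
        hex_digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
        # Four hexadecimal characters correspond to 16 bits
        return int(hex_digest[-4:], 16)


    value = sha1_last_16_bits('run r')
    print(0 <= value < 2 ** 16)  # True

Every value produced this way lies between 0 and 65535, which is consistent
with the hash values in the usage example.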
Custom Hash Function
~~~~~~~~~~~~~~~~~~~~
You may use your own hash function as demonstrated below.
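.. code:: python

    import hashlib

    from winnowing import winnow

    def hash_md5(text):
        # hashlib requires bytes on Python 3, so encode str input first
        if isinstance(text, str):
            text = text.encode('utf-8')
        # Interpret the hexadecimal MD5 digest as an integer
        return int(hashlib.md5(text).hexdigest(), 16)

    # Override the hash function
    winnow.hash_function = hash_md5
    winnow('The cake was a lie')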
Lower Bound of Fingerprint Density
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(TODO: Write this section)