https://github.com/shreyansh26/minhash-implemenation

A simple MinHash implementation based on the explanation in the Mining of Massive Datasets course by Stanford
document-similarity minhash minhash-similarity plagiarism-detection


Host: GitHub
URL: https://github.com/shreyansh26/minhash-implemenation
Owner: shreyansh26
License: mit
Created: 2022-09-17T18:39:19.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2022-09-17T18:50:23.000Z (about 3 years ago)
Last Synced: 2025-01-14T02:14:15.618Z (9 months ago)
Topics: document-similarity, minhash, minhash-similarity, plagiarism-detection
Language: Python
Homepage:
Size: 7.4 MB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# MinHash Implementation

A simple MinHash ([original paper](https://web.archive.org/web/20150131043133/http://gatekeeper.dec.com/ftp/pub/dec/SRC/publications/broder/positano-final-wpnums.pdf)) implementation that identifies similar documents based on keywords. A good explanation can be found in the Mining of Massive Datasets course by Stanford: [Chapter 3, through Section 3.3](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf), covers MinHashing and all the concepts required to understand the code.

For a large number of documents (10,000 in this case), MinHashing correctly identifies all 80 pairs of plagiarized documents.

## Overview of steps involved
1. Parse the ground truth data to create the plagiarized-document mappings
2. Convert the documents to 3-word shingles and create the mapping
3. Define the similarity matrices, using triangular matrices to reduce memory usage
4. Create MinHash signatures for each document
5. Compare all signatures
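
Steps 2 to 5 can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: all function names are hypothetical, hashing shingles with CRC32 is an assumption, and the random hash functions of the form `(a * x + b) % PRIME` are one common way to simulate the row permutations described in the MMDS chapter.

```python
import random
import zlib

# Assumption: a Mersenne prime larger than any 32-bit shingle hash.
PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Return the set of k-word shingles in `text`, each hashed to a 32-bit int."""
    words = text.lower().split()
    return {
        zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
        for i in range(len(words) - k + 1)
    }


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for `num_hashes` hash functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def signature(shingle_set, params):
    """MinHash signature: the minimum hash value of the set under each hash function."""
    # `default` covers documents with fewer than k words (empty shingle set).
    return [min(((a * s + b) % PRIME for s in shingle_set), default=PRIME)
            for a, b in params]


def triangle_index(i, j, n):
    """Position of the pair (i, j), with i < j, in a flat upper-triangular array."""
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


def estimate_similarities(signatures):
    """Estimate the Jaccard similarity of every document pair from the signatures."""
    n = len(signatures)
    num_hashes = len(signatures[0])
    # Only n*(n-1)/2 entries are stored instead of a full n*n matrix.
    estimates = [0.0] * (n * (n - 1) // 2)
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(x == y for x, y in zip(signatures[i], signatures[j]))
            estimates[triangle_index(i, j, n)] = matches / num_hashes
    return estimates


# Usage with three toy documents (made up for illustration).
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "an entirely different sentence about hashing large collections",
]
params = make_hash_params(num_hashes=10)
sigs = [signature(shingles(d), params) for d in docs]
similarities = estimate_similarities(sigs)
print(similarities[triangle_index(0, 1, len(docs))])
```

The fraction of signature positions on which two documents agree estimates the Jaccard similarity of their shingle sets, so pairs whose estimate exceeds a chosen threshold can be flagged as likely plagiarism. More hash functions give a tighter estimate at the cost of more computation.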
