https://github.com/shreyansh26/minhash-implemenation
A simple MinHash implementation based on the explanation in the Mining of Massive Datasets course by Stanford
document-similarity minhash minhash-similarity plagiarism-detection
- Host: GitHub
- URL: https://github.com/shreyansh26/minhash-implemenation
- Owner: shreyansh26
- License: mit
- Created: 2022-09-17T18:39:19.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2022-09-17T18:50:23.000Z (about 3 years ago)
- Last Synced: 2025-01-14T02:14:15.618Z (9 months ago)
- Topics: document-similarity, minhash, minhash-similarity, plagiarism-detection
- Language: Python
- Homepage:
- Size: 7.4 MB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# MinHash Implementation
A simple MinHash ([original paper](https://web.archive.org/web/20150131043133/http://gatekeeper.dec.com/ftp/pub/dec/SRC/publications/broder/positano-final-wpnums.pdf)) implementation that identifies similar documents based on keywords. A good explanation can be found in Stanford's Mining of Massive Datasets course: [Chapter 3, up to Section 3.3](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf), covers MinHashing and all the concepts required to understand the code.
On a large collection of documents (10,000 in this case), MinHashing correctly identifies all 80 pairs of plagiarized documents.
## Overview of steps involved
1. Parse the ground truth data to create the plagiarized-document mappings
2. Convert the documents to 3-word shingles and create the mapping
3. Define the similarity matrices, using triangular matrices to reduce memory usage
4. Create MinHash signatures for each document
5. Compare all signatures
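
Steps 2 to 5 can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names, the CRC32 shingle hashing, the `(a * x + b) % p` hash family, and the choice of prime are all assumptions made for the example, and it assumes every document contains at least `k` words.

```python
import random
import zlib

# Assumption: the Mersenne prime 2**61 - 1 as the modulus, chosen only because
# it is larger than any 32-bit shingle ID.
PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Step 2: return the set of hashed k-word shingles of `text`."""
    words = text.split()
    return {
        zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
        for i in range(len(words) - k + 1)
    }


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """Step 4: the signature is the minimum hash value under each function."""
    return [min((a * s + b) % PRIME for s in shingle_set) for a, b in params]


def triangle_index(i, j, n):
    """Step 3: flat index of pair (i, j) with i < j among n documents.

    Only the n*(n-1)/2 entries above the diagonal are stored.
    """
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


def estimate_similarities(signatures):
    """Step 5: fraction of matching signature positions for every pair."""
    n = len(signatures)
    num_hashes = len(signatures[0])
    sims = [0.0] * (n * (n - 1) // 2)
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(x == y for x, y in zip(signatures[i], signatures[j]))
            sims[triangle_index(i, j, n)] = matches / num_hashes
    return sims


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "an entirely unrelated sentence about hashing large document sets",
    ]
    params = make_hash_params(num_hashes=100)
    sigs = [minhash_signature(shingles(d), params) for d in docs]
    sims = estimate_similarities(sigs)
    # Identical documents share every shingle, so their signatures match fully.
    print(sims[triangle_index(0, 1, len(docs))])
```

The fraction of matching signature positions is an estimate of the Jaccard similarity of the two shingle sets, so using more hash functions tightens the estimate at the cost of longer signatures.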