https://github.com/italo-batista/lsh-semantic-similarity
Locality Sensitive Hashing for semantic similarity (Python 3.x)
- Host: GitHub
- URL: https://github.com/italo-batista/lsh-semantic-similarity
- Owner: italo-batista
- Created: 2017-11-14T20:45:06.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-06-08T01:34:48.000Z (over 7 years ago)
- Last Synced: 2025-07-01T03:02:55.131Z (3 months ago)
- Topics: jaccard-similarity, lsh, textual-analysis, tutorial
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 15
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Locality Sensitive Hashing for semantic similarity
LSH (Locality Sensitive Hashing) is primarily used to find near-duplicates within a large set of documents.
It can use Hamming distance, the Jaccard coefficient, edit distance, or another notion of distance. You can read the following tutorials if you want to understand more about it:
- [Ravi Kumar's work](https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf)
- [Matti Lyra's work](https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html)
- [Insideops](https://insideops.wordpress.com/2015/07/30/similarity-search-and-hashing-for-text-documents/)

Although LSH is better suited to finding near-duplicate documents than semantically similar ones, this project makes an effort to use LSH to calculate semantic similarity among texts. To that end, the algorithm extracts the text's main tokens using TF-IDF (or you can pre-compute them and pass them as a parameter). It also uses MinHash (which estimates Jaccard similarity) as the similarity function.

**The overall aim is to reduce the number of comparisons needed to find similar items. LSH uses hash collisions to capture object similarities.**
Hash collisions come in handy here because similar documents have a high probability of hashing to the same value: the probability of a MinHash collision is exactly the Jaccard similarity of the two sets. See [this tutorial](tutorial.ipynb) to learn how to use this LSH!
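The collision property can be illustrated with a self-contained MinHash sketch. This is an illustration of the general technique, not this repository's implementation; the function names and example token sets are hypothetical:

```python
import random

def minhash_signature(tokens, num_hashes=200, seed=42):
    """Compute a MinHash signature: for each of num_hashes random
    hash functions, keep the minimum hash value over the token set."""
    rng = random.Random(seed)
    prime = 2_147_483_647
    # Random (a, b) parameters for universal hashing: h(x) = (a*x + b) % prime
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % prime for t in tokens)
            for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

# Hypothetical example token sets (4 shared tokens out of 8 total).
a = {"locality", "sensitive", "hashing", "finds", "near", "duplicates"}
b = {"locality", "sensitive", "hashing", "finds", "similar", "documents"}

sig_a = minhash_signature(a)
sig_b = minhash_signature(b)
print(jaccard(a, b))                    # exact Jaccard: 4/8 = 0.5
print(estimated_jaccard(sig_a, sig_b))  # MinHash estimate, close to 0.5
```

With 200 hash functions the estimate typically lands within a few hundredths of the exact Jaccard similarity; more hash functions tighten the estimate at the cost of more computation.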
Run the following to install the dependencies:
```shell
python3 -m pip install -r requirements.txt
```
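As an illustration of the TF-IDF token-extraction step described above, here is a stdlib-only sketch of the general idea (the function and the example corpus are hypothetical, not this repository's code):

```python
import math
from collections import Counter

def top_tfidf_tokens(docs, doc_index, k=3):
    """Rank one document's tokens by TF-IDF and return the top k.
    tf = raw count in the document; idf = log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each token once per document
    tf = Counter(tokenized[doc_index])
    scores = {t: count * math.log(n / df[t]) for t, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical example corpus for illustration only.
docs = [
    "locality sensitive hashing finds near duplicate documents",
    "minhash approximates jaccard similarity between token sets",
    "hash collisions capture similarity between documents",
]
print(top_tfidf_tokens(docs, 0))  # three of doc 0's distinctive tokens
```

Tokens that appear in many documents (such as "documents" here) receive a low IDF weight, so the top tokens are those distinctive to the document, which is what makes them useful inputs for MinHash.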