https://github.com/italo-batista/lsh-semantic-similarity
Locality Sensitive Hashing for semantic similarity (Python 3.x)
- Host: GitHub
- URL: https://github.com/italo-batista/lsh-semantic-similarity
- Owner: italo-batista
- Created: 2017-11-14T20:45:06.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-06-08T01:34:48.000Z (over 7 years ago)
- Last Synced: 2025-07-01T03:02:55.131Z (3 months ago)
- Topics: jaccard-similarity, lsh, textual-analysis, tutorial
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 15
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Locality Sensitive Hashing for semantic similarity
LSH (Locality Sensitive Hashing) is primarily used to find near-duplicates within a large set of documents.
It can use Hamming distance, the Jaccard coefficient, edit distance, or another notion of distance. You can read the following tutorials if you want to understand more about it:
- [Ravi Kumar's work](https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf)
- [Matti Lyra's work](https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html)
- [Insideops](https://insideops.wordpress.com/2015/07/30/similarity-search-and-hashing-for-text-documents/)

Although LSH is better suited to finding near-duplicate documents than semantically similar ones, this project makes an effort to use LSH to calculate semantic similarity among texts. To that end, the algorithm extracts the text's main tokens using TF-IDF (or you can pre-compute them and pass them as a parameter). It also uses MinHash (which estimates Jaccard similarity) as the similarity function.

**The overall aim is to reduce the number of comparisons needed to find similar items. LSH uses hash collisions to capture object similarities.**
Hash collisions come in handy here because similar documents have a high probability of hashing to the same value: the probability of a MinHash collision is exactly the Jaccard similarity of the two sets. See [this tutorial](tutorial.ipynb) to learn how to use this LSH!
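The collision property can be illustrated with a self-contained MinHash sketch. This is an illustration of the general technique, not this repository's implementation; the function names and example token sets are hypothetical:

```python
import random

def minhash_signature(tokens, num_hashes=200, seed=42):
    """Compute a MinHash signature: for each of num_hashes random
    hash functions, keep the minimum hash value over the token set."""
    rng = random.Random(seed)
    prime = 2_147_483_647
    # Random (a, b) parameters for universal hashing: h(x) = (a*x + b) % prime
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % prime for t in tokens)
            for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

# Hypothetical example token sets (4 shared tokens out of 8 total).
a = {"locality", "sensitive", "hashing", "finds", "near", "duplicates"}
b = {"locality", "sensitive", "hashing", "finds", "similar", "documents"}

sig_a = minhash_signature(a)
sig_b = minhash_signature(b)
print(jaccard(a, b))                    # exact Jaccard: 4/8 = 0.5
print(estimated_jaccard(sig_a, sig_b))  # MinHash estimate, close to 0.5
```

With 200 hash functions the estimate typically lands within a few hundredths of the exact Jaccard similarity; more hash functions tighten the estimate at the cost of more computation.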
Run the following to install the dependencies:
```shell
python3 -m pip install -r requirements.txt
```
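As an illustration of the TF-IDF token-extraction step described above, here is a stdlib-only sketch of the general idea (the function and the example corpus are hypothetical, not this repository's code):

```python
import math
from collections import Counter

def top_tfidf_tokens(docs, doc_index, k=3):
    """Rank one document's tokens by TF-IDF and return the top k.
    tf = raw count in the document; idf = log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each token once per document
    tf = Counter(tokenized[doc_index])
    scores = {t: count * math.log(n / df[t]) for t, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical example corpus for illustration only.
docs = [
    "locality sensitive hashing finds near duplicate documents",
    "minhash approximates jaccard similarity between token sets",
    "hash collisions capture similarity between documents",
]
print(top_tfidf_tokens(docs, 0))  # three of doc 0's distinctive tokens
```

Tokens that appear in many documents (such as "documents" here) receive a low IDF weight, so the top tokens are those distinctive to the document, which is what makes them useful inputs for MinHash.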