https://github.com/dselivanov/lshr
Locality Sensitive Hashing In R
approximate-nearest-neighbor-search locality-sensitive-hashing minhash random-projections
- Host: GitHub
- URL: https://github.com/dselivanov/lshr
- Owner: dselivanov
- License: other
- Created: 2015-06-10T21:12:36.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2019-01-11T10:53:48.000Z (over 6 years ago)
- Last Synced: 2025-04-14T10:12:59.436Z (about 1 month ago)
- Topics: approximate-nearest-neighbor-search, locality-sensitive-hashing, minhash, random-projections
- Language: R
- Size: 98.6 KB
- Stars: 40
- Watchers: 4
- Forks: 13
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Locality Sensitive Hashing in R
LSHR is a fast and memory-efficient package for near-neighbor search in high-dimensional data. Two LSH schemes are implemented at the moment:

1. Minhashing for Jaccard similarity.
2. Sketching (or random projections) for cosine similarity.
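
To make the first scheme concrete, here is a minimal base-R sketch of minhashing, assuming random permutations as the hash functions. It is an illustration of the idea, not LSHR's implementation; `jaccard` and `minhash_signature` are hypothetical helper names. The fraction of positions where two signatures agree is an estimate of the Jaccard similarity of the underlying sets.

```R
# Exact Jaccard similarity of two sets: |A and B| / |A or B|
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical helper: one minhash value per permutation.
# `perms` is a universe_size x n_hashes matrix whose columns are
# random permutations of 1..universe_size; the minhash of a set is the
# smallest permuted rank among its members.
minhash_signature <- function(set, perms) {
  apply(perms, 2, function(p) min(p[set]))
}

set.seed(42)
universe_size <- 1000
n_hashes <- 500
perms <- replicate(n_hashes, sample.int(universe_size))

a <- 1:100
b <- 51:150

sig_a <- minhash_signature(a, perms)
sig_b <- minhash_signature(b, perms)

exact <- jaccard(a, b)             # 50 shared items out of 150 distinct
estimate <- mean(sig_a == sig_b)   # share of matching minhash values
```

With more hash functions the estimate gets closer to the exact value, at the cost of longer signatures.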
Most of the ideas are based on the brilliant [Mining of Massive Datasets](http://www.mmds.org) book.

# Materials
* [Slides](http://www.slideshare.net/MailRuGroup/okru-finding-similar-items-in-highdimensional-spaces-locality-sensitive-hashing) (in English) and [video](https://youtu.be/ko0a0Z75oZQ?list=PLcJ8pdaABCSk1dNtpgaHvuV5y2gWItuUO) (in Russian) from my talk at the Moscow Data Science meetup.
# Quick reference
```R
# devtools::install_github('dselivanov/text2vec')
library(text2vec)
library(LSHR)
data("movie_review")
it <- itoken(movie_review$review, preprocess_function = tolower, tokenizer = word_tokenizer)
dtm <- create_dtm(it, hash_vectorizer())
dtm = as(dtm, "RsparseMatrix")
hashfun_number = 120
s_curve <- get_s_curve(hashfun_number, n_bands_min = 5, n_rows_per_band_min = 5)
# Examine S-curve.
# Find tradeoff between accuracy and false-positive rate.
```
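
For intuition about what an S-curve describes, the sketch below uses the banding formula from Mining of Massive Datasets: with `b` bands of `r` rows each, a pair whose signatures agree at any single position with probability `s` becomes a candidate with probability `1 - (1 - s^r)^b`. This is a plain base-R illustration under that assumption, not the code behind `get_s_curve`; `candidate_probability` and `threshold` are hypothetical helper names.

```R
# Probability that a pair shares at least one identical band,
# given per-position agreement probability s.
candidate_probability <- function(s, n_bands, n_rows_per_band) {
  1 - (1 - s^n_rows_per_band)^n_bands
}

# Rough similarity level at which the curve rises most steeply.
threshold <- function(n_bands, n_rows_per_band) {
  (1 / n_bands)^(1 / n_rows_per_band)
}

s <- seq(0, 1, by = 0.1)

# Two ways to split 120 hash functions into bands and rows.
curve_20x6 <- candidate_probability(s, n_bands = 20, n_rows_per_band = 6)
curve_6x20 <- candidate_probability(s, n_bands = 6, n_rows_per_band = 20)

print(data.frame(s, curve_20x6, curve_6x20))
print(c(threshold(20, 6), threshold(6, 20)))
```

More rows per band push the threshold up (fewer false positives, more false negatives); more bands pull it down.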

```R
seed = 1
pairs = get_similar_pairs(dtm, bands_number = 10, rows_per_band = 32, distance = 'cosine', seed = seed)
pairs[order(-N)]
#        id1  id2  N
#    1: 1054 1417 10
#    2: 1084 3462 10
#    3: 1291 1356 10
#    4: 1615 3846 10
#    5: 2805 4763  4
#   ---
# 2304: 4767 4961  1
# 2305: 4772 4776  1
# 2306: 4810 4859  1
# 2307: 4854 4945  1
# 2308: 4905 4918  1
```
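
The `distance = 'cosine'` option corresponds to the sketching scheme. As a rough base-R illustration of the general technique (sign random projections), and not of LSHR's internals, each random hyperplane contributes one bit, and two vectors at angle `theta` agree on a bit with probability `1 - theta / pi`. The names `cosine_similarity` and `sketch` are hypothetical helpers.

```R
cosine_similarity <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Hypothetical helper: one bit per hyperplane, TRUE when the vector
# lies on the non-negative side of that hyperplane.
sketch <- function(x, hyperplanes) as.vector(hyperplanes %*% x) >= 0

set.seed(1)
dim_x <- 50
n_bits <- 2000
hyperplanes <- matrix(rnorm(n_bits * dim_x), nrow = n_bits)

x <- rnorm(dim_x)
y <- x + rnorm(dim_x, sd = 0.5)   # a noisy copy of x

# Share of bits on which the two sketches agree.
agreement <- mean(sketch(x, hyperplanes) == sketch(y, hyperplanes))

# Invert P(agree) = 1 - theta / pi to recover the angle, then the cosine.
estimated_cosine <- cos(pi * (1 - agreement))
exact_cosine <- cosine_similarity(x, y)
```

Splitting such bit sketches into bands and grouping rows with identical bands is what turns the estimate into a candidate-pair search.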