https://github.com/hscspring/sto
MinHash and LSH Based Store and Query.
dawg lsh-ensemble minhash nlp
- Host: GitHub
- URL: https://github.com/hscspring/sto
- Owner: hscspring
- License: mit
- Created: 2020-05-12T12:36:59.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-06-22T02:00:09.000Z (over 3 years ago)
- Last Synced: 2025-07-09T11:03:55.467Z (3 months ago)
- Topics: dawg, lsh-ensemble, minhash, nlp
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Similar Text run only Once
## Install
```bash
$ pip install -r requirements.txt
$ python setup.py install
```

## Usage
```python
from sto import Sto, Tokenizer

st = Sto(
    value_format='hh',          # default 'h': a single short integer value
    threshold=0.8,              # default 0.8: similarity threshold
    num_perm=128,               # default 128
    num_part=32,                # default 32
    tokenizer=Tokenizer('zh'),  # default Tokenizer('zh')
)

# Store the model results.
# text_list, model1 and model2 stand in for your own texts and models.
value_list = []
for text in text_list:
    # r1, r2, ... should be ints so that they are easy to store
    r1 = model1(text)
    r2 = model2(text)
    # must be a tuple of two short ints; this is what the format 'hh' means
    values = (r1, r2)
    value_list.append(values)
st.store(text_list, value_list)

# Query whether the given text is similar (or identical) to a stored text under LSH.
values = st.query(text)
```

Note:
- When adding texts in batches with `add`, only texts that are exactly identical are deduplicated; similarity is not used for deduplication.
- value format: [struct — Interpret bytes as packed binary data — Python 3.8.3rc1 documentation](https://docs.python.org/3/library/struct.html#format-strings)
- threshold, num perm and num part: [MinHash LSH Ensemble — datasketch 1.0.0 documentation](http://ekzhu.com/datasketch/lshensemble.html)

To use the Cython version of Ngram, compile it in the `ngram` directory:
```bash
$ python setup.py build_ext --inplace
```

To use cppjieba for word segmentation, install it directly:
```bash
$ pip install cppjieba
```
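
Returning to the `value_format` note above: the sketch below shows what a format string such as `'hh'` means in Python's standard `struct` module. It assumes that `sto` hands `value_format` to `struct` more or less unchanged, which the README's link suggests but does not state. The helpers `pack_values` and `unpack_values` are hypothetical names used only for illustration and are not part of `sto`.

```python
import struct

# 'hh' is the format used in the usage example: two signed short integers.
VALUE_FORMAT = "hh"


def pack_values(values, fmt=VALUE_FORMAT):
    """Pack a tuple of ints into bytes (hypothetical helper, not part of sto)."""
    return struct.pack(fmt, *values)


def unpack_values(data, fmt=VALUE_FORMAT):
    """Recover the tuple of ints from packed bytes (hypothetical helper)."""
    return struct.unpack(fmt, data)


# Round trip: the tuple that goes in is the tuple that comes out.
packed = pack_values((3, -7))
restored = unpack_values(packed)

# Number of bytes one stored value tuple occupies for this format.
size = struct.calcsize(VALUE_FORMAT)

# A short integer only covers -32768..32767; larger model outputs do not fit.
try:
    pack_values((40000, 0))
    overflowed = False
except struct.error:
    overflowed = True

print(restored, size, overflowed)
```

If the model outputs can exceed the short-integer range, a wider format code from the `struct` documentation (for example `'i'` per value) would be the natural choice, again assuming `sto` accepts any valid `struct` format string.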