https://github.com/foxbenjaminfox/simil
CLI for semantic string similarity
- Host: GitHub
- URL: https://github.com/foxbenjaminfox/simil
- Owner: foxbenjaminfox
- License: gpl-3.0
- Created: 2019-06-29T17:39:13.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-07-24T13:17:16.000Z (over 5 years ago)
- Last Synced: 2024-04-24T15:43:40.120Z (7 months ago)
- Topics: glove, machine-learning, python, spacy, string-similarity
- Language: Python
- Size: 23.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Semantic String Similarity CLI
`simil` is a command-line interface to [`spacy`](https://spacy.io)'s string similarity engine. It uses the `en_vectors_web_lg` dataset to compare strings for their English semantic similarity. Given two words, phrases, or sentences, `simil` will tell you how similar their meanings are.
## Installation
First install `simil` itself:
```bash
$ pip3 install --user -U simil
```
Now install one of spaCy's models that includes word vectors:
```bash
$ python3 -m spacy download en_vectors_web_lg
```
You can choose between `en_vectors_web_lg`, `en_core_web_lg`, and `en_core_web_md`. (`en_core_web_sm` doesn't include word vectors at all, and can't be used with `simil`.) `simil` will use the largest model that you have installed, preferring the `vectors` model over a `core` model.
I suggest using the large vectors model (`en_vectors_web_lg`), but you might want to use a smaller model in order to save on disk space or memory usage.
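The model preference described above can be sketched in a few lines. This is an illustration only, not `simil`'s actual code: the names `PREFERRED_MODELS`, `is_installed`, and `pick_model` are hypothetical. It relies on the fact that spaCy models are installed as ordinary Python packages, so they can be looked up by name.

```python
import importlib.util

# Preference order from the text above: the vectors-only model first,
# then the core models from largest to smallest.
PREFERRED_MODELS = ("en_vectors_web_lg", "en_core_web_lg", "en_core_web_md")


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def pick_model(installed=is_installed):
    """Return the first preferred model that is installed, or None.

    `installed` is a predicate taking a package name; it is a parameter
    so the choice can be checked without any model being present.
    """
    for name in PREFERRED_MODELS:
        if installed(name):
            return name
    return None
```

Calling `pick_model()` with no arguments checks the current Python environment and returns `None` when none of the three models is available.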
## Usage:
```bash
$ sim first_file.txt second_file.txt # compare two files
$ sim -s "first string" "second string" # compare two strings
```
The output is a number between 0 and 1, representing how similar the two strings are.
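To give a feel for where such a score comes from: spaCy's default document similarity is the cosine similarity between the averaged word vectors of the two texts. The sketch below reproduces that calculation with the standard library only. The three-dimensional vectors in `TOY_VECTORS` are invented for this example and have nothing to do with the real GloVe vectors; whether `simil` computes its score in exactly this way is an assumption.

```python
import math

# Toy "word vectors", made up for illustration. Real models use
# hundreds of dimensions and a very large vocabulary.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_vector(text):
    """Average the toy vectors of the whitespace-separated words in text."""
    return average([TOY_VECTORS[word] for word in text.split()])


def similarity(first, second):
    """Cosine similarity between the averaged vectors of two texts."""
    return cosine_similarity(text_vector(first), text_vector(second))
```

With these toy vectors, `similarity("cat", "dog")` scores much higher than `similarity("cat", "car")`, because the first pair points in nearly the same direction and the second pair is orthogonal.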
## Details:
`simil` uses spaCy's word vector models trained with [`GloVe`](https://nlp.stanford.edu/projects/glove/), such as [`en_vectors_web_lg`](https://spacy.io/models/en#en_vectors_web_lg).
Such a model can be large, which makes for long startup times. So `simil` spins off a background process to hold the model, and talks to that process as a client. This means that if you run `simil` several times in a row, only the first run is slow.
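The client-server arrangement described above can be sketched as follows. This is not `simil`'s implementation: `load_model`, `serve`, and `query` are hypothetical names, a thread stands in for the background process so the example fits in one file, and the "model" is a plain word-overlap score rather than a spaCy model. The point is only that the expensive load happens once while each request stays cheap.

```python
import threading
from multiprocessing.connection import Client, Listener


def load_model():
    """Stand-in for the slow model load; returns a scoring function.

    The score is plain word overlap (Jaccard), NOT semantic similarity.
    It exists only so the example runs without spaCy.
    """
    def score(first, second):
        a, b = set(first.split()), set(second.split())
        union = a | b
        return len(a & b) / len(union) if union else 0.0
    return score


def serve(listener, max_requests):
    """Load the model once, then answer max_requests comparisons."""
    model = load_model()  # the expensive step, paid a single time
    for _ in range(max_requests):
        with listener.accept() as conn:
            first, second = conn.recv()
            conn.send(model(first, second))


def query(address, first, second):
    """Client side: send two strings, receive their score."""
    with Client(address) as conn:
        conn.send((first, second))
        return conn.recv()


if __name__ == "__main__":
    # Port 0 asks the OS for any free port.
    with Listener(("127.0.0.1", 0)) as listener:
        server = threading.Thread(target=serve, args=(listener, 2))
        server.start()
        print(query(listener.address, "big red dog", "small red dog"))
        print(query(listener.address, "hello world", "hello world"))
        server.join()
```

A real tool would also need the client to start the server when none is running yet; that step is left out here.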
This background process does take up a fair bit of memory, typically around 2GB (for the `en_vectors_web_lg` model). After 10 minutes of inactivity it will automatically be killed, in order not to take up memory indefinitely. You can change the length of this timeout with the `--timeout` flag.
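One simple way to get this kind of idle shutdown is to put a timeout on the server's `accept()` call, so that the loop ends when no client connects for the configured period. The sketch below assumes that approach; how `simil` actually implements its timeout is not stated here. `serve_until_idle` and `DEFAULT_IDLE_SECONDS` are hypothetical names, and the request handling is a placeholder echo.

```python
import socket
import threading

DEFAULT_IDLE_SECONDS = 10 * 60  # mirrors the 10-minute default described above


def serve_until_idle(server, idle_seconds=DEFAULT_IDLE_SECONDS):
    """Answer requests until none arrives for idle_seconds.

    Returns the number of requests handled before shutting down.
    """
    server.settimeout(idle_seconds)  # applies to every accept() call
    handled = 0
    while True:
        try:
            conn, _ = server.accept()  # each call waits up to idle_seconds
        except socket.timeout:
            return handled  # idle for too long: stop and free the memory
        with conn:
            # Placeholder work: echo the request back in upper case.
            conn.sendall(conn.recv(1024).upper())
        handled += 1


if __name__ == "__main__":
    with socket.create_server(("127.0.0.1", 0)) as server:
        worker = threading.Thread(target=serve_until_idle, args=(server, 0.5))
        worker.start()
        with socket.create_connection(server.getsockname()) as client:
            client.sendall(b"ping")
            print(client.recv(1024))
        worker.join()  # returns about half a second after the last request
```

Because the timeout applies to each `accept()` call separately, every new request restarts the idle clock.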