Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dhchenx/semantic-kit
A toolkit to estimate semantic similarity and relatedness between words/sentences
semantic-analysis semantic-relatedness semantic-similarity text-analysis
- Host: GitHub
- URL: https://github.com/dhchenx/semantic-kit
- Owner: dhchenx
- License: MIT
- Created: 2021-12-08T16:56:59.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-04T18:19:42.000Z (almost 3 years ago)
- Last Synced: 2024-11-29T07:41:31.351Z (about 2 months ago)
- Topics: semantic-analysis, semantic-relatedness, semantic-similarity, text-analysis
- Language: Python
- Homepage:
- Size: 94.6 MB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Semantic Similarity and Relatedness Toolkit
A toolkit to estimate semantic similarity and relatedness between two words/sentences.
## Installation
```bash
pip install semantic-kit
```

## Functions
1. Lesk algorithm and an improved variant
2. Similarity measures including WordNet-based similarity, word2vec similarity, LDA, and GoogleNews-based methods
3. Distance measures such as Jaccard, Sørensen, and Levenshtein, plus improved variants (a plain-Python sketch of the set-based distances appears just before the Weighted Levenshtein example)
4. Open Multilingual WordNet lookups to generate related keywords across multiple languages (a sketch appears after the Weighted Levenshtein example)

## Examples
### Lesk Algorithm
```python
from semantickit.relatedness.lesk import lesk
from semantickit.relatedness.lesk_max_overlap import lesk_max_overlap

# Disambiguate 'bank' (as a noun) in context with both Lesk variants.
sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
m1, s1 = lesk(sent, 'bank', 'n')
m2, s2 = lesk_max_overlap(sent, 'bank', 'n')
print(m1, s1)
print(m2, s2)
```
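For comparison, NLTK ships its own simple Lesk implementation with a similar interface. A minimal sketch, assuming NLTK and its WordNet data (`nltk.download('wordnet')`) are installed:

```python
from nltk.wsd import lesk

# Disambiguate 'bank' in the same context with NLTK's reference Lesk.
sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
sense = lesk(sent, 'bank', 'n')  # returns an NLTK Synset (or None)
print(sense, '-', sense.definition() if sense else 'no match')
```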
### WordNet-based Similarity
```python
from semantickit.similarity.wordnet_similarity import wordnet_similarity_all
print(wordnet_similarity_all("dog.n.1","cat.n.1"))
```
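`wordnet_similarity_all` presumably aggregates several WordNet similarity measures; the individual measures are available directly in NLTK if you want to compare. A minimal sketch (note that NLTK itself expects zero-padded synset names such as `dog.n.01`):

```python
from nltk.corpus import wordnet as wn

dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
print('path:', dog.path_similarity(cat))  # shortest-path measure in [0, 1]
print('wup: ', dog.wup_similarity(cat))   # Wu-Palmer, depth-based
print('lch: ', dog.lch_similarity(cat))   # Leacock-Chodorow, log-scaled
```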
### Corpus-based Similarity
```python
from semantickit.similarity.word2vec_similarity import build_model, similarity_model

# Build a word2vec similarity model from a local text corpus.
build_model(data_path="text8", save_path="wiki_model")

# Estimate similarity between two words using the saved model.
sim = similarity_model("wiki_model", "france", "spain")
print("word2vec similarity:", sim)
```
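`build_model` and `similarity_model` appear to wrap a word2vec training pipeline; the equivalent steps with gensim directly look like this (a sketch, assuming gensim is installed and the `text8` corpus file is in the working directory):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Train a word2vec model on the text8 dump and save it.
model = Word2Vec(Text8Corpus('text8'))
model.save('wiki_model')

# Reload the model and query cosine similarity between two words.
model = Word2Vec.load('wiki_model')
print('word2vec similarity:', model.wv.similarity('france', 'spain'))
```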
### Pre-trained model-based Similarity
```python
from semantickit.similarity.googlenews_similarity import googlenews_similarity
# Path to the pre-trained GoogleNews word2vec binary.
data_path = r'GoogleNews-vectors-negative300.bin'
sim = googlenews_similarity(data_path, 'human', 'people')
print(sim)
```
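`googlenews_similarity` presumably loads the binary file with gensim's `KeyedVectors`; doing that directly is a one-liner apart from the download (a sketch, assuming the file path below):

```python
from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews vectors (slow; the file is several GB).
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                       binary=True)
print(kv.similarity('human', 'people'))  # cosine similarity of the two vectors
```

### Jaccard and Sørensen Distances

The README shows no example for the set-based distances listed under Functions. Their textbook definitions are easy to sketch in plain Python (illustrative only; `semantic-kit`'s own API may differ):

```python
def jaccard_distance(a, b):
    """1 - |A∩B| / |A∪B| over token sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def sorensen_distance(a, b):
    """1 - 2|A∩B| / (|A| + |B|) (Sørensen-Dice) over token sets."""
    a, b = set(a), set(b)
    return 1 - 2 * len(a & b) / (len(a) + len(b))

x = 'the cat sat on the mat'.split()
y = 'the cat lay on the rug'.split()
print(jaccard_distance(x, y), sorensen_distance(x, y))
```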
### Weighted Levenshtein
```python
from semantickit.distance.n_gram.train_ngram import TrainNgram
from semantickit.distance.weighted_levenshtein import weighted_levenshtein, Build_TFIDF

# Train the n-gram language model used for weighting.
train_data_path = 'wlev/icd10_train.txt'
wordict_path = 'wlev/word_dict.model'
transdict_path = 'wlev/trans_dict.model'
words_path = 'wlev/dict_words.txt'
trainer = TrainNgram()
trainer.train(train_data_path, wordict_path, transdict_path)

# Build the TF-IDF file for the word list.
Build_TFIDF(train_data_path, words_path)

# Estimate the weighted Levenshtein distance between two Chinese ICD-10
# disease names ('benign tumor of connective tissue of the neck' vs.
# 'benign tumor of ear cartilage').
s0 = '颈结缔组织良性肿瘤'
s1 = '耳软骨良性肿瘤'
result = weighted_levenshtein(s0, s1,
                              word_dict_path=wordict_path,
                              trans_dict_path=transdict_path,
                              data_path=train_data_path,
                              words_path=words_path)
print(result)
```
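### Multilingual WordNet Keywords

The fourth item under Functions, generating related keywords across languages via Open Multilingual WordNet, has no example in the README. A minimal sketch of the underlying idea using NLTK's OMW interface (assuming `nltk.download('wordnet')` and `nltk.download('omw-1.4')` have been run; `semantic-kit`'s own API may differ):

```python
from nltk.corpus import wordnet as wn

# Look up noun synsets for an English word, then list their lemmas
# in other languages (ISO 639-3 codes, e.g. 'jpn', 'fra').
for synset in wn.synsets('dog', pos='n')[:2]:
    print(synset.name(), synset.lemma_names('jpn'), synset.lemma_names('fra'))
```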
## License
The `Semantic-Kit` project is provided by [Donghua Chen](https://github.com/dhchenx).