- Host: GitHub
- URL: https://github.com/goldpulpy/pysentence-similarity
- Owner: goldpulpy
- License: mit
- Created: 2024-10-05T00:45:16.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-01-07T01:49:46.000Z (4 months ago)
- Last Synced: 2025-04-14T00:54:48.536Z (16 days ago)
- Topics: neural-network, package, pip, sentence-embeddings, sentence-similarity, sentence-similarity-score, similarity, similarity-score, similarity-search
- Language: Python
- Homepage: https://pypi.org/project/pysentence-similarity/
- Size: 60.5 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PySentence-Similarity 😊
## Information
**PySentence-Similarity** is a tool for measuring how similar input sentences are to a base sentence, expressed as a percentage 📊. It compares the semantic content of each input sentence to the base sentence and returns a score reflecting how closely related they are. This is useful for various natural language processing tasks such as clustering similar texts 📚, paraphrase detection 🔍, and textual entailment 📈.
The models were converted to ONNX format to optimize and speed up inference. Converting models to ONNX enables cross-platform compatibility and optimized hardware acceleration, making it more efficient for large-scale or real-world applications 🚀.
- **High accuracy:** Utilizes a robust Transformer-based architecture, providing high accuracy in semantic similarity calculations 🔬.
- **Cross-platform support:** The ONNX format provides seamless integration across platforms, making it easy to deploy across environments 🌐.
- **Scalability:** Efficient processing can handle large datasets, making it suitable for enterprise-level applications 📈.
- **Real-time processing:** Optimized for fast inference, it can be used in real-world applications without significant latency ⏱️.
- **Flexible:** Easily adaptable to specific use cases through customization or integration with additional models or features 🛠️.
- **Low resource consumption:** The model is designed to operate efficiently, reducing memory and CPU/GPU requirements, making it ideal for resource-constrained environments ⚡.
- **Fast and user-friendly:** The library offers high performance and an intuitive interface, allowing users to quickly and easily integrate it into their projects 🚀.

## Installation 📦
- **Requirements:** Python 3.8 or higher.
```bash
# install from PyPI
pip install pysentence-similarity

# install from GitHub
pip install git+https://github.com/goldpulpy/pysentence-similarity.git
```

## Supported models 🤝
You don't need to download anything; the package itself will download the model and its tokenizer from a special HF [repository](https://huggingface.co/goldpulpy/pysentence-similarity).
Below are the models currently added to the special repository, including their file size and a link to the source.
| Model | Parameters | FP32 | FP16 | INT8 | Source link |
| ------------------------------------- | ---------- | ------ | ----- | ----- | ------------------------------------------------------------------------------------------- |
| paraphrase-albert-small-v2 | 11.7M | 45MB | 22MB | 38MB | [HF](https://huggingface.co/sentence-transformers/paraphrase-albert-small-v2) 🤗 |
| all-MiniLM-L6-v2 | 22.7M | 90MB | 45MB | 23MB | [HF](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) 🤗 |
| paraphrase-MiniLM-L6-v2 | 22.7M | 90MB | 45MB | 23MB | [HF](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) 🤗 |
| multi-qa-MiniLM-L6-cos-v1 | 22.7M | 90MB | 45MB | 23MB | [HF](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) 🤗 |
| msmarco-MiniLM-L-6-v3 | 22.7M | 90MB | 45MB | 23MB | [HF](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L-6-v3) 🤗 |
| all-MiniLM-L12-v2 | 33.4M | 127MB | 65MB | 32MB | [HF](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) 🤗 |
| gte-small | 33.4M | 127MB | 65MB | 32MB | [HF](https://huggingface.co/thenlper/gte-small) 🤗 |
| all-distilroberta-v1 | 82.1M | 313MB | 157MB | 79MB | [HF](https://huggingface.co/sentence-transformers/all-distilroberta-v1) 🤗 |
| all-mpnet-base-v2 | 109M | 418MB | 209MB | 105MB | [HF](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) 🤗 |
| multi-qa-mpnet-base-dot-v1 | 109M | 418MB | 209MB | 105MB | [HF](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) 🤗 |
| paraphrase-multilingual-MiniLM-L12-v2 | 118M | 449MB | 225MB | 113MB | [HF](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) 🤗 |
| text2vec-base-multilingual | 118M | 449MB | 225MB | 113MB | [HF](https://huggingface.co/shibing624/text2vec-base-multilingual) 🤗 |
| distiluse-base-multilingual-cased-v1 | 135M | 514MB | 257MB | 129MB | [HF](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) 🤗 |
| paraphrase-multilingual-mpnet-base-v2 | 278M | 1.04GB | 530MB | 266MB | [HF](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) 🤗 |
| gte-multilingual-base | 305M | 1.17GB | 599MB | 324MB | [HF](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) 🤗 |
| gte-large | 335M | 1.25GB | 640MB | 321MB | [HF](https://huggingface.co/thenlper/gte-large) 🤗 |
| all-roberta-large-v1 | 355M | 1.32GB | 678MB | 340MB | [HF](https://huggingface.co/sentence-transformers/all-roberta-large-v1) 🤗 |
| LaBSE                                 | 470M       | 1.75GB | 898MB | 450MB | [HF](https://huggingface.co/sentence-transformers/LaBSE) 🤗                                  |

**PySentence-Similarity** supports `FP32`, `FP16`, and `INT8` dtypes.
- **FP32:** 32-bit floating-point format that provides high precision and a wide range of values.
- **FP16:** 16-bit floating-point format, reducing memory consumption and computation time, with minimal loss of precision (typically less than 1%).
- **INT8:** 8-bit integer quantized format that greatly reduces model size and speeds up inference, ideal for resource-constrained environments, with little loss of precision.

## Usage examples 📖
### Compute similarity score 📊
The similarity score expresses how similar each sentence is to the source sentence, as a fraction of 1 (0.75 = 75%). The default compute function is `cosine`.
You can use CUDA 12.x by passing `device='cuda'` to the `Model` constructor; the default is `cpu`. If the requested device is not available, it automatically falls back to `cpu`.
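For example, a model can be created with an explicit precision and device (a minimal sketch; the `"int8"` dtype string and the `device` value simply follow the table and description above and are assumptions as written):

```python
from pysentence_similarity import Model

# Sketch: request an 8-bit quantized model on CUDA. Per the description
# above, the library falls back to CPU if CUDA 12.x is not available.
model = Model("all-MiniLM-L6-v2", dtype="int8", device="cuda")
```

A full end-to-end example: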
```python
from pysentence_similarity import Model
from pysentence_similarity.utils import compute_score

# Create an instance of the model all-MiniLM-L6-v2; the default dtype is fp32
model = Model("all-MiniLM-L6-v2", dtype="fp16")

sentences = [
"This is another test.",
"This is yet another test.",
"We are testing sentence similarity."
]

# Convert sentences to embeddings
# The default is to use mean_pooling as a pooling function
source_embedding = model.encode("This is a test.")
embeddings = model.encode(sentences, progress_bar=True)

# Compute similarity scores
# The rounding parameter allows us to round our float values
# with a default of 2, which means 2 decimal places.
compute_score(source_embedding, embeddings)
# Return: [0.86, 0.77, 0.48]
```

`compute_score` returns the scores in the same order in which the embeddings were encoded.
Let's print each sentence together with its score:
```python
# Compute similarity scores
scores = compute_score(source_embedding, embeddings)

for sentence, score in zip(sentences, scores):
    print(f"{sentence} ({score})")

# Output prints:
# This is another test. (0.86)
# This is yet another test. (0.77)
# We are testing sentence similarity. (0.48)
```

You can use any of the built-in compute functions: `cosine`, `euclidean`, `manhattan`, `jaccard`, `pearson`, `minkowski`, `hamming`, `kl_divergence`, `chebyshev`, `bregman`, or your own custom function:
```python
from pysentence_similarity.compute import euclidean

compute_score(source_embedding, embeddings, compute_function=euclidean)
# Return: [2.52, 3.28, 5.62]
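
# You can also pass your own metric as compute_function (a sketch: this
# assumes the callable receives two embedding vectors and returns a float).
import numpy as np

def dot_score(a, b):
    # Hypothetical custom metric: raw dot product of the two embeddings.
    return float(np.dot(a, b))

compute_score(source_embedding, embeddings, compute_function=dot_score)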
```

You can use `max_pooling`, `mean_pooling`, `min_pooling` or your custom pooling function:
```python
from pysentence_similarity.pooling import max_pooling

source_embedding = model.encode("This is a test.", pooling_function=max_pooling)
embeddings = model.encode(sentences, pooling_function=max_pooling)
...
```

### Search similar sentences 🔍
```python
from pysentence_similarity import Model
from pysentence_similarity.utils import search_similar

# Create an instance of the model
model = Model("all-MiniLM-L6-v2", dtype="fp16")

# Test text
sentences = [
"Hello my name is Bob.",
"I love to eat pizza.",
"We are testing sentence similarity.",
"Today is a sunny day.",
"London is the capital of England.",
"I am a student at Stanford University."
]

# Convert query sentence to embedding
query_embedding = model.encode("What's the capital of England?")

# Convert sentences to embeddings
embeddings = model.encode(sentences)

# Search similar sentences
similar = search_similar(
query_embedding=query_embedding,
sentences=sentences,
embeddings=embeddings,
top_k=3 # number of similar sentences to return
)

# Print similar sentences
for idx, (sentence, score) in enumerate(similar, start=1):
    print(f"{idx}: {sentence} ({score})")

# Output prints:
# 1: London is the capital of England. (0.81)
# 2: Hello my name is Bob. (0.06)
# 3: I love to eat pizza. (0.05)
```

Using a saved storage:
```python
from pysentence_similarity import Model, Storage
from pysentence_similarity.utils import search_similar

model = Model("all-MiniLM-L6-v2", dtype="fp16")
query_embedding = model.encode("What's the capital of England?")

storage = Storage.load("my_storage.h5")
similar = search_similar(
query_embedding=query_embedding,
storage=storage,
top_k=3
)
...
```

### Splitting ✂️
```python
from pysentence_similarity import Splitter

# Default split markers: '\n'
splitter = Splitter()

# If you want to split on specific characters:
splitter = Splitter(markers_to_split=["!", "?", "."], preserve_markers=True)

# Test text
text = "Hello world! How are you? I'm fine."

# Split from text
splitter.split_from_text(text)
# Return: ['Hello world!', 'How are you?', "I'm fine."]
```

The splitter currently supports the following sources: text, file, URL, CSV, and JSON.
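If the splitter follows the same pattern for the other sources, usage might look like this (a sketch only; `split_from_file` and `split_from_url` are hypothetical method names inferred from `split_from_text` — check the package for the exact API):

```python
from pysentence_similarity import Splitter

splitter = Splitter(markers_to_split=["!", "?", "."], preserve_markers=True)

# Hypothetical method names, assumed by analogy with split_from_text.
sentences_from_file = splitter.split_from_file("notes.txt")
sentences_from_url = splitter.split_from_url("https://example.com/article")
```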
### Storage 💾
The storage allows you to save and link sentences and their embeddings for easy access, so you don't need to encode a large corpus of text every time. The storage also enables similarity searching.
The storage keeps both the **sentences** themselves and their **embeddings**.
```python
from pysentence_similarity import Model, Storage

# Create an instance of the model
model = Model("all-MiniLM-L6-v2", dtype="fp16")

# Create an instance of the storage
storage = Storage()
sentences = [
"This is another test.",
"This is yet another test.",
"We are testing sentence similarity."
]

# Convert sentences to embeddings
embeddings = model.encode(sentences)

# Add sentences and their embeddings
storage.add(sentences, embeddings)

# Save the storage
storage.save("my_storage.h5")
```

Load from the storage:
```python
from pysentence_similarity import Model, Storage
from pysentence_similarity.utils import compute_score

# Create an instance of the model and storage
model = Model("all-MiniLM-L6-v2", dtype="fp16")
storage = Storage.load("my_storage.h5")

# Convert sentence to embedding
source_embedding = model.encode("This is a test.")

# Compute similarity scores with the storage
compute_score(source_embedding, storage)
# Return: [0.86, 0.77, 0.48]
```

## License 📜
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
Created by goldpulpy with ❤️