# Hugging Face Transformer Based Embeddings and Cosine Similarity Search Example


This example demonstrates how to transform text into embeddings via the Hugging Face sentence-transformers library. Once the text is represented as embeddings, cosine similarity search can determine which embeddings are most similar to a search query.

## What are Text Embeddings

Embeddings are numerical representations of text data that capture the semantic meaning and relationships between words, sentences, or documents. These numerical representations (vectors in a high-dimensional space) of the text data (sentences, paragraphs, documents ...) are constructed so that words or texts with similar meanings/contexts sit closer to each other in the embedding space.

There are several methods to generate text embeddings:

1. **Word Embeddings** represent individual words as vectors. Popular techniques include [Word2Vec](https://www.tensorflow.org/text/tutorials/word2vec), [GloVe](https://nlp.stanford.edu/projects/glove/) (Global Vectors for Word Representation), and [FastText](https://fasttext.cc/)

2. **Sentence and Document Embeddings** aim to represent the meaning of entire sentences or documents. Sentence/document embeddings are generated using methods like averaging the word embeddings (see the sketch after this list), using recurrent neural networks (RNNs) or convolutional neural networks (CNNs), or, more recently, using transformer based models

3. **Transformer Based Embeddings** come from models such as [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) and [GPT](https://en.wikipedia.org/wiki/GPT-3), which can create contextual embeddings. These embeddings consider the surrounding context of each word in a sentence and can result in richer, more nuanced representations
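
As a rough illustration of the averaging approach from item 2, here is a minimal sketch with made-up three-dimensional word vectors (real Word2Vec or GloVe vectors typically have hundreds of dimensions):

```
import numpy as np

# Toy word vectors, invented for illustration; real models learn these from large corpora
word_vectors = {
    "good": np.array([0.8, 0.1, 0.3]),
    "movie": np.array([0.2, 0.9, 0.4]),
}

# A simple sentence embedding: the element-wise mean of the sentence's word vectors
sentence = ["good", "movie"]
sentence_embedding = np.mean([word_vectors[word] for word in sentence], axis=0)
print(sentence_embedding)  # [0.5  0.5  0.35]
```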

The example in this repository uses a transformer based approach for converting text to embeddings, specifically the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model provided by Hugging Face.
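
As a quick sanity check of what the model produces (a minimal sketch; all-MiniLM-L6-v2 maps each input text to a 384-dimensional vector):

```
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode a single sentence and inspect the embedding dimensionality
vector = model.encode("hello world")
print(vector.shape)  # (384,)
```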

## What is Cosine Similarity Search

Cosine similarity measures the cosine of the angle between two vectors, which indicates how similar their directions are regardless of their magnitudes. Explained simply, cosine similarity quantifies how similar two vectors are.
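
Concretely, for two vectors a and b, cosine similarity is the dot product a · b divided by the product of their magnitudes ||a|| ||b||. A minimal NumPy sketch (illustrative only, separate from the example below):

```
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

print(cosine_similarity(a, b))  # ~1.0: identical direction, magnitude is ignored
```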

In search applications, if we represent both our search query and our searchable text as vectors, we can determine which searchable text is most similar to the search query.

## Example in Python

### Install sentence_transformers library

```pip install sentence_transformers``` or, in a Jupyter notebook, run ```!pip install sentence_transformers```

To confirm that the installation was successful, run the following import

```from sentence_transformers import SentenceTransformer, util```

### Create corpus of documents

In this example we have a corpus of 4 documents:

```
corpus = [
    "does this work with xbox?",
    "Does the M70 work with Android phones?",
    "does this work with iphone?",
    "Can this work with an xbox "
]
```

### Download transformer model and encode documents
```
model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(corpus)
```
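
`embeddings` is a NumPy array with one row per corpus document; with all-MiniLM-L6-v2 its shape is `(4, 384)`.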

### Encode a search query
```
search_query = "M70 Android"
search_query_encode = model.encode(search_query)
```

### Compute cosine similarity between the search query and all documents
```
cos_sim = util.cos_sim(search_query_encode, embeddings)
```
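
`cos_sim` is a PyTorch tensor of shape `[1, 4]`, containing one similarity score per corpus document.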

### Return the most similar document to the search query
```
similarity_scores = []
for i in range(cos_sim.size(1)):
    similarity_scores.append(cos_sim[0][i].item())

print(corpus[similarity_scores.index(max(similarity_scores))])
```

This prints the corpus document most similar to the search query:

```
Does the M70 work with Android phones?
```
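
As a side note, the loop above can be condensed with `torch.argmax` (an equivalent sketch; PyTorch is installed as a dependency of sentence_transformers):

```
import torch

# Index of the highest similarity score in the 1 x 4 tensor
best = torch.argmax(cos_sim).item()
print(corpus[best])  # Does the M70 work with Android phones?
```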