https://github.com/manav54321/models

chatmodel cosine-similarity embedding-models langchain llms streamlit transformers

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/manav54321/models
Owner: Manav54321
License: mit
Created: 2025-06-26T10:44:10.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-03T09:51:42.000Z (about 1 year ago)
Last Synced: 2025-09-01T20:20:47.368Z (10 months ago)
Topics: chatmodel, cosine-similarity, embedding-models, langchain, llms, streamlit, transformers
Language: Python
Homepage:
Size: 14.3 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Document Similarity Search with Sentence Transformers

This repository presents a minimal and effective implementation of semantic document retrieval using vector embeddings and cosine similarity. The system is designed to identify the document that best matches the meaning of a user's input query.

---

## Overview

Traditional search methods rely on keyword overlap. This method improves upon that by leveraging **pretrained transformer-based sentence embeddings**, allowing retrieval based on *semantic similarity* rather than lexical similarity.

The process involves:

* Embedding each document and query into a high-dimensional space.
* Computing cosine similarity between the query vector and document vectors.
* Returning the document with the highest semantic alignment.
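The three steps above can be sketched as a single function. This is a minimal illustration rather than the repository's code: `best_match` and its `encode` parameter are hypothetical names, and `encode` stands for any callable that maps a list of strings to a list of equal-length, non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query, documents, encode):
    """Return (index, document, score) for the document closest to the query.

    `encode` is an assumed callable: list of strings -> list of vectors.
    """
    doc_vectors = encode(documents)    # step 1: embed every document
    query_vector = encode([query])[0]  # step 1: embed the query
    # Step 2: cosine similarity between the query and each document
    scores = [cosine(query_vector, d) for d in doc_vectors]
    # Step 3: pick the index with the highest score
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, documents[index], scores[index]
```

With the `sentence-transformers` package installed, `encode` could plausibly be the `encode` method of a loaded `SentenceTransformer` model; passing the encoder in as an argument keeps the ranking logic independent of how the embeddings are produced.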

---

## Mathematical Foundation

### 1. Vector Representation

Each document `d_i` and the query `q` are passed through a transformer model `f`, which encodes the text into an embedding vector (the same symbols are reused for the resulting vectors):

d_i = f(d_i), q = f(q)

Here `f` maps raw text to ℝ^384 using a pre-trained model.

---

### 2. Cosine Similarity

Similarity between query and document embeddings is calculated as:

cos_sim(q, d_i) = (q · d_i) / (||q|| * ||d_i||)

This metric is the cosine of the angle between the two vectors and ranges from -1 to 1. Higher values indicate greater similarity in meaning, and the result does not depend on vector length.
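As a worked instance of the formula, take the toy vectors q = (1, 2, 2) and d = (2, 1, 2), which are illustrative values and not real embeddings. The dot product is 8 and both norms are 3, so the similarity is 8/9. The helper `cos_sim` below is a hypothetical pure-Python version that assumes non-zero vectors of equal length.

```python
import math


def cos_sim(q, d):
    """(q . d) / (||q|| * ||d||) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)


q = [1.0, 2.0, 2.0]
d = [2.0, 1.0, 2.0]
print(round(cos_sim(q, d), 4))  # 0.8889, i.e. 8 / (3 * 3)
```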

### 3. Ranking Mechanism

All cosine scores are computed, sorted, and the top document is returned:

```python
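# Pair each score with its document index, then sort ascending by score
sorted_scores = sorted(enumerate(scores), key=lambda x: x[1])
# The last entry is the document with the highest cosine similarity
document_index, score = sorted_scores[-1]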
```
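When more than one result is wanted, the same ranking idea extends to a top-k selection. The following is a hedged variation, not the repository's code: `top_k` is a hypothetical helper, and `scores` is assumed to be a flat list of floats in document order.

```python
import heapq


def top_k(scores, k=3):
    """Return up to k (document_index, score) pairs, best score first."""
    # nlargest avoids sorting the full list when k is small
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


scores = [0.12, 0.78, 0.40, 0.05]
print(top_k(scores, k=2))  # [(1, 0.78), (2, 0.4)]
```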

---

## Embedding Model

* **Model Name**: `sentence-transformers/all-MiniLM-L6-v2`
* **Architecture**: Distilled bi-encoder (Transformer-based)
* **Embedding Dimension**: 384
* **Provider**: Hugging Face Model Hub
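One way to load this model is through the `sentence-transformers` package. This is a sketch under that assumption: the repository may load the model differently (for example through LangChain), and `MODEL_NAME`, `EXPECTED_DIM`, and `embed` are names introduced here for illustration.

```python
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
EXPECTED_DIM = 384  # embedding dimension listed above


def embed(texts):
    """Encode a list of strings into an array of shape (len(texts), 384)."""
    # Imported lazily so this module can be imported without the package;
    # assumes `pip install sentence-transformers` has been run.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)  # downloads weights on first use
    vectors = model.encode(texts)
    if vectors.shape[1] != EXPECTED_DIM:
        raise ValueError(f"unexpected embedding size: {vectors.shape[1]}")
    return vectors
```

Loading the model inside `embed` keeps the sketch short. An application that answers many queries would more likely load it once and reuse it.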

---

## Example Run

```python
query = "Kylie"
```

Returns:

```
query: Kylie
best_match: Kylie Jenner commands attention with her signature style, glamorous beauty, and trendsetting influence on social media.
similarity_score: 0.7823
```
![Screenshot 2025-07-03 at 3 18 51 PM](https://github.com/user-attachments/assets/0c48f091-cf57-44b9-800c-9b3f757cb3c2)

---

## Dependencies

Install the required libraries:

```bash
pip install -r requirements.txt
```

---

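## Run the Script

The command below uses an absolute path from the author's machine; substitute the location of the script in your own clone.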

```bash
python /Users/manavdesai/Desktop/GitHub/Langchain_Models/document_search_app.py.py
```

---
