https://github.com/syed007hassan/investment-search
This application implements an advanced company search and ranking system using a hybrid search approach that combines vector similarity search (using PgVector) and traditional full-text search in PostgreSQL.
- Host: GitHub
- URL: https://github.com/syed007hassan/investment-search
- Owner: Syed007Hassan
- Created: 2024-11-21T01:28:18.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-03-16T01:06:03.000Z (7 months ago)
- Last Synced: 2025-03-16T01:25:45.947Z (7 months ago)
- Topics: docker, fastapi, full-text-search, gpt-4o, hybrid-search, openai, pgvector, postgresql, react, redis, vector-search
- Language: Python
- Homepage:
- Size: 1.21 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Company Search & Ranking System with Hybrid Search
## Overview
This application implements an advanced company search and ranking system using a hybrid search approach that combines vector similarity search (using PgVector) and traditional full-text search in PostgreSQL. This dual approach ensures both semantic relevance and keyword accuracy in search results, making it particularly effective for company discovery and ranking.
## Description
The system leverages a sophisticated hybrid search architecture that:
- Uses OpenAI or Pinecone embeddings to convert company descriptions into vector representations
- Implements PostgreSQL's full-text search capabilities for keyword matching
- Combines both approaches with a weighted scoring system for optimal ranking
- Utilizes GPT-4o or Groq's LLaMA models for intelligent search result processing and summarization
- Utilizes Redis for caching frequently accessed data to improve performance

This hybrid approach provides more accurate and contextually relevant results than traditional keyword-only search systems.
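
The Redis caching step mentioned above follows a pattern commonly called cache-aside. The following is a minimal sketch of that pattern, not the project's actual code: `TTLCache` is a hypothetical in-memory stand-in for Redis so the example runs without a server, and the `search:` key scheme and `cached_search` helper are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for Redis: key -> (expiry time, value)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, List[int]]] = {}

    def get(self, key: str) -> Optional[List[int]]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired entries count as misses
            return None
        return value

    def set(self, key: str, value: List[int]) -> None:
        self._store[key] = (self.clock() + self.ttl_seconds, value)


def cached_search(query: str, cache: TTLCache,
                  search_fn: Callable[[str], List[int]]) -> List[int]:
    """Cache-aside lookup: return a cached result, or compute and store it."""
    key = "search:" + query.strip().lower()  # assumed key scheme
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = search_fn(query)  # e.g. the hybrid search, returning company IDs
    cache.set(key, result)
    return result
```

With a real Redis client, `get` and `set` would become network calls and the stored value would need serializing; the control flow would stay the same.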
## Technologies Used
- **Backend**: FastAPI
- **Database**: PostgreSQL with pgvector extension, Redis for caching
- **Vector Embeddings**: OpenAI API / Pinecone Inference API
- **LLM Processing**: OpenAI GPT-4o / Groq LLaMA models
- **Frontend**: React
- **Containerization**: Docker
- **ORM**: SQLAlchemy
## Key Features
- Hybrid search combining vector similarity and full-text search
- Real-time company ranking based on search relevance
- Company information management (add/search) and retrieval
- LLM powered tool calling
- Docker-based application deployment
- Flexible embedding options (OpenAI or Pinecone)
- Choice of LLM providers (OpenAI or Groq)
## Getting Started
### Prerequisites
- Docker and Docker Compose
- OpenAI API key (for OpenAI embeddings and LLM processing)
- Pinecone API key (for Pinecone embeddings)
- Groq API key (for Groq LLM models)
### Environment Setup
1. Create a `.env` file in the Backend directory with the following variables:
```bash
DATABASE_NAME=your_database_name
DATABASE_USER=your_database_user
DATABASE_PASSWORD=your_database_password
DATABASE_URL=localhost
DATABASE_PORT=5432
OPENAI_API_KEY=your_openai_api_key
PINECONE_API_KEY=your_pinecone_api_key
GROQ_API_KEY=your_groq_api_key
```
### Quick Start with Docker Compose
1. Clone the repository:
```bash
git clone https://github.com/Syed007Hassan/Investment-Search.git
cd Investment-Search/Backend
```
2. Build and start the containers:
```bash
docker-compose up --build
```
3. Access the application:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
### Database Setup Options
The application provides several options for database setup:
1. **Reset Database and Load Sample Data**:
```yaml
command: >
  bash -c "
  python scripts/reset_db.py &&
  python scripts/load_data.py &&
  uvicorn main:app --host 0.0.0.0 --port 8000 --reload
  "
```
2. **Load Sample Data Only**:
```yaml
command: >
  bash -c "
  python scripts/load_data.py &&
  uvicorn main:app --host 0.0.0.0 --port 8000 --reload
  "
```
3. **Start Application Only**:
```yaml
command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
## Technical Implementation
### Hybrid Search Architecture
The system implements a sophisticated hybrid search approach combining two powerful search methodologies:
1. **Vector Similarity Search (Semantic Search)**
- Uses either OpenAI's text-embedding-3-small model (1536-dimensional vectors) or Pinecone's multilingual-e5-large model (1024-dimensional vectors)
- Stores these vectors in PostgreSQL using the pgvector extension
- Enables semantic understanding of search queries
2. **Full-Text Search (Keyword Search)**
- Utilizes PostgreSQL's built-in full-text search capabilities
- Performs exact and partial keyword matching
### Embedding Options
The system supports two embedding providers:
1. **OpenAI Embeddings**
- Model: text-embedding-3-small
- Dimensions: 1536
- Advantages: High accuracy, well-suited for English language content
2. **Pinecone Embeddings**
- Model: multilingual-e5-large
- Dimensions: 1024
- Advantages: Multilingual support, open-source model, cost-effective

The system automatically falls back to the alternative provider if one fails, ensuring robustness.
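
The fallback behaviour described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's implementation: `embed_with_fallback` and the `Embedder` callable type are hypothetical names, and real embedders would wrap the OpenAI and Pinecone clients instead of the stubs shown in the usage lines.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: an embedder takes text and returns a vector.
Embedder = Callable[[str], List[float]]


def embed_with_fallback(
    text: str, providers: Sequence[Tuple[str, Embedder]]
) -> Tuple[str, List[float]]:
    """Try each provider in order and return (provider_name, vector)
    from the first one that succeeds."""
    errors = []
    for name, embed in providers:
        try:
            return name, embed(text)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all embedding providers failed: " + "; ".join(errors))


# Usage with stub embedders (assumptions standing in for the API clients).
def failing_openai(text: str) -> List[float]:
    raise ConnectionError("simulated timeout")


def stub_pinecone(text: str) -> List[float]:
    return [0.0] * 1024  # same length as multilingual-e5-large vectors


provider, vector = embed_with_fallback(
    "solar energy startups",
    [("openai", failing_openai), ("pinecone", stub_pinecone)],
)
```

Because the two models emit vectors of different sizes (1536 versus 1024), the sketch returns the provider name alongside the vector so the caller can tell which model produced it.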
### LLM Processing Options
For search result processing and summarization, the system supports:
1. **OpenAI GPT Models**
- Model: gpt-4o-2024-08-06
- Advantages: High accuracy, advanced reasoning capabilities
2. **Groq LLaMA Models**
- Model: llama-3.3-70b-versatile
- Advantages: Fast inference, cost-effective, open-source architecture
### Search & Ranking Process
Let's break down this hybrid search query step by step:
1. **First CTE (Common Table Expression) - Vector Search:**
```sql
WITH vector_search AS (
    SELECT id,
           RANK () OVER (ORDER BY embedding <=> :embedding) AS rank
    FROM "Company"
    ORDER BY embedding <=> :embedding
    LIMIT 20
)
```
- Creates a temporary result set named `vector_search`
- `embedding <=> :embedding`: Calculates cosine distance between stored embeddings and query embedding
- `RANK() OVER`: Assigns ranks based on similarity (lower distance = better rank)
- `LIMIT 20`: Takes top 20 most similar vectors
- Cosine distance ranges from 0 to 2, where 0 means the vectors point in the same direction and 2 means they point in opposite directions
2. **Second CTE - Full-text Search:**
```sql
fulltext_search AS (
    SELECT id,
           RANK () OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC) AS rank
    FROM "Company",
         plainto_tsquery('english', :query) query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
    LIMIT 20
)
```
- Creates another temporary result set named `fulltext_search`
- `to_tsvector('english', content)`: Converts content to searchable tokens
- `plainto_tsquery('english', :query)`: Converts search query to search terms
- `@@`: Text search match operator
- `ts_rank_cd`: Calculates text search relevancy score (higher score means better match)
- `LIMIT 20`: Takes top 20 best text matches
3. **Final Combined Query:**
```sql
SELECT
    COALESCE(vector_search.id, fulltext_search.id) AS id,
    COALESCE(1.0 / (:k + vector_search.rank), 0.0) +
    COALESCE(1.0 / (:k + fulltext_search.rank), 0.0) AS score
FROM vector_search
FULL OUTER JOIN fulltext_search ON vector_search.id = fulltext_search.id
ORDER BY score DESC
LIMIT 20
```
- `FULL OUTER JOIN`: Combines results from both searches, keeping all matches from either
- `COALESCE` for IDs: Ensures we capture matches from either search method
- `COALESCE` for scoring: Handles cases where an item only matches one search type (defaults to 0)
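- Score calculation is Reciprocal Rank Fusion (RRF): each search contributes 1/(k + rank), with k=60 as a smoothing constant that:
- Keeps the top few ranks from dominating the combined score
- Puts both searches on a comparable scale, since only ranks (not raw distances or `ts_rank_cd` values) enter the score
- Reduces the impact of small rank differences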
- `ORDER BY score DESC`: Ranks final results by combined score
- `LIMIT 20`: Returns top 20 combined results

**Ranking Process:**
1. Vector ranking:
- Lower cosine distance = better rank
- Score = 1/(60 + rank)
- Example: Rank 1 = 1/61 ≈ 0.0164
- Lower distance is better because:
- Cosine distance measures how far apart two vectors are in high-dimensional space
- Distance of 0: Vectors are identical (perfect semantic match)
- Distance of 1: Vectors are perpendicular (unrelated content)
- Distance of 2: Vectors point in opposite directions (opposite meaning)
- Therefore, smaller distances indicate closer semantic similarity
2. Text ranking:
- Higher ts_rank_cd = better rank
- Score = 1/(60 + rank)
- Example: Rank 2 = 1/62 ≈ 0.0161
- Higher ts_rank_cd is better because:
- It counts the number of matching terms
- Considers term frequency (how often terms appear)
- Weighs term proximity (how close terms are to each other)
- Accounts for term importance in the document
- Therefore, more matches and better quality matches result in higher scores
3. Final ranking:
- Combined score = vector_score + text_score
- Higher combined score = better overall match
- Normalization ensures fair combination despite different scoring scales
### Example
Consider the following example to illustrate the ranking process:
- Item A: vector_rank=1, text_rank=2
- Score = 1/61 + 1/62 ≈ 0.0325
- Item B: vector_rank=5, text_rank=1
- Score = 1/65 + 1/61 ≈ 0.0318
- Result: Item A ranks higher than Item B

This hybrid approach ensures that results are ranked considering both semantic similarity (vectors) and keyword relevance (text), providing a more comprehensive search result.
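
The rank combination performed by the final SQL query can also be reproduced in plain Python, which makes the arithmetic easy to check by hand. This is a sketch mirroring the query's logic rather than code from the repository: `rrf_fuse` is a hypothetical helper, and the tie-break on ID is an assumption added only to make the ordering deterministic.

```python
from typing import Dict, List, Tuple


def rrf_fuse(
    vector_ranks: Dict[str, int],
    text_ranks: Dict[str, int],
    k: int = 60,
    limit: int = 20,
) -> List[Tuple[str, float]]:
    """Combine two rank tables the way the final query does:
    score = 1/(k + vector_rank) + 1/(k + text_rank), missing side = 0."""
    ids = set(vector_ranks) | set(text_ranks)  # plays the role of FULL OUTER JOIN
    scores: Dict[str, float] = {}
    for item_id in ids:
        score = 0.0
        if item_id in vector_ranks:
            score += 1.0 / (k + vector_ranks[item_id])
        if item_id in text_ranks:
            score += 1.0 / (k + text_ranks[item_id])
        scores[item_id] = score
    # Highest score first; ties broken by ID (an assumption, not in the SQL).
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:limit]


# Items A and B from the example above, plus C, which only the vector search found.
results = rrf_fuse({"A": 1, "B": 5, "C": 2}, {"A": 2, "B": 1})
for item_id, score in results:
    print(f"{item_id}: {score:.4f}")
```

Item A comes out ahead of Item B, and Item C, found by only one search, trails both because it receives a single reciprocal-rank term.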
### Demo
https://github.com/user-attachments/assets/bce8fc3b-45ef-4ae7-b94e-aad6b4dcc089
## OpenShift Deployment
### Deployment Steps
1. **Prepare OpenShift Resources**
- Convert Docker Compose configuration to OpenShift compatible resources using Kompose
- Ensure all required images are accessible to OpenShift
2. **Configure Storage**
- Set up persistent volumes for PostgreSQL database
- Set up persistent volumes for Redis cache
- Configure volume claims for both databases
3. **Configure Environment**
- Create secrets for sensitive data (API keys, database credentials)
- Create configmaps for application configuration
- Set up network policies if required
4. **Deploy Components**
- Deploy PostgreSQL database with pgvector extension
- Deploy Redis cache service
- Deploy backend FastAPI application
- Deploy frontend React application
5. **Configure Access**
- Create routes for frontend and backend services
- Configure TLS/SSL if required
- Set up any required network policies

Note: Ensure all components have appropriate resource limits and health checks configured.
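
When secrets and configmaps are injected into the backend as environment variables, a startup check can catch a missing value before the application begins serving traffic. The sketch below assumes the variable names from the Environment Setup section; `load_settings` and the shape of its return value are hypothetical and not taken from the repository.

```python
from typing import Dict, List, Mapping, Union
from urllib.parse import quote

REQUIRED_DB_VARS = (
    "DATABASE_NAME",
    "DATABASE_USER",
    "DATABASE_PASSWORD",
    "DATABASE_URL",
    "DATABASE_PORT",
)
API_KEY_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "GROQ_API_KEY")


def load_settings(env: Mapping[str, str]) -> Dict[str, Union[str, List[str]]]:
    """Validate required variables and build a PostgreSQL connection string."""
    missing = [name for name in REQUIRED_DB_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # In the sample .env, DATABASE_URL holds the host name (e.g. "localhost").
    dsn = "postgresql://{user}:{password}@{host}:{port}/{name}".format(
        user=quote(env["DATABASE_USER"], safe=""),
        password=quote(env["DATABASE_PASSWORD"], safe=""),  # escape @, /, etc.
        host=env["DATABASE_URL"],
        port=int(env["DATABASE_PORT"]),
        name=env["DATABASE_NAME"],
    )
    # API keys are optional here; which ones are needed depends on the
    # embedding and LLM providers in use.
    configured = [name for name in API_KEY_VARS if env.get(name)]
    return {"dsn": dsn, "configured_api_keys": configured}
```

Calling `load_settings(os.environ)` at startup would then fail fast with a message listing exactly which variables are absent.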