https://github.com/rahulunair/seekvec
Seekvec - A lightning-fast similarity search engine for embedding vectors.
hnsw lsh vector-search
- Host: GitHub
- URL: https://github.com/rahulunair/seekvec
- Owner: rahulunair
- Created: 2024-11-03T21:14:01.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-11-03T21:33:04.000Z (11 months ago)
- Last Synced: 2025-01-19T21:48:22.961Z (9 months ago)
- Topics: hnsw, lsh, vector-search
- Language: Python
- Homepage:
- Size: 34.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# seekvec
A lightning-fast similarity search engine for embedding vectors. Build efficient search capabilities without heavyweight vector databases. Perfect for when you want something light and fast!

## Purpose
Ever felt vector databases were overkill for your embedding search needs? You're not alone! This project shows how to implement efficient similarity search when you:
- Have up to millions of embeddings
- Need blazing-fast approximate nearest neighbor search
- Want a lightweight solution (no heavy dependencies!)
- Require high performance with minimal resource usage

## Features
### Multiple Implementations
- Rust LSH implementation (optimized for speed)
- Python LSH implementation (easy to use)
- Python HNSW implementation (for comparison)
- Embedding generation utilities

### Smart Optimizations
- Bloom filters for super-efficient candidate filtering
- Multi-probe LSH for better recall without the overhead
- Adaptive early termination (because why wait?)
- Angular LSH optimized for cosine similarity
- Smart data structures for zippy retrieval

## Installation
### Rust Implementation
```bash
# Clone the repository
git clone https://github.com/rahulunair/seekvec
cd seekvec

# Build the Rust implementation
cargo build --release
```

### Python Implementation
```bash
# Install required packages
pip install -r requirements.txt
```

## Usage
### 1. Generate Your Embeddings
```bash
# Generate sample embeddings
python script/gen_embeddings.py
```

This will:
- Load sample data from the AG News dataset
- Generate embeddings using SentenceTransformer
- Create both main and query embedding files

### 2. Run the Search
#### Using Rust LSH:
```bash
cargo run --release
```

#### Using Python LSH:
```bash
python script/lsh_search.py
```

#### Using Python HNSW:
```bash
python script/hnsw_search.py
```

## Implementation Details
### LSH Implementation
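As background for this section, angular LSH hashes a vector by recording which side of several random hyperplanes it falls on, so vectors separated by a small angle tend to share a bucket key. The following is a minimal pure-Python sketch of that general technique, not seekvec's actual code. `num_hash_tables` and `hash_size` follow the parameter names listed under Configuration; `make_hyperplanes`, `angular_hash`, `build_index`, and `query` are hypothetical names, and multi-probe, Bloom filters, and early termination are left out.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_hash_tables, hash_size, dim, seed=0):
    """One set of `hash_size` random Gaussian hyperplanes per table."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(hash_size)]
        for _ in range(num_hash_tables)
    ]


def angular_hash(vector, planes):
    """Pack the sign of each projection into one integer bucket key."""
    key = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vector))
        key = (key << 1) | (1 if dot >= 0.0 else 0)
    return key


def build_index(vectors, tables_planes):
    """For each table, map bucket key -> list of vector ids."""
    tables = [defaultdict(list) for _ in tables_planes]
    for idx, vec in enumerate(vectors):
        for table, planes in zip(tables, tables_planes):
            table[angular_hash(vec, planes)].append(idx)
    return tables


def cosine(a, b):
    """Exact cosine similarity, used only to rank the candidates."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def query(q, vectors, tables, tables_planes, k=10):
    """Collect candidates from matching buckets, then rank by exact cosine."""
    candidates = set()
    for table, planes in zip(tables, tables_planes):
        candidates.update(table.get(angular_hash(q, planes), []))
    ranked = sorted(candidates, key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:k]
```

To try it, create the planes with `make_hyperplanes(num_hash_tables=10, hash_size=16, dim=768)`, pass them to `build_index` along with your vectors, and call `query` with a query vector. Only vectors that share a bucket with the query in at least one table are scored, which is where the speedup over brute force comes from.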
Our LSH implementation uses clever angular hashing optimized for cosine similarity:
- Smart random projection generation
- Multi-probe LSH with early stopping
- Bloom filters for speed
- SIMD-optimized vector operations (Rust)

### HNSW Implementation
The HNSW implementation provides:
- Hierarchical graph structure
- Efficient graph construction
- Logarithmic search complexity

## Performance
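Recall@10, one of the figures in this section, is conventionally the fraction of the exact 10 nearest neighbors that the approximate search also returns. The README does not say how its figure was measured, so the sketch below only shows that usual definition. `recall_at_k` and `mean_recall_at_k` are hypothetical helper names, and the exact neighbor ids are assumed to come from a brute-force search over the same data.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbors that the approximate search found."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


def mean_recall_at_k(approx_results, exact_results, k=10):
    """Average recall@k over a batch of queries (one id list per query)."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)
```

Averaging over many queries matters here, since recall for a single query can swing widely depending on how its neighbors fall into buckets.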
Tested on 1M 768-dimensional embeddings:
- Query time: ~10-50 ms
- Memory usage: ~2-4 GB
- Build time: ~5-10 minutes
- Recall@10: ~0.8-0.9

## Configuration
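The trade-offs noted for the LSH parameters in this section follow from the standard collision probability of random-hyperplane hashing: two vectors at angle θ agree on a single hash bit with probability 1 - θ/π. The sketch below works that out under simplifying assumptions (independent tables, exact bucket match, no multi-probe), so it illustrates the trend rather than seekvec's measured behavior. The function names are hypothetical.

```python
import math


def bit_match_probability(cosine_similarity):
    """Chance that one random hyperplane puts both vectors on the same side."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    angle = math.acos(max(-1.0, min(1.0, cosine_similarity)))
    return 1.0 - angle / math.pi


def candidate_probability(cosine_similarity, hash_size, num_hash_tables):
    """Chance that two vectors share a full bucket key in at least one table."""
    same_bucket = bit_match_probability(cosine_similarity) ** hash_size
    return 1.0 - (1.0 - same_bucket) ** num_hash_tables
```

In this model, raising `num_hash_tables` increases the chance that a true neighbor becomes a candidate, while raising `hash_size` makes buckets more selective, so fewer distant vectors are scored but more tables are needed to keep recall up.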
Key parameters (all auto-tuned but configurable):
```python
# LSH Parameters
num_hash_tables = 10 # More tables = better recall, more memory
hash_size = 16 # Larger size = better precision, slower search

# HNSW Parameters
M = 16 # More connections = better recall, more memory
ef_construction = 200 # Higher ef = better index quality, slower build
```

## Contributing
Want to make this even better? Contributions are welcome! Areas for improvement:
- Additional similarity metrics
- More optimization techniques
- Benchmarking tools
- Documentation improvements

## License
MIT License - Go wild!

## Acknowledgments
- Inspired by various LSH and HNSW papers
- Built with love using Rust and Python
- Tested with Sentence Transformers
- Special thanks to the open-source community

## Contact
- GitHub: [@rahulunair](https://github.com/rahulunair)
- Repository: [seekvec](https://github.com/rahulunair/seekvec)

---

Made with love for the ML community