https://github.com/siddhant-k-code/s3-vectors-benchmark
- Host: GitHub
- URL: https://github.com/siddhant-k-code/s3-vectors-benchmark
- Owner: Siddhant-K-code
- License: mit
- Created: 2025-11-02T09:25:09.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-11-02T09:25:09.000Z (8 months ago)
- Last Synced: 2026-05-30T02:26:56.108Z (29 days ago)
- Language: Python
- Size: 43.9 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# S3 Vectors Benchmark
A comprehensive benchmark suite for comparing Amazon S3 Vectors against FAISS, NMSLib, and brute-force search methods at scale (10K to 10M vectors).
## Overview
This project provides a complete framework for benchmarking vector similarity search performance across different methods:
- **Amazon S3 Vectors** - AWS managed vector database
- **FAISS** - Facebook AI Similarity Search (HNSW index)
- **NMSLib** - Non-Metric Space Library (HNSW index)
- **Brute-force** - Baseline cosine similarity search
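To make the baseline concrete, here is a minimal sketch of an exhaustive cosine-similarity search using only the standard library. The function names are hypothetical and this is not the code in `src/vector_dbs/bruteforce.py`; it only illustrates the idea of scoring every stored vector against the query.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query, vectors, k=5):
    """Score every stored vector and return the k best (index, score) pairs."""
    scored = ((i, cosine_similarity(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-D vectors for illustration only.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = brute_force_search([1.0, 0.1], vectors, k=2)
print(top)
```

Because every vector is scored, the cost grows linearly with the number of vectors, which is why approximate indexes such as HNSW are the interesting comparison at larger scales.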
The benchmark evaluates:
- **Query latency** across different vector counts
- **Search accuracy** (Recall@K) using the UKBench dataset
- **Scalability** from 10K to 10M vectors
- **Memory efficiency** and resource usage
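As a rough illustration of the accuracy metric, the sketch below computes Recall@K for one query. It assumes the standard UKBench layout, in which images are grouped in consecutive sets of four views of the same object, and it assumes the query itself is excluded from the relevant set. Both helper names are hypothetical, and the project's `evaluate.py` may define ground truth differently.

```python
def ukbench_relevant_ids(query_id, group_size=4):
    """IDs in the query's object group, excluding the query itself (assumed convention)."""
    start = (query_id // group_size) * group_size
    return {i for i in range(start, start + group_size) if i != query_id}


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear among the first k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


relevant = ukbench_relevant_ids(5)  # images 4, 6 and 7 share image 5's group
score = recall_at_k([6, 90, 4, 12, 33], relevant, k=5)
print(f"Recall@5 = {score:.3f}")
```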
## Features
- 🚀 **Multiple Vector Databases**: Support for S3 Vectors, FAISS, NMSLib, and brute-force
- 📊 **Comprehensive Metrics**: Query latency, recall, precision, and more
- 📈 **Visualization**: Automatic chart generation for results analysis
- 🔄 **Resume Capability**: Checkpoint support for long-running benchmarks
- 💾 **Embedding Caching**: Efficient storage and retrieval of embeddings
- 🎯 **UKBench Dataset**: Standard evaluation dataset with ground truth
- ⚙️ **Configurable**: YAML-based configuration for all parameters
## Prerequisites
- **Python**: 3.9 or higher
- **uv**: Fast Python package installer (recommended), with pip as a fallback
- **AWS Account**: For S3 Vectors testing
- **Storage**: Sufficient disk space for datasets (~10-50 GB)
- **Memory**: 8 GB+ RAM recommended
- **GPU** (optional): For faster embedding generation
## Installation
### 1. Clone the repository
```bash
git clone https://github.com/Siddhant-K-code/s3-vectors-benchmark.git
cd s3-vectors-benchmark
```
### 2. Install uv (if not already installed)
```bash
# macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or with pip
pip install uv
```
### 3. Install dependencies with uv
**Option A: Using uv sync (recommended)**
```bash
# Install project and dependencies (creates .venv automatically)
uv sync
# Or with dev dependencies for testing
uv sync --dev
```
**Option B: Manual virtual environment**
```bash
# Create virtual environment
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install project in editable mode with dependencies
uv pip install -e .
# Or install with dev dependencies
uv pip install -e ".[dev]"
```
### 4. Configure AWS credentials
```bash
aws configure
```
Or set environment variables:
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
```
### 5. Create configuration file
```bash
cp config.yaml.example config.yaml
# Edit config.yaml with your settings
```
Required settings in `config.yaml`:
- AWS region and S3 bucket name
- Dataset cache directories
- Benchmark parameters
## Quick Start
### 1. Download and prepare datasets
```bash
uv run python src/main.py prepare-data --dataset ukbench --download
uv run python src/main.py prepare-data --dataset coco --download
```
### 2. Generate embeddings
```bash
uv run python src/main.py generate-embeddings --model vit-s --dimension 384
```
This will:
- Load images from the UKBench dataset
- Generate embeddings using the DINOv2-small model
- Cache embeddings to an HDF5 file
### 3. Run benchmark
```bash
uv run python src/main.py benchmark \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 10200 100000 1000000 \
  --methods s3_vectors faiss nmslib \
  --quick
```
The `--quick` flag runs a smaller test with fewer queries.
### 4. Generate visualizations
```bash
uv run python src/main.py visualize --latest
```
This generates three charts:
- `processing_time_ratio.png` - Processing time normalized to smallest dataset
- `search_accuracy.png` - Recall@K across different vector counts
- `processing_time_ms.png` - Query latency in milliseconds (S3 Vectors)
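The normalization behind the processing time ratio chart can be sketched as follows. This assumes the ratio is each mean latency divided by the mean latency at the smallest vector count for the same method; the function name is hypothetical and the numbers are placeholders, not measured results.

```python
def processing_time_ratio(latency_ms_by_count):
    """Divide each mean latency by the latency at the smallest vector count."""
    baseline_count = min(latency_ms_by_count)
    baseline = latency_ms_by_count[baseline_count]
    if baseline <= 0:
        raise ValueError("baseline latency must be positive")
    return {
        count: latency / baseline
        for count, latency in sorted(latency_ms_by_count.items())
    }


# Placeholder latencies in milliseconds, for illustration only.
example = {10200: 2.0, 100000: 3.0, 1000000: 5.0}
ratios = processing_time_ratio(example)
print(ratios)
```

A ratio close to 1.0 at larger counts would indicate that latency barely changes as the dataset grows.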
## Usage
### Data Preparation
Download and prepare datasets:
```bash
# UKBench only
uv run python src/main.py prepare-data --dataset ukbench --download
# COCO only
uv run python src/main.py prepare-data --dataset coco --download
# All datasets
uv run python src/main.py prepare-data --dataset all --download
```
### Embedding Generation
Generate embeddings with different models:
```bash
# DINOv2-small (384-dim)
uv run python src/main.py generate-embeddings --model vit-s
# DINOv2-base (768-dim)
uv run python src/main.py generate-embeddings --model vit-b
# DINOv2-large (1024-dim)
uv run python src/main.py generate-embeddings --model vit-l
```
### Benchmark Execution
Run full benchmark suite:
```bash
uv run python src/main.py benchmark \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 10200 100000 500000 1000000 10000000 \
  --dimensions 384 \
  --methods s3_vectors faiss nmslib bruteforce \
  --output results/full_benchmark.json
```
Options:
- `--vectors`: Vector counts to test (multiple values)
- `--dimensions`: Vector dimensions to test
- `--methods`: Methods to benchmark (s3_vectors, faiss, nmslib, bruteforce)
- `--quick`: Quick test with fewer queries
- `--dry-run`: Estimate time/resources without running
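To show how multi-value flags such as `--vectors` and `--methods` are typically interpreted, here is a small `argparse` sketch. It is an illustration only, not the parser in `src/main.py`; the defaults and the `build_parser` name are assumptions.

```python
import argparse

METHODS = ("s3_vectors", "faiss", "nmslib", "bruteforce")


def build_parser():
    """Hypothetical parser mirroring the documented benchmark options."""
    parser = argparse.ArgumentParser(prog="benchmark")
    # nargs="+" collects one or more space-separated values into a list.
    parser.add_argument("--vectors", type=int, nargs="+", required=True)
    parser.add_argument("--dimensions", type=int, nargs="+", default=[384])
    parser.add_argument("--methods", nargs="+", choices=METHODS, default=list(METHODS))
    parser.add_argument("--quick", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--vectors", "10200", "100000", "--methods", "faiss", "nmslib", "--quick"]
)
print(args.vectors, args.methods, args.quick, args.dry_run)
```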
### Testing S3 Vectors
Test S3 Vectors connection and basic operations:
```bash
uv run python src/main.py test-s3 \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 1000000 \
  --dimension 384
```
### Visualization
Generate charts from results:
```bash
# Use latest results file
uv run python src/main.py visualize --latest
# Specify results file
uv run python src/main.py visualize \
  --results-file results/benchmark_results_20250102_120000.json \
  --output-dir results/charts
```
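One plausible way to resolve "latest" is to pick the results file whose timestamped name sorts last. The sketch below assumes file names follow the `benchmark_results_YYYYMMDD_HHMMSS.json` pattern shown above; whether `--latest` actually works this way is an assumption, and `latest_results_file` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def latest_results_file(results_dir, pattern="benchmark_results_*.json"):
    """Return the matching file whose name sorts last, or None if there is none."""
    candidates = sorted(Path(results_dir).glob(pattern))
    return candidates[-1] if candidates else None


# Demonstrate against a throwaway directory instead of a real results/ folder.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20250101_090000", "20250102_120000", "20241231_235959"):
        (Path(tmp) / f"benchmark_results_{stamp}.json").write_text("{}")
    empty_dir = Path(tmp) / "empty"
    empty_dir.mkdir()
    newest_name = latest_results_file(tmp).name
    empty_result = latest_results_file(empty_dir)

print(newest_name)
```

Sorting by name works here because zero-padded timestamps order lexicographically the same way they order chronologically.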
## Project Structure
```
s3-vectors-benchmark/
├── README.md               # This file
├── pyproject.toml          # Modern Python package configuration (uv)
├── requirements.txt        # Legacy pip requirements (for reference)
├── setup.py                # Package setup (legacy)
├── config.yaml.example     # Configuration template
├── .gitignore              # Git ignore rules
├── .python-version         # Python version specification
├── src/                    # Source code
│   ├── __init__.py
│   ├── main.py             # CLI entry point
│   ├── config.py           # Configuration management
│   ├── data_loader.py      # Dataset loading
│   ├── embeddings.py       # Embedding generation
│   ├── benchmark.py        # Benchmark orchestration
│   ├── evaluate.py         # Accuracy evaluation
│   ├── visualize.py        # Chart generation
│   ├── utils.py            # Utility functions
│   └── vector_dbs/         # Vector database implementations
│       ├── base.py         # Abstract base class
│       ├── s3_vectors.py   # S3 Vectors implementation
│       ├── faiss_db.py     # FAISS implementation
│       ├── nmslib_db.py    # NMSLib implementation
│       └── bruteforce.py   # Brute-force baseline
├── tests/                  # Test suite
├── notebooks/              # Jupyter notebooks
├── docs/                   # Documentation
├── data/                   # Datasets (git-ignored)
└── results/                # Results (git-ignored)
```
## Configuration
Edit `config.yaml` to customize:
### AWS Configuration
```yaml
aws:
  region: us-east-1
  profile: default
  bucket_name: your-vector-bucket-name
```
### S3 Vectors Configuration
```yaml
s3_vectors:
  index_name: benchmark-index
  metric_type: cosine  # or euclidean
  batch_size: 500
```
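The `batch_size` setting suggests that vectors are uploaded in groups. A minimal sketch of that pattern is shown below. It assumes the boto3 `s3vectors` client's `put_vectors` call with `vectorBucketName`, `indexName`, and `vectors` arguments; verify the request shape against the boto3 documentation before relying on it. `upload_in_batches` and `RecordingClient` are hypothetical names, and the stand-in client exists only so the example runs without AWS access.

```python
def upload_in_batches(client, bucket_name, index_name, embeddings, batch_size=500):
    """Send embeddings to put_vectors in groups of batch_size; return the request count."""
    requests_sent = 0
    for start in range(0, len(embeddings), batch_size):
        chunk = embeddings[start:start + batch_size]
        # Build records per batch so only one batch is materialized at a time.
        records = [
            {"key": str(start + offset), "data": {"float32": [float(x) for x in vector]}}
            for offset, vector in enumerate(chunk)
        ]
        client.put_vectors(
            vectorBucketName=bucket_name,
            indexName=index_name,
            vectors=records,
        )
        requests_sent += 1
    return requests_sent


class RecordingClient:
    """Stand-in for boto3.client("s3vectors"); it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def put_vectors(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
sent = upload_in_batches(
    client, "your-vector-bucket-name", "benchmark-index", [[0.1, 0.2]] * 1200
)
print(sent, [len(call["vectors"]) for call in client.calls])
```

With real credentials, the stand-in would be replaced by a client created with `boto3.client("s3vectors")`.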
### Benchmark Configuration
```yaml
benchmark:
  vector_counts: [10200, 100000, 500000, 1000000, 10000000]
  dimensions: [384, 768, 1024]
  topk: 5
  num_queries: 100
  num_repeats: 3
```
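To illustrate how `num_queries` and `num_repeats` might interact, the sketch below times a search callable once per query per repeat. This is an assumed interpretation of those settings, not the logic in `src/benchmark.py`, and `time_queries` is a hypothetical helper.

```python
import time


def time_queries(search_fn, queries, num_repeats=3):
    """Run every query num_repeats times; return per-call latencies in milliseconds."""
    latencies_ms = []
    for _ in range(num_repeats):
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


# A trivial stand-in for a real vector search call.
queries = [[1.0, 2.0, 3.0]] * 4
latencies = time_queries(lambda q: sum(q), queries, num_repeats=3)
print(f"{len(latencies)} measurements collected")
```

Under this interpretation, 100 queries with 3 repeats would yield 300 latency samples per method and vector count.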
## Results Analysis
After running benchmarks, results are saved to JSON files with:
- **Raw measurements**: Query latency, result IDs, similarities
- **Evaluation metrics**: Recall@K, Precision@K, aggregated statistics
- **Metadata**: Configuration, timestamps, vector counts
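The aggregated statistics can be thought of as a reduction over the raw latency samples. The sketch below is one way to do that with the standard library; only the `mean` key is confirmed by the results format, so the other keys and the nearest-rank percentile choice are assumptions.

```python
import math
import statistics


def summarize(latencies_ms):
    """Reduce raw latency samples to a few summary statistics."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile: the smallest sample covering 95% of the data.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "count": len(ordered),
    }


# Placeholder samples in milliseconds, for illustration only.
summary = summarize([1.0, 2.0, 3.0, 4.0, 10.0])
print(summary)
```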
### Reading Results
Results are stored in JSON format:
```python
import json

with open("results/benchmark_results_20250102.json", "r") as f:
    results = json.load(f)

# Access evaluation metrics
evaluation = results["evaluation"]
for config_key, metrics in evaluation.items():
    print(f"{config_key}:")
    print(f"  Recall@K: {metrics['recall_at_k']['mean']:.3f}")
    print(f"  Query time: {metrics['query_time_ms']['mean']:.2f} ms")
```
### Charts
Charts are automatically generated showing:
1. **Processing Time Ratio**: Query latency normalized to the smallest dataset, compared across methods
2. **Search Accuracy**: Recall@K across vector counts
3. **Processing Time (ms)**: S3 Vectors query latency
## Troubleshooting
### AWS Credentials Not Found
```bash
# Verify credentials
aws sts get-caller-identity
# Or set environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
```
### S3 Bucket Not Found
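Ensure the bucket exists and is accessible. S3 Vectors stores data in vector buckets, which are separate from general-purpose S3 buckets, so check the type you configured:
```bash
aws s3 ls s3://your-bucket-name    # general-purpose bucket
aws s3vectors list-vector-buckets  # vector buckets (needs a recent AWS CLI)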
```
### Out of Memory
For large datasets:
- Reduce `batch_size` in the embeddings config
- Process in smaller chunks
- Use GPU for embedding generation
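A quick back-of-the-envelope estimate helps decide whether a run will fit in memory. The sketch below assumes embeddings are held as 32-bit floats and counts only the raw vectors; index structures such as HNSW graphs add overhead that is not modeled here. `raw_vector_bytes` is a hypothetical helper.

```python
def raw_vector_bytes(num_vectors, dimension, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_vectors * dimension * bytes_per_value


for count in (10_200, 1_000_000, 10_000_000):
    gib = raw_vector_bytes(count, 384) / 1024**3
    print(f"{count:>10,} x 384 float32 = {gib:.2f} GiB")
```

At 10M vectors and 384 dimensions the raw data alone is about 14.3 GiB, which exceeds the 8 GB recommendation, so the largest configurations need more RAM or chunked processing.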
### Dataset Download Fails
Some datasets may require manual download. Check:
- Network connectivity
- Sufficient disk space
- Dataset availability
### uv Installation Issues
If `uv` is not found:
- Ensure it's in your PATH
- Restart your terminal after installation
- Use `pip install uv` as a fallback
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests (run with `uv run pytest`)
5. Submit a pull request
## License
MIT License - see LICENSE file for details
## Citation
If you use this benchmark in your research, please cite:
```bibtex
@software{s3_vectors_benchmark,
title={S3 Vectors Benchmark},
author={Siddhant Khare},
year={2025},
url={https://github.com/siddhant-k-code/s3-vectors-benchmark}
}
```
## Acknowledgments
- UKBench dataset: http://vis.uky.edu/~stewe/ukbench/
- Microsoft COCO dataset: https://cocodataset.org/
- DINOv2 models: https://github.com/facebookresearch/dinov2
- FAISS: https://github.com/facebookresearch/faiss
- NMSLib: https://github.com/nmslib/nmslib
## Support
For issues and questions:
- Open an issue on GitHub
- Check documentation in `docs/` directory
- Review example notebooks in `notebooks/`