https://github.com/siddhant-k-code/s3-vectors-benchmark

Host: GitHub
URL: https://github.com/siddhant-k-code/s3-vectors-benchmark
Owner: Siddhant-K-code
License: mit
Created: 2025-11-02T09:25:09.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-11-02T09:25:09.000Z (8 months ago)
Last Synced: 2026-05-30T02:26:56.108Z (29 days ago)
Language: Python
Size: 43.9 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# S3 Vectors Benchmark

A comprehensive benchmark suite for comparing Amazon S3 Vectors against FAISS, NMSLib, and brute-force search methods at scale (10K to 10M vectors).

## Overview

This project provides a complete framework for benchmarking vector similarity search performance across different methods:
- **Amazon S3 Vectors** - AWS-managed vector storage and similarity search built into Amazon S3
- **FAISS** - Facebook AI Similarity Search (HNSW index)
- **NMSLib** - Non-Metric Space Library (HNSW index)
- **Brute-force** - Baseline cosine similarity search
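As a rough illustration of what the brute-force baseline does, the sketch below scans every stored vector and keeps the top matches by cosine similarity. It uses only the standard library; `normalize` and `bruteforce_search` are hypothetical names and are not taken from `src/vector_dbs/bruteforce.py`, which may well be vectorized differently.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def bruteforce_search(query, vectors, k=5):
    """Return the k (index, similarity) pairs most similar to the query."""
    q = normalize(query)
    scored = (
        (idx, sum(a * b for a, b in zip(q, normalize(vec))))
        for idx, vec in enumerate(vectors)
    )
    # nlargest keeps only k candidates in memory while scanning every vector.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = bruteforce_search([1.0, 0.1], vectors, k=2)
print([idx for idx, _ in top])
```

Because every query touches every vector, the cost grows linearly with the vector count, which is why this method serves as the accuracy reference rather than a scalable option.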

The benchmark evaluates:
- **Query latency** across different vector counts
- **Search accuracy** (Recall@K) using the UKBench dataset
- **Scalability** from 10K to 10M vectors
- **Memory efficiency** and resource usage
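One way Recall@K could be computed on UKBench is sketched below. It relies on the dataset's layout of four consecutive images per object, and it assumes that the query image itself is excluded from the ground truth. The function names are illustrative, and the actual logic in `src/evaluate.py` may differ.

```python
def ukbench_group(image_id, group_size=4):
    """Map an image id to its object group (ids 0-3 -> group 0, 4-7 -> group 1)."""
    return image_id // group_size


def recall_at_k(query_id, result_ids, k=5, group_size=4):
    """Fraction of the query's other same-object images found in the top-k results."""
    target = ukbench_group(query_id, group_size)
    # Assumption: the query image is excluded from both hits and ground truth.
    hits = {
        rid for rid in result_ids[:k]
        if rid != query_id and ukbench_group(rid, group_size) == target
    }
    return len(hits) / (group_size - 1)


# Query image 4 belongs to group 1 (ids 4-7); results contain ids 5 and 7.
score = recall_at_k(query_id=4, result_ids=[4, 5, 9, 7, 12], k=5)
print(f"Recall@5 = {score:.3f}")
```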

## Features

- 🚀 **Multiple Vector Databases**: Support for S3 Vectors, FAISS, NMSLib, and brute-force
- 📊 **Comprehensive Metrics**: Query latency, recall, precision, and more
- 📈 **Visualization**: Automatic chart generation for results analysis
- 🔄 **Resume Capability**: Checkpoint support for long-running benchmarks
- 💾 **Embedding Caching**: Efficient storage and retrieval of embeddings
- 🎯 **UKBench Dataset**: Standard evaluation dataset with ground truth
- ⚙️ **Configurable**: YAML-based configuration for all parameters
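The resume capability could work along the lines of the following sketch: each completed run is recorded in a small JSON file, and a restarted benchmark skips the keys it finds there. The file layout, the key format, and the function names are assumptions made for illustration, not the project's actual checkpoint format.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of completed run keys, or an empty set if no checkpoint exists."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f)["completed"])


def save_checkpoint(path, completed):
    """Write atomically so an interrupted save never leaves a truncated file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"completed": sorted(completed)}, f)
    os.replace(tmp_path, path)


def run_with_resume(path, run_keys, run_one):
    """Run each key once, skipping keys recorded by an earlier invocation."""
    completed = load_checkpoint(path)
    for key in run_keys:
        if key in completed:
            continue
        run_one(key)
        completed.add(key)
        save_checkpoint(path, completed)
    return completed
```

Saving after every run costs one small write per configuration, which is negligible next to a benchmark pass over millions of vectors.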

## Prerequisites

- **Python**: 3.9 or higher
- **uv**: Fast Python package installer (recommended), or pip as a fallback
- **AWS Account**: For S3 Vectors testing
- **Storage**: Sufficient disk space for datasets (~10-50 GB)
- **Memory**: 8 GB+ RAM recommended
- **GPU** (optional): For faster embedding generation
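To judge whether a given vector count fits in memory, a back-of-the-envelope estimate of the raw embedding matrix helps. The sketch below assumes embeddings are stored as float32 (4 bytes per value), which this README does not state explicitly, and it ignores index overhead such as HNSW graph links.

```python
def raw_vector_bytes(num_vectors, dimension, bytes_per_value=4):
    """Size of the raw embedding matrix, assuming float32 values (4 bytes each)."""
    return num_vectors * dimension * bytes_per_value


for count in (10_200, 1_000_000, 10_000_000):
    gib = raw_vector_bytes(count, 384) / 2**30
    print(f"{count:>10,} x 384 float32: {gib:6.2f} GiB")
```

Under that assumption, 10M vectors of 384 dimensions occupy about 14.3 GiB before any index is built, so the largest in-memory runs are likely to need more than the 8 GB recommended above.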

## Installation

### 1. Clone the repository

```bash
git clone https://github.com/Siddhant-K-code/s3-vectors-benchmark.git
cd s3-vectors-benchmark
```

### 2. Install uv (if not already installed)

```bash
# macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or with pip
pip install uv
```

### 3. Install dependencies with uv

**Option A: Using uv sync (recommended)**

```bash
# Install project and dependencies (creates .venv automatically)
uv sync

# Or with dev dependencies for testing
uv sync --dev
```

**Option B: Manual virtual environment**

```bash
# Create virtual environment
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate

# Install project in editable mode with dependencies
uv pip install -e .

# Or install with dev dependencies
uv pip install -e ".[dev]"
```

### 4. Configure AWS credentials

```bash
aws configure
```

Or set environment variables:
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
```

### 5. Create configuration file

```bash
cp config.yaml.example config.yaml
# Edit config.yaml with your settings
```

Required settings in `config.yaml`:
- AWS region and S3 bucket name
- Dataset cache directories
- Benchmark parameters
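A quick sanity check of the loaded configuration might look like the sketch below. It assumes `config.yaml` has already been parsed into a plain dictionary, and it uses only the key names shown in the Configuration section of this README. `REQUIRED_KEYS` and `missing_settings` are hypothetical helpers, not part of `src/config.py`, and the dataset cache directories are left out because their key names are not documented here.

```python
REQUIRED_KEYS = {
    "aws": ["region", "bucket_name"],
    "s3_vectors": ["index_name", "metric_type"],
    "benchmark": ["vector_counts", "dimensions", "topk"],
}


def missing_settings(config):
    """Return dotted names of required settings that are absent or empty."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            if values.get(key) in (None, "", []):
                missing.append(f"{section}.{key}")
    return missing


config = {
    "aws": {"region": "us-east-1", "bucket_name": ""},
    "s3_vectors": {"index_name": "benchmark-index", "metric_type": "cosine"},
}
print(missing_settings(config))
```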

## Quick Start

### 1. Download and prepare datasets

```bash
uv run python src/main.py prepare-data --dataset ukbench --download
uv run python src/main.py prepare-data --dataset coco --download
```

### 2. Generate embeddings

```bash
uv run python src/main.py generate-embeddings --model vit-s --dimension 384
```

This will:
- Load images from the UKBench dataset
- Generate embeddings using the DINOv2-small model
- Cache the embeddings to an HDF5 file

### 3. Run benchmark

```bash
uv run python src/main.py benchmark \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 10200 100000 1000000 \
  --methods s3_vectors faiss nmslib \
  --quick
```

The `--quick` flag runs a smaller test with fewer queries.

### 4. Generate visualizations

```bash
uv run python src/main.py visualize --latest
```

This generates three charts:
- `processing_time_ratio.png` - Processing time normalized to the smallest dataset
- `search_accuracy.png` - Recall@K across different vector counts
- `processing_time_ms.png` - Query latency in milliseconds (S3 Vectors)
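The normalization behind the processing time ratio chart can be sketched as follows: each latency is divided by the latency measured at the smallest vector count, so every method starts at 1.0. The function name and the numbers are placeholders for illustration, not measured results or code from `src/visualize.py`.

```python
def processing_time_ratio(latency_ms_by_count):
    """Divide each latency by the latency measured at the smallest vector count."""
    baseline_count = min(latency_ms_by_count)
    baseline = latency_ms_by_count[baseline_count]
    return {
        count: latency / baseline
        for count, latency in sorted(latency_ms_by_count.items())
    }


# Placeholder latencies for illustration only, not measured results.
example = {10_200: 2.0, 100_000: 3.0, 1_000_000: 5.0}
ratios = processing_time_ratio(example)
print(ratios)
```

Plotting ratios instead of absolute times makes the growth rates of methods with very different base latencies comparable on one axis.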

## Usage

### Data Preparation

Download and prepare datasets:

```bash
# UKBench only
uv run python src/main.py prepare-data --dataset ukbench --download

# COCO only
uv run python src/main.py prepare-data --dataset coco --download

# All datasets
uv run python src/main.py prepare-data --dataset all --download
```

### Embedding Generation

Generate embeddings with different models:

```bash
# DINOv2-small (384-dim)
uv run python src/main.py generate-embeddings --model vit-s

# DINOv2-base (768-dim)
uv run python src/main.py generate-embeddings --model vit-b

# DINOv2-large (1024-dim)
uv run python src/main.py generate-embeddings --model vit-l
```

### Benchmark Execution

Run full benchmark suite:

```bash
uv run python src/main.py benchmark \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 10200 100000 500000 1000000 10000000 \
  --dimensions 384 \
  --methods s3_vectors faiss nmslib bruteforce \
  --output results/full_benchmark.json
```

Options:
- `--vectors`: Vector counts to test (multiple values)
- `--dimensions`: Vector dimensions to test
- `--methods`: Methods to benchmark (s3_vectors, faiss, nmslib, bruteforce)
- `--quick`: Quick test with fewer queries
- `--dry-run`: Estimate time/resources without running

### Testing S3 Vectors

Test S3 Vectors connection and basic operations:

```bash
uv run python src/main.py test-s3 \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 1000000 \
  --dimension 384
```

### Visualization

Generate charts from results:

```bash
# Use latest results file
uv run python src/main.py visualize --latest

# Specify results file
uv run python src/main.py visualize \
  --results-file results/benchmark_results_20250102_120000.json \
  --output-dir results/charts
```

## Project Structure

```
s3-vectors-benchmark/
├── README.md               # This file
├── pyproject.toml          # Modern Python package configuration (uv)
├── requirements.txt        # Legacy pip requirements (for reference)
├── setup.py                # Package setup (legacy)
├── config.yaml.example     # Configuration template
├── .gitignore              # Git ignore rules
├── .python-version         # Python version specification
├── src/                    # Source code
│   ├── __init__.py
│   ├── main.py             # CLI entry point
│   ├── config.py           # Configuration management
│   ├── data_loader.py      # Dataset loading
│   ├── embeddings.py       # Embedding generation
│   ├── benchmark.py        # Benchmark orchestration
│   ├── evaluate.py         # Accuracy evaluation
│   ├── visualize.py        # Chart generation
│   ├── utils.py            # Utility functions
│   └── vector_dbs/         # Vector database implementations
│       ├── base.py         # Abstract base class
│       ├── s3_vectors.py   # S3 Vectors implementation
│       ├── faiss_db.py     # FAISS implementation
│       ├── nmslib_db.py    # NMSLib implementation
│       └── bruteforce.py   # Brute-force baseline
├── tests/                  # Test suite
├── notebooks/              # Jupyter notebooks
├── docs/                   # Documentation
├── data/                   # Datasets (git-ignored)
└── results/                # Results (git-ignored)
```

## Configuration

Edit `config.yaml` to customize:

### AWS Configuration
```yaml
aws:
  region: us-east-1
  profile: default
  bucket_name: your-vector-bucket-name
```

### S3 Vectors Configuration
```yaml
s3_vectors:
  index_name: benchmark-index
  metric_type: cosine  # or euclidean
  batch_size: 500
```
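The `batch_size` setting suggests that vectors are uploaded in fixed-size groups. A minimal sketch of that batching follows. `batched` and `upload_all` are hypothetical helpers, and `put_batch` stands in for whatever function wraps the S3 Vectors PutVectors call in `src/vector_dbs/s3_vectors.py`.

```python
def batched(items, batch_size=500):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upload_all(vectors, put_batch, batch_size=500):
    """Send vectors batch by batch; put_batch is a caller-supplied upload function."""
    sent = 0
    for batch in batched(vectors, batch_size):
        put_batch(batch)  # e.g. a wrapper around the S3 Vectors PutVectors call
        sent += len(batch)
    return sent


# Stand-in uploader that only records batch sizes, so the sketch runs offline.
sizes = []
total = upload_all(list(range(1201)), lambda batch: sizes.append(len(batch)))
print(total, sizes)
```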

### Benchmark Configuration
```yaml
benchmark:
  vector_counts: [10200, 100000, 500000, 1000000, 10000000]
  dimensions: [384, 768, 1024]
  topk: 5
  num_queries: 100
  num_repeats: 3
```
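The `num_queries` and `num_repeats` settings imply that each query is timed several times and the samples are then aggregated. The sketch below shows one plausible way to do that with the standard library. `measure_latency_ms` is a hypothetical name, and the statistics actually reported by `src/benchmark.py` may differ.

```python
import statistics
import time


def measure_latency_ms(search, queries, num_repeats=3):
    """Time every query num_repeats times and summarize the per-call latency."""
    samples = []
    for _ in range(num_repeats):
        for query in queries:
            start = time.perf_counter()
            search(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "count": len(samples),
    }


# Dummy search function; replace with a real index lookup.
stats = measure_latency_ms(lambda q: sum(q), [[1, 2], [3, 4]], num_repeats=3)
print(stats["count"])
```

`time.perf_counter` is used because it is monotonic and has the highest available resolution, which matters for in-memory searches that complete in well under a millisecond.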

## Results Analysis

After running benchmarks, results are saved to JSON files with:

- **Raw measurements**: Query latency, result IDs, similarities
- **Evaluation metrics**: Recall@K, Precision@K, aggregated statistics
- **Metadata**: Configuration, timestamps, vector counts

### Reading Results

Results are stored in JSON format:

```python
import json

with open("results/benchmark_results_20250102.json", "r") as f:
    results = json.load(f)

# Access evaluation metrics
evaluation = results["evaluation"]
for config_key, metrics in evaluation.items():
    print(f"{config_key}:")
    print(f"  Recall@K: {metrics['recall_at_k']['mean']:.3f}")
    print(f"  Query time: {metrics['query_time_ms']['mean']:.2f} ms")
```

### Charts

Charts are automatically generated showing:
1. **Processing Time Ratio**: Normalized query latency across methods
2. **Search Accuracy**: Recall@K across vector counts
3. **Processing Time (ms)**: S3 Vectors query latency

## Troubleshooting

### AWS Credentials Not Found

```bash
# Verify credentials
aws sts get-caller-identity

# Or set environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
```

### S3 Bucket Not Found

Ensure the bucket exists and is accessible:

```bash
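# S3 Vectors uses vector buckets, which `aws s3 ls` does not list
# (requires a recent AWS CLI that includes the s3vectors commands)
aws s3vectors get-vector-bucket --vector-bucket-name your-vector-bucket-name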
```

### Out of Memory

For large datasets:
- Reduce `batch_size` in embeddings config
- Process in smaller chunks
- Use GPU for embedding generation

### Dataset Download Fails

Some datasets may require manual download. Check:
- Network connectivity
- Sufficient disk space
- Dataset availability

### uv Installation Issues

If `uv` is not found:
- Ensure it's in your PATH
- Restart your terminal after installation
- Use `pip install uv` as a fallback

## Contributing

Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests (run with `uv run pytest`)
5. Submit a pull request

## License

MIT License - see LICENSE file for details

## Citation

If you use this benchmark in your research, please cite:

```bibtex
@software{s3_vectors_benchmark,
title={S3 Vectors Benchmark},
author={Siddhant Khare},
year={2025},
url={https://github.com/siddhant-k-code/s3-vectors-benchmark}
}
```

## Acknowledgments

- UKBench dataset: http://vis.uky.edu/~stewe/ukbench/
- Microsoft COCO dataset: https://cocodataset.org/
- DINOv2 models: https://github.com/facebookresearch/dinov2
- FAISS: https://github.com/facebookresearch/faiss
- NMSLib: https://github.com/nmslib/nmslib

## Support

For issues and questions:
- Open an issue on GitHub
- Check documentation in `docs/` directory
- Review example notebooks in `notebooks/`
