{"id":50411803,"url":"https://github.com/siddhant-k-code/s3-vectors-benchmark","last_synced_at":"2026-05-31T04:02:33.243Z","repository":{"id":361299564,"uuid":"1088064735","full_name":"Siddhant-K-code/s3-vectors-benchmark","owner":"Siddhant-K-code","description":null,"archived":false,"fork":false,"pushed_at":"2025-11-02T09:25:09.000Z","size":45,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-30T02:26:56.108Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Siddhant-K-code.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-02T09:25:09.000Z","updated_at":"2026-05-29T11:54:31.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Siddhant-K-code/s3-vectors-benchmark","commit_stats":null,"previous_names":["siddhant-k-code/s3-vectors-benchmark"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Siddhant-K-code/s3-vectors-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fs3-vectors-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fs3-vectors-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fs3-vectors-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fs3-vectors-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Siddhant-K-code","download_url":"https://codeload.github.com/Siddhant-K-code/s3-vectors-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Siddhant-K-code%2Fs3-vectors-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33718446,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-31T04:02:30.737Z","updated_at":"2026-05-31T04:02:33.235Z","avatar_url":"https://github.com/Siddhant-K-code.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# S3 Vectors Benchmark\n\nA comprehensive benchmark suite for comparing Amazon S3 Vectors against FAISS, NMSLib, and brute-force search methods at scale (10K to 10M vectors).\n\n## Overview\n\nThis project provides a complete framework for benchmarking vector similarity search performance across different methods:\n- **Amazon S3 Vectors** - AWS managed vector database\n- **FAISS** - Facebook AI Similarity Search (HNSW index)\n- **NMSLib** - Non-Metric Space Library (HNSW index)\n- **Brute-force** - Baseline cosine similarity search\n\nThe benchmark evaluates:\n- **Query latency** across different vector counts\n- **Search accuracy** (Recall@K) using UKBench dataset\n- **Scalability** from 10K to 10M vectors\n- **Memory efficiency** and resource usage\n\n## Features\n\n- 🚀 **Multiple Vector Databases**: Support for S3 Vectors, FAISS, NMSLib, and brute-force\n- 📊 **Comprehensive Metrics**: Query latency, recall, precision, and more\n- 📈 **Visualization**: Automatic chart generation for results analysis\n- 🔄 **Resume Capability**: Checkpoint support for long-running benchmarks\n- 💾 **Embedding Caching**: Efficient storage and retrieval of embeddings\n- 🎯 **UKBench Dataset**: Standard evaluation dataset with ground truth\n- ⚙️ **Configurable**: YAML-based configuration for all parameters\n\n## Prerequisites\n\n- **Python**: 3.9 or higher\n- **uv**: Fast Python package installer (recommended) or pip as fallback\n- **AWS Account**: For S3 Vectors testing\n- **Storage**: Sufficient disk space for datasets (~10-50 GB)\n- **Memory**: 8 GB+ RAM recommended\n- **GPU** (optional): For faster embedding generation\n\n## Installation\n\n### 1. Clone the repository\n\n```bash\ngit clone https://github.com/Siddhant-K-code/s3-vectors-benchmark.git\ncd s3-vectors-benchmark\n```\n\n### 2. Install uv (if not already installed)\n\n```bash\n# macOS and Linux\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n\n# Windows\npowershell -c \"irm https://astral.sh/uv/install.ps1 | iex\"\n\n# Or with pip\npip install uv\n```\n\n### 3. Install dependencies with uv\n\n**Option A: Using uv sync (recommended)**\n\n```bash\n# Install project and dependencies (creates .venv automatically)\nuv sync\n\n# Or with dev dependencies for testing\nuv sync --dev\n```\n\n**Option B: Manual virtual environment**\n\n```bash\n# Create virtual environment\nuv venv\nsource .venv/bin/activate  # On Windows: .venv\\Scripts\\activate\n\n# Install project in editable mode with dependencies\nuv pip install -e .\n\n# Or install with dev dependencies\nuv pip install -e \".[dev]\"\n```\n\n### 4. Configure AWS credentials\n\n```bash\naws configure\n```\n\nOr set environment variables:\n```bash\nexport AWS_ACCESS_KEY_ID=your_access_key\nexport AWS_SECRET_ACCESS_KEY=your_secret_key\nexport AWS_DEFAULT_REGION=us-east-1\n```\n\n### 5. Create configuration file\n\n```bash\ncp config.yaml.example config.yaml\n# Edit config.yaml with your settings\n```\n\nRequired settings in `config.yaml`:\n- AWS region and S3 bucket name\n- Dataset cache directories\n- Benchmark parameters\n\n## Quick Start\n\n### 1. Download and prepare datasets\n\n```bash\nuv run python src/main.py prepare-data --dataset ukbench --download\nuv run python src/main.py prepare-data --dataset coco --download\n```\n\n### 2. Generate embeddings\n\n```bash\nuv run python src/main.py generate-embeddings --model vit-s --dimension 384\n```\n\nThis will:\n- Load images from UKBench dataset\n- Generate embeddings using DINOv2-small model\n- Cache embeddings to HDF5 file\n\n### 3. Run benchmark\n\n```bash\nuv run python src/main.py benchmark \\\n  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \\\n  --vectors 10200 100000 1000000 \\\n  --methods s3_vectors faiss nmslib \\\n  --quick\n```\n\nThe `--quick` flag runs a smaller test with fewer queries.\n\n### 4. Generate visualizations\n\n```bash\nuv run python src/main.py visualize --latest\n```\n\nThis generates three charts:\n- `processing_time_ratio.png` - Processing time normalized to smallest dataset\n- `search_accuracy.png` - Recall@K across different vector counts\n- `processing_time_ms.png` - Query latency in milliseconds (S3 Vectors)\n\n## Usage\n\n### Data Preparation\n\nDownload and prepare datasets:\n\n```bash\n# UKBench only\nuv run python src/main.py prepare-data --dataset ukbench --download\n\n# COCO only\nuv run python src/main.py prepare-data --dataset coco --download\n\n# All datasets\nuv run python src/main.py prepare-data --dataset all --download\n```\n\n### Embedding Generation\n\nGenerate embeddings with different models:\n\n```bash\n# DINOv2-small (384-dim)\nuv run python src/main.py generate-embeddings --model vit-s\n\n# DINOv2-base (768-dim)\nuv run python src/main.py generate-embeddings --model vit-b\n\n# DINOv2-large (1024-dim)\nuv run python src/main.py generate-embeddings --model vit-l\n```\n\n### Benchmark Execution\n\nRun full benchmark suite:\n\n```bash\nuv run python src/main.py benchmark \\\n  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \\\n  --vectors 10200 100000 500000 1000000 10000000 \\\n  --dimensions 384 \\\n  --methods s3_vectors faiss nmslib bruteforce \\\n  --output results/full_benchmark.json\n```\n\nOptions:\n- `--vectors`: Vector counts to test (multiple values)\n- `--dimensions`: Vector dimensions to test\n- `--methods`: Methods to benchmark (s3_vectors, faiss, nmslib, bruteforce)\n- `--quick`: Quick test with fewer queries\n- `--dry-run`: Estimate time/resources without running\n\n### Testing S3 Vectors\n\nTest S3 Vectors connection and basic operations:\n\n```bash\nuv run python src/main.py test-s3 \\\n  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \\\n  --vectors 1000000 \\\n  --dimension 384\n```\n\n### Visualization\n\nGenerate charts from results:\n\n```bash\n# Use latest results file\nuv run python src/main.py visualize --latest\n\n# Specify results file\nuv run python src/main.py visualize \\\n  --results-file results/benchmark_results_20250102_120000.json \\\n  --output-dir results/charts\n```\n\n## Project Structure\n\n```\ns3-vectors-benchmark/\n├── README.md                 # This file\n├── pyproject.toml            # Modern Python package configuration (uv)\n├── requirements.txt          # Legacy pip requirements (for reference)\n├── setup.py                  # Package setup (legacy)\n├── config.yaml.example       # Configuration template\n├── .gitignore                # Git ignore rules\n├── .python-version           # Python version specification\n├── src/                      # Source code\n│   ├── __init__.py\n│   ├── main.py               # CLI entry point\n│   ├── config.py              # Configuration management\n│   ├── data_loader.py         # Dataset loading\n│   ├── embeddings.py          # Embedding generation\n│   ├── benchmark.py           # Benchmark orchestration\n│   ├── evaluate.py            # Accuracy evaluation\n│   ├── visualize.py           # Chart generation\n│   ├── utils.py               # Utility functions\n│   └── vector_dbs/            # Vector database implementations\n│       ├── base.py             # Abstract base class\n│       ├── s3_vectors.py       # S3 Vectors implementation\n│       ├── faiss_db.py         # FAISS implementation\n│       ├── nmslib_db.py        # NMSLib implementation\n│       └── bruteforce.py       # Brute-force baseline\n├── tests/                     # Test suite\n├── notebooks/                 # Jupyter notebooks\n├── docs/                      # Documentation\n├── data/                       # Datasets (git-ignored)\n└── results/                    # Results (git-ignored)\n```\n\n## Configuration\n\nEdit `config.yaml` to customize:\n\n### AWS Configuration\n```yaml\naws:\n  region: us-east-1\n  profile: default\n  bucket_name: your-vector-bucket-name\n```\n\n### S3 Vectors Configuration\n```yaml\ns3_vectors:\n  index_name: benchmark-index\n  metric_type: cosine  # or euclidean\n  batch_size: 500\n```\n\n### Benchmark Configuration\n```yaml\nbenchmark:\n  vector_counts: [10200, 100000, 500000, 1000000, 10000000]\n  dimensions: [384, 768, 1024]\n  topk: 5\n  num_queries: 100\n  num_repeats: 3\n```\n\n## Results Analysis\n\nAfter running benchmarks, results are saved to JSON files with:\n\n- **Raw measurements**: Query latency, result IDs, similarities\n- **Evaluation metrics**: Recall@K, Precision@K, aggregated statistics\n- **Metadata**: Configuration, timestamps, vector counts\n\n### Reading Results\n\nResults are stored in JSON format:\n\n```python\nimport json\n\nwith open(\"results/benchmark_results_20250102.json\", \"r\") as f:\n    results = json.load(f)\n\n# Access evaluation metrics\nevaluation = results[\"evaluation\"]\nfor config_key, metrics in evaluation.items():\n    print(f\"{config_key}:\")\n    print(f\"  Recall@K: {metrics['recall_at_k']['mean']:.3f}\")\n    print(f\"  Query time: {metrics['query_time_ms']['mean']:.2f} ms\")\n```\n\n### Charts\n\nCharts are automatically generated showing:\n1. **Processing Time Ratio**: Normalized query latency across methods\n2. **Search Accuracy**: Recall@K across vector counts\n3. **Processing Time (ms)**: S3 Vectors query latency\n\n## Troubleshooting\n\n### AWS Credentials Not Found\n\n```bash\n# Verify credentials\naws sts get-caller-identity\n\n# Or set environment variables\nexport AWS_ACCESS_KEY_ID=...\nexport AWS_SECRET_ACCESS_KEY=...\n```\n\n### S3 Bucket Not Found\n\nEnsure the bucket exists and is accessible:\n\n```bash\naws s3 ls s3://your-bucket-name\n```\n\n### Out of Memory\n\nFor large datasets:\n- Reduce `batch_size` in embeddings config\n- Process in smaller chunks\n- Use GPU for embedding generation\n\n### Dataset Download Fails\n\nSome datasets may require manual download. Check:\n- Network connectivity\n- Sufficient disk space\n- Dataset availability\n\n### uv Installation Issues\n\nIf `uv` is not found:\n- Ensure it's in your PATH\n- Restart your terminal after installation\n- Use `pip install uv` as fallback\n\n## Contributing\n\nContributions are welcome! Please:\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Add tests (run with `uv run pytest`)\n5. Submit a pull request\n\n## License\n\nMIT License - see LICENSE file for details\n\n## Citation\n\nIf you use this benchmark in your research, please cite:\n\n```bibtex\n@software{s3_vectors_benchmark,\n  title={S3 Vectors Benchmark},\n  author={Siddhant Khare},\n  year={2025},\n  url={https://github.com/siddhant-k-code/s3-vectors-benchmark}\n}\n```\n\n## Acknowledgments\n\n- UKBench dataset: http://vis.uky.edu/~stewe/ukbench/\n- Microsoft COCO dataset: https://cocodataset.org/\n- DINOv2 models: https://github.com/facebookresearch/dinov2\n- FAISS: https://github.com/facebookresearch/faiss\n- NMSLib: https://github.com/nmslib/nmslib\n\n## Support\n\nFor issues and questions:\n- Open an issue on GitHub\n- Check documentation in `docs/` directory\n- Review example notebooks in `notebooks/`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiddhant-k-code%2Fs3-vectors-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsiddhant-k-code%2Fs3-vectors-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiddhant-k-code%2Fs3-vectors-benchmark/lists"}