{"id":37637300,"url":"https://github.com/superlinked/external-benchmarks","last_synced_at":"2026-01-16T11:10:50.739Z","repository":{"id":309468543,"uuid":"1035198521","full_name":"superlinked/external-benchmarks","owner":"superlinked","description":"Code required for preparing benchmarking datasets used by Superlinked and our database partners.","archived":false,"fork":false,"pushed_at":"2025-09-19T09:56:45.000Z","size":11706,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-19T11:47:27.644Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/superlinked.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-09T21:37:05.000Z","updated_at":"2025-09-19T09:56:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"f024a285-6495-46f3-beb5-cfe207e878f6","html_url":"https://github.com/superlinked/external-benchmarks","commit_stats":null,"previous_names":["superlinked/external-benchmarks"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/superlinked/external-benchmarks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinked%2Fexternal-benchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinked%2Fexternal-benchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/superlinked%2Fexternal-benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinked%2Fexternal-benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/superlinked","download_url":"https://codeload.github.com/superlinked/external-benchmarks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinked%2Fexternal-benchmarks/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478129,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T06:30:42.265Z","status":"ssl_error","status_checked_at":"2026-01-16T06:30:16.248Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-16T11:10:50.554Z","updated_at":"2026-01-16T11:10:50.722Z","avatar_url":"https://github.com/superlinked.png","language":"Python","readme":"# Vector Search Benchmarks\n\nThis repo contains datasets for benchmarking vector search performance, to help Superlinked prioritize integration partners.\n\n## Overview\n\nWe reviewed a number of publicly available datasets and identified three core problems; the table below shows how this dataset addresses each:\n\n|Problems of other vector search benchmarks| How this dataset solves it                                         
|\n|-|--------------------------------------------------------------------|\n|Not enough metadata of various types, which makes it hard to test filter performance| 3 number, 1 categorical, 3 text, 1 image column                    |\n|Vectors too small, while SOTA models usually output 2k+ or even 4k+ dims| 4154 dims                                                          |\n|Dataset too small, especially if larger vectors are used| 100k, 1M and 10M item variants, all sampled from the large dataset |\n\n## Available Datasets\n\n### Product data\n\nThe folders contain `parquet` files with the metadata and vectors.\n\n| Dataset        | Records    | # Files | Size    |\n|----------------|------------|---------|---------|\n| benchmark_10k  | 10,000     | 100     | ~230 MB |\n| benchmark_100k | 100,000    | 100     | ~2.3 GB |\n| benchmark_1M   | 1,000,000  | 100     | ~23 GB  |\n| benchmark_10M  | 10,534,536 | 1000    | ~240 GB |\n\nThe structure of the files is the same throughout:\n\n```\nSchema([('parent_asin', String), # the id\n        ('main_category', String),\n        ('title', String),\n        ('average_rating', Float64),\n        ('rating_number', Float64),\n        ('description', String),\n        ('price', Float64),\n        ('categories', String),\n        ('image_url', String),\n        ('value', List(Float64))]) # the vectors\n```\n\n### Queries\n\nSome of the smaller dataset versions have a query set that is guaranteed to contain only parent_asins present in the corresponding dataset version.\nThese smaller query sets are intended for testing when only a smaller dataset has been ingested. 
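The guarantee above can be verified with a small sketch. The `queries_covered` helper below is hypothetical (not part of this repo) and assumes the query-params format described later in this README, plus a set of the dataset version's parent_asins:

```python
def queries_covered(query_params: dict, dataset_asins: set[str]) -> bool:
    """Hypothetical check: every product_id referenced by a query set
    must be present in the given dataset version (None means no vector
    lookup for that query)."""
    return all(
        q.get("product_id") is None or q["product_id"] in dataset_asins
        for q in query_params.values()
    )

# Toy data in the query-params format described below.
params = {"0": {"product_id": "A1", "rating_max": 4,
                "rating_num_min": None, "main_category": None}}
assert queries_covered(params, {"A1", "A2"})
assert not queries_covered(params, {"A2"})
```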
\nThe actual query structure can be seen in the [query](superlinked_app/query.py) file.\nThe file structure is:\n```\n{\n    query_id: {\n        product_id: str | None,       # parent_asin - fetch that record's vector from the database and search with it\n        rating_max: int | None,       # filter for product.average_rating \u003c= rating_max\n        rating_num_min: int | None,   # filter for product.rating_number \u003e= rating_num_min\n        main_category: str | None,    # filter for product.main_category == main_category\n    },\n    ...\n}\n```\n\n| Dataset               | Queries |\n|-----------------------|---------|\n| query-params-100k     | 15      |\n| query-params-1M       | 117     |\n| query-params-10M      | 1,000   |\n\n### Result set\n\nQuery results are stored in `ranked-results.json`. \nThe structure is:\n\n```\n{\n    query_id: [ordered list of result parent_asins],\n    ...\n}\n```\n\nNOTE: The ground-truth results assume that all products have been ingested into the database!\n\n## Data Access\n\nDatasets are available in multiple ways:\n\n1. 
You can use gsutil to download the datasets (HTTPS downloads, shown below, work best for individual files):\n```bash\n# Download benchmark datasets\ngsutil cp -r \"gs://superlinked-benchmarks-external/amazon-products-images/benchmark-10k/**\" ./your/local/data/folder/\ngsutil cp -r \"gs://superlinked-benchmarks-external/amazon-products-images/benchmark-100k/**\" ./your/local/data/folder/\ngsutil cp -r \"gs://superlinked-benchmarks-external/amazon-products-images/benchmark-1M/**\" ./your/local/data/folder/\ngsutil cp -r \"gs://superlinked-benchmarks-external/amazon-products-images/benchmark-10M/**\" ./your/local/data/folder/\n```\nAs the query sets are individual files, even a simple HTTPS download works fine:\n```bash\n# Download queries\nwget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-100k.json\nwget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-1M.json\nwget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-10M.json\n```\nThe same is true for the results:\n```bash\n# Download the ground truth query results\nwget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/ranked-results.json\n```\nbut gsutil works fine for these as well (you can infer the path from the URLs). For `ranked-results.json`:\n```bash\ngsutil cp \"gs://superlinked-benchmarks-external/amazon-products-images/ranked-results.json\" ./your/local/data/folder/\n```\n2. 
Using Hugging Face Datasets\n\nThe product data is available via [HF Datasets](https://huggingface.co/docs/datasets/en/index).\n\n```python\nfrom datasets import load_dataset\n\nbenchmark_10k = load_dataset(\"superlinked/external-benchmarking\", data_dir=\"benchmark-10k\")\nbenchmark_100k = load_dataset(\"superlinked/external-benchmarking\", data_dir=\"benchmark-100k\")\nbenchmark_1M = load_dataset(\"superlinked/external-benchmarking\", data_dir=\"benchmark-1M\")\nbenchmark_10M = load_dataset(\"superlinked/external-benchmarking\", data_dir=\"benchmark-10M\")\n```\n\nFor query and result data, please use one of the above methods (gsutil or direct download).\n\n## Dataset Production\n\n### Source Data\n- **Origin**: [Amazon Reviews 2023 dataset](https://amazon-reviews-2023.github.io/)\n- **Categories**: `[\"Books\", \"Automotive\", \"Tools and Home Improvement\", \"All Beauty\", \"Electronics\", \"Software\", \"Health and Household\"]`\n\n### Embeddings\n\nThe embeddings are created via a [superlinked config](superlinked_app). The resulting 4154-dim vector is the concatenation of:\n- 1 categorical,\n- 3 number,\n- 3 text (`Qwen/Qwen3-Embedding-0.6B`),\n- and 1 image (`laion/CLIP-ViT-H-14-laion2B-s32B-b79K`)\n\nembeddings.\n\nThe float precision used throughout is fp16, or half-precision.\n\n## Running Benchmarks\n\nFor the `benchmark_10M` setup, produce the following set of measurements, i.e. fill in the 'TBD' cells:\n\n| # | Write | Target | Observed | Read | Target | Observed |\n|-|-|-|-|-|-|-|\n|1|Create Index from scratch | \u003c 2hrs |TBD|-|-|-|\n|2|- | - |-|20 QPS of 0.001% filter selectivity| 100ms @ p95 | TBD |\n|3|- | - |-|20 QPS of 0.1% filter selectivity| 100ms @ p95 | TBD |\n|4|- | - |-|20 QPS of 1% filter selectivity| 100ms @ p95 | TBD |\n|5|- | - |-|20 QPS of 10% filter selectivity| 100ms @ p95 | TBD |\n|6|20 QPS for single-object updates (incl. 
embedding)| 2s @ p95 | TBD |20 QPS of 1% filter selectivity| 100ms @ p95 | TBD |\n|7|200 QPS for single-object updates (incl. embedding)| 2s @ p95 | TBD |20 QPS of 1% filter selectivity| 100ms @ p95 | TBD |\n\nFormulate the queries like this:\n1. **Vector Similarity**: Each query should score results by `dot product` similarity against a vector that you fetch from the database. \nThe record whose vector to use is specified in `query_params` under the `product_id` key.\n2. **Filters**: To get the target filter selectivity, please use the filters specified in the `query_params` files.\n3. **Result details**: Add `LIMIT 100` to all queries and retrieve only `parent_asin` for each record to minimize networking overhead.\n4. **Vector Search Recall**: We expect that your system can be tuned to a \u003e90% average hit rate for the ANN index, and that you run the above tests with that tuning.\n\n|Selectivity| Predicate                                                                       |\n|-|---------------------------------------------------------------------------------|\n|0.001%| `average_rating \u003c= 3.0 and rating_number \u003e= 130 and main_category == 'Computers'` |\n|0.1%| `average_rating \u003c= 3.5 and rating_number \u003e= 30 and main_category == 'Computers'` |\n|1%| `rating_number \u003e= 45 and main_category == 'Computers'`                        |\n|10%| `average_rating \u003c= 3.5 and rating_number \u003e= 1`                                   |\n\n## Query result quality evaluation\n\nYou are welcome to use the `calculate_hit_rates` function in [eval.py](eval.py).\nIt expects the predicted results in the same format as the provided ground-truth result set.\n\n## Pricing\n\nTo enable us to compare different vendors, we treat the above dataset size and performance targets as a \"unit\" of vector search, for which we would like to know:\n1. What are the vector search vendor parameters of the cloud instance that can support this \"unit\"?\n2. 
What is the price-per-GB-month for this instance, assuming a sustained average workload as described by the targets above?\n3. How does the price scale with (a) 2x the size, (b) 2x the read QPS, and (c) 2x the write QPS?\n\n## License\n\nThis dataset is derived from the Amazon Reviews 2023 dataset. Please refer to the [original dataset's license](https://amazon-reviews-2023.github.io/) for usage terms.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperlinked%2Fexternal-benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsuperlinked%2Fexternal-benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperlinked%2Fexternal-benchmarks/lists"}
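For readers without the repo checked out, the hit-rate evaluation mentioned in the README above can be sketched as follows. This is an illustrative reimplementation, not the actual `calculate_hit_rates` from eval.py; it assumes both inputs use the documented `{query_id: [ordered parent_asins]}` format:

```python
def hit_rate(predicted: dict[str, list[str]],
             ground_truth: dict[str, list[str]]) -> float:
    """Average fraction of ground-truth parent_asins recovered per query.

    Sketch only; the repo's eval.py is authoritative. Both inputs map
    query_id -> ordered list of parent_asins; order is ignored here.
    """
    rates = []
    for query_id, truth in ground_truth.items():
        if not truth:
            continue
        hits = len(set(predicted.get(query_id, [])) & set(truth))
        rates.append(hits / len(truth))
    return sum(rates) / len(rates) if rates else 0.0

# Toy example: query "q1" recovers 2 of 3 ground-truth items.
gt = {"q1": ["a", "b", "c"]}
pred = {"q1": ["a", "c", "x"]}
print(hit_rate(pred, gt))  # 2/3
```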