# Vector Search Benchmarks
This repo contains datasets for benchmarking vector search performance, to help Superlinked prioritize integration partners.
## Overview
We reviewed a number of publicly available datasets and noted three core problems; the table below shows how this dataset addresses each of them:

| Problems of other vector search benchmarks | How this dataset solves them |
|--------------------------------------------|------------------------------|
| Not enough metadata of various types, which makes it hard to test filter performance | 3 numeric, 1 categorical, 3 text, and 1 image column |
| Vectors are too small, while SOTA models usually output 2k+ or even 4k+ dimensions | 4,154 dimensions |
| Dataset is too small, especially when larger vectors are used | 100k, 1M, and 10M item variants, all sampled from the large dataset |
## Available Datasets
### Product data
The folders contain `parquet` files with the metadata and vectors.
| Dataset | Records | # Files | Size |
|----------------|------------|---------|---------|
| benchmark_10k | 10,000 | 100 | ~230 MB |
| benchmark_100k | 100,000 | 100 | ~2.3 GB |
| benchmark_1M | 1,000,000 | 100 | ~23 GB |
| benchmark_10M | 10,534,536 | 1000 | ~240 GB |
The structure of the files is the same throughout:
```
Schema([('parent_asin', String),      # the id
        ('main_category', String),
        ('title', String),
        ('average_rating', Float64),
        ('rating_number', Float64),
        ('description', String),
        ('price', Float64),
        ('categories', String),
        ('image_url', String),
        ('value', List(Float64))])    # the vectors
```
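To inspect a downloaded shard locally, here is a minimal sketch using `polars` (the library choice and the file name are assumptions, not part of the benchmark):
```python
# Minimal sketch: inspect one downloaded parquet shard with polars.
# The file name below is illustrative; use any shard you downloaded.
import polars as pl

df = pl.read_parquet("./your/local/data/folder/part-0000.parquet")
print(df.schema)                                   # should match the schema above
print(df.select("parent_asin", "title", "price").head())
```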
### Queries
Some smaller dataset versions have a query set guaranteed to contain only parent_asins from the corresponding dataset version.
These smaller query sets exist for testing setups where only a smaller dataset variant has been ingested.
The actual query structure can be seen in the [query](superlinked_app/query.py) file.
The file structure is
```
{
query_id: {
product_id: str | None, # parent_asin - get that value from the database and search with it
rating_max: int | None, # filter for product.average_rating <= rating_max
rating_num_min: int | None, # filter product.rating_number >= rating_num_min
main_category: str | None, # filter for product.main_category == main_category
},
...
}
```
| Dataset | Queries |
|-----------------------|---------|
| query-params-100k | 15 |
| query-params-1M | 117 |
| query-params-10M | 1,000 |
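As an illustration, loading and iterating over one of the query-parameter files might look like this (a sketch; only the JSON structure above is prescribed):
```python
# Sketch: iterate over the query parameters downloaded above.
import json

with open("query-params-100k.json") as f:
    queries = json.load(f)

for query_id, params in queries.items():
    # "product_id" identifies the anchor vector to fetch from your DB;
    # the remaining keys are optional filter bounds (None means "no filter").
    print(query_id, params["product_id"], params["main_category"])
```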
### Result set
Query results are stored in `ranked-results.json`.
The structure is
```
{
query_id: [ordered list of result parent_asins],
...
}
```
NOTE: The ground truth results assume that all products have been ingested into the database!
## Data Access
Datasets are available in several ways:
1. You can use `gsutil` to download the dataset folders (plain HTTPS downloads work best only for individual files):
```bash
# Download benchmark datasets
gsutil cp -r "gs://superlinked-benchmarks-external/amazon-products-images/benchmark-10k/**" ./your/local/data/folder/
gsutil cp -r "gs://superlinked-benchmarks-external/amazon-products-images/benchmark-100k/**" ./your/local/data/folder/
gsutil cp -r "gs://superlinked-benchmarks-external/amazon-products-images/benchmark-1M/**" ./your/local/data/folder/
gsutil cp -r "gs://superlinked-benchmarks-external/amazon-products-images/benchmark-10M/**" ./your/local/data/folder/
```
Since the query sets are individual files, a simple HTTPS download works fine:
```bash
# Download queries
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-100k.json
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-1M.json
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/query-params-10M.json
```
The same is true for the results:
```bash
# Download the ground truth query results
wget https://storage.googleapis.com/superlinked-benchmarks-external/amazon-products-images/ranked-results.json
```
but `gsutil` works fine for these as well (you can infer the `gs://` path from the URLs above). For `ranked-results.json`:
```bash
gsutil cp "gs://superlinked-benchmarks-external/amazon-products-images/ranked-results.json" ./your/local/data/folder/
```
2. Using Hugging Face Datasets
The product data is available using [HF Datasets](https://huggingface.co/docs/datasets/en/index).
```python
from datasets import load_dataset
benchmark_10k = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-10k")
benchmark_100k = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-100k")
benchmark_1M = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-1M")
benchmark_10M = load_dataset("superlinked/external-benchmarking", data_dir="benchmark-10M")
```
For query and result data, please use one of the above methods (gsutil or direct download).
## Dataset Production
### Source Data
- **Origin**: [Amazon Reviews 2023 dataset](https://amazon-reviews-2023.github.io/)
- **Categories**: `["Books", "Automotive", "Tools and Home Improvement", "All Beauty", "Electronics", "Software", "Health and Household"]`
### Embeddings
The embeddings are created via a [superlinked config](superlinked_app). The resulting 4,154-dimensional vector is the concatenation of:
- 1 categorical embedding,
- 3 number embeddings,
- 3 text embeddings (`Qwen/Qwen3-Embedding-0.6B`),
- and 1 image embedding (`laion/CLIP-ViT-H-14-laion2B-s32B-b79K`).

The float precision used throughout is fp16 (half precision).
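A quick sanity check on the concatenated vectors might look like this (a sketch assuming `polars` and a locally downloaded shard; the file path is illustrative):
```python
# Sketch: verify that each concatenated vector has 4,154 dimensions.
import polars as pl

df = pl.read_parquet("benchmark-10k/part-0000.parquet")  # illustrative path
dims = df["value"].list.len().unique().to_list()
assert dims == [4154], f"unexpected vector lengths: {dims}"
```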
## Running Benchmarks
For the `benchmark_10M` setup, produce the following set of measurements, i.e. fill in the 'TBD' cells:
| # | Write | Target | Observed | Read | Target | Observed |
|---|-------|--------|----------|------|--------|----------|
| 1 | Create index from scratch | < 2 hrs | TBD | - | - | - |
| 2 | - | - | - | 20 QPS at 0.001% filter selectivity | 100 ms @ p95 | TBD |
| 3 | - | - | - | 20 QPS at 0.1% filter selectivity | 100 ms @ p95 | TBD |
| 4 | - | - | - | 20 QPS at 1% filter selectivity | 100 ms @ p95 | TBD |
| 5 | - | - | - | 20 QPS at 10% filter selectivity | 100 ms @ p95 | TBD |
| 6 | 20 QPS for single-object updates (incl. embedding) | 2 s @ p95 | TBD | 20 QPS at 1% filter selectivity | 100 ms @ p95 | TBD |
| 7 | 200 QPS for single-object updates (incl. embedding) | 2 s @ p95 | TBD | 20 QPS at 1% filter selectivity | 100 ms @ p95 | TBD |
Formulate the queries like this:
1. **Vector Similarity**: Each query should use `dot product` similarity scoring against a vector that you fetch from the DB.
The vector is specified in query_params under the `product_id` key.
2. **Filters**: To get the target filter selectivity, use the filters specified in the `query_params` files (a sketch of assembling such a query follows the selectivity table below).
3. **Results details**: Add `LIMIT 100` to all queries and only retrieve `parent_asin` for each record to minimize networking overhead.
4. **Vector Search Recall**: We expect that you can tune your system to a >90% average hit rate for the ANN index, and that you run the above tests with such tuning.
|Selectivity| Predicate |
|-|---------------------------------------------------------------------------------|
|0.001%| `average_rating <= 3.0 and rating_number >= 130 and main_category == 'Computers'` |
|0.1%| `average_rating <= 3.5 and rating_number >= 30 and main_category == 'Computers'` |
|1%| `rating_number >= 45 and main_category == 'Computers'` |
|10%| `average_rating <= 3.5 and rating_number >= 1` |
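Putting the four points together, assembling one benchmark query might look like the sketch below (`db.get_vector` and `db.search` are hypothetical stand-ins for your engine's API; only the query shape is prescribed):
```python
# Sketch: build one benchmark query from a query_params entry.
# db.get_vector / db.search are hypothetical placeholders, not a real API.
def build_query(db, params):
    anchor = db.get_vector(params["product_id"])      # grab the stored vector
    filters = []
    if params["rating_max"] is not None:
        filters.append(("average_rating", "<=", params["rating_max"]))
    if params["rating_num_min"] is not None:
        filters.append(("rating_number", ">=", params["rating_num_min"]))
    if params["main_category"] is not None:
        filters.append(("main_category", "==", params["main_category"]))
    # dot-product similarity, top 100, return only parent_asin
    return db.search(vector=anchor, metric="dot_product",
                     filters=filters, limit=100, fields=["parent_asin"])
```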
## Query result quality evaluation
You are welcome to use the `calculate_hit_rates` function in [eval.py](eval.py).
It expects the predicted results in the same format as the ground truth result set.
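Hypothetical usage (the exact signature lives in [eval.py](eval.py); the argument order below is an assumption):
```python
# Sketch: score predictions against the ground truth with calculate_hit_rates.
import json

from eval import calculate_hit_rates  # signature assumed; see eval.py

with open("ranked-results.json") as f:
    ground_truth = json.load(f)
with open("my-predictions.json") as f:   # same {query_id: [parent_asins]} shape
    predictions = json.load(f)

print(calculate_hit_rates(predictions, ground_truth))
```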
## Pricing
To enable us to compare different vendors, we consider the above dataset size and performance targets to be one "unit" of vector search, for which we would like to know:
1. What are the vector search vendor parameters of the cloud instance that can support this "unit"?
2. What is the price-per-GB-month for this instance, assuming a sustained average workload as described by the targets above?
3. How does the price scale with (a) 2x the size, (b) 2x the read QPS, and (c) 2x the write QPS?
## License
This dataset is derived from the Amazon Reviews 2023 dataset. Please refer to the [original dataset's license](https://amazon-reviews-2023.github.io/) for usage terms.