{"id":28493347,"url":"https://github.com/qdrant/ann-filtering-benchmark-datasets","last_synced_at":"2025-09-09T21:52:01.308Z","repository":{"id":104790851,"uuid":"495427890","full_name":"qdrant/ann-filtering-benchmark-datasets","owner":"qdrant","description":"Collection of datasets for benchmarking filtered vector similarity retrieval","archived":false,"fork":false,"pushed_at":"2025-06-05T08:40:53.000Z","size":269,"stargazers_count":43,"open_issues_count":2,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-06-05T09:26:06.505Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://qdrant.tech/benchmarks/#filtered-search-benchmark","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qdrant.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-05-23T13:42:18.000Z","updated_at":"2025-06-05T08:40:55.000Z","dependencies_parsed_at":"2023-11-19T14:24:24.932Z","dependency_job_id":"a015cc1a-7e32-4b30-9399-20862747b23e","html_url":"https://github.com/qdrant/ann-filtering-benchmark-datasets","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/qdrant/ann-filtering-benchmark-datasets","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fann-filtering-benchmark-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fann-filtering-benchmark-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fann-fi
ltering-benchmark-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fann-filtering-benchmark-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qdrant","download_url":"https://codeload.github.com/qdrant/ann-filtering-benchmark-datasets/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2Fann-filtering-benchmark-datasets/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264200705,"owners_count":23571825,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-08T09:08:30.627Z","updated_at":"2025-07-08T05:30:32.753Z","avatar_url":"https://github.com/qdrant.png","language":"Python","funding_links":[],"categories":["Benchmarks \u0026 Evaluation"],"sub_categories":[],"readme":"# ANN Filtered Retrieval Datasets\n\nThis repo contains a collection of datasets, inspired by [ann-benchmarks](https://github.com/erikbern/ann-benchmarks) for searching for similar vectors with additional filtering conditions.\n\n## Motivation\n\nMore and more applications are now using vector similarity search in their products.\nThe task of approximate nearest neighbor (ANN) search has gone beyond the scope of academic research and the narrow circle of huge IT corporations. 
\n\nIn this regard, the issue of supplementing vector search with application business logic is becoming more and more relevant.\n\n## Examples and cases\n\nIt is no longer enough to simply search for similar dishes by photo; you need to search only among restaurants that are within the delivery area.\n\nIt is not enough to search for all items similar by description; you also need to consider price ranges, stock availability, etc.\n\nIt's not enough to find candidates for a job position based on similar skills; you also have to consider location, level of spoken language, and seniority.\n\nYou name it.\n\n## Is it that different?\n\nClassical approaches to ANN, and their implementations in many libraries, were usually tuned for benchmarks, where the search speed among all vectors is the only comparison criterion.\n\nBecause of this, they had to sacrifice many functions that are useful in other situations: the ability to quickly delete, insert, and modify stored values, as well as storing metadata and filtering on it.\n\n## Data\n\n| description                      | Num vectors | dim  | distance | filters               | link                                                                                            |\n|----------------------------------|-------------|------|----------|-----------------------|-------------------------------------------------------------------------------------------------|\n| all-MiniLM-L6-v2 ArXiv titles    | 2 138 591   | 384  | Cosine   | match keyword / range | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/arxiv.tar.gz)             | \n| Efficientnet encoded H\u0026M Clothes | 105 100     | 2048 | Cosine   | match keyword         | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/hnm.tgz)                  |\n| LAION Sample encoded with CLIP   | 100 000     | 512  | Cosine   | range                 | 
[link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/laion-small-clip.tgz)     | \n| Random vectors \\ random payload  | 1 000 000   | 100  | Cosine   | match keyword         | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_keywords_1m.tgz)   |\n| Random vectors \\ random payload  | 1 000 000   | 100  | Cosine   | match int             | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_ints_1m.tgz)       |\n| Random vectors \\ random payload  | 1 000 000   | 100  | Cosine   | range                 | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_float_1m.tgz)      |\n| Random vectors \\ random payload  | 1 000 000   | 100  | Cosine   | geo-radius            | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_geo_1m.tgz)        |\n| Random vectors \\ random payload  | 100 000     | 2048 | Cosine   | match keyword         | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_keywords_100k.tgz) |\n| Random vectors \\ random payload  | 100 000     | 2048 | Cosine   | match int             | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_ints_100k.tgz)     |\n| Random vectors \\ random payload  | 100 000     | 2048 | Cosine   | range                 | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_float_100k.tgz)    |\n| Random vectors \\ random payload  | 100 000     | 2048 | Cosine   | geo-radius            | [link](https://storage.googleapis.com/ann-filtered-benchmark/datasets/random_geo_100k.tgz)      |\n\n### Data Format\n\nEach dataset consists of the following files:\n\n* `vectors.npy` - NumPy matrix of vectors with shape `num_vectors x dim`\n* `payloads.jsonl` - payload values associated with the vectors. The number of lines equals `num_vectors`\n* `tests.jsonl` - collection of queries with filtering conditions and expected results. 
Contains fields:\n  * `query` - vector to be used for similarity search\n  * `conditions` - filtering conditions of 3 possible types: `match`, `range`, and `geo`\n  * `closest_ids` - IDs of the records expected to be found for the given query\n  * `closest_scores` - similarity scores of the corresponding IDs\n\n### Example queries\n\n```\n{\n  \"query\": [-0.034, -0.185, -0.21, ...],\n  \"conditions\": {\n    \"and\": [\n      {\n        \"department_name\": {\n          \"match\": {\n            \"value\": \"Divided Shoes\"\n          }\n        }\n      }\n    ]\n  },\n  \"closest_ids\": [565, 15631, 100747, ....],\n  \"closest_scores\": [0.734, 0.698, 0.697, 0.689, ...]\n}\n\n```\n\n### Sources\n\n* Random data generator - [script](./generators/random_data)\n* Image data - [kaggle](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations)\n* Image embeddings generator - [colab](https://colab.research.google.com/drive/1u5-gZjPzfDP50c7LQztlVd78kGPyTAb1?usp=sharing)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqdrant%2Fann-filtering-benchmark-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqdrant%2Fann-filtering-benchmark-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqdrant%2Fann-filtering-benchmark-datasets/lists"}