{"id":30899356,"url":"https://github.com/google-deepmind/limit","last_synced_at":"2025-09-09T03:09:25.815Z","repository":{"id":312748798,"uuid":"1046615317","full_name":"google-deepmind/limit","owner":"google-deepmind","description":"On the Theoretical Limitations of Embedding-Based Retrieval","archived":false,"fork":false,"pushed_at":"2025-09-01T18:09:03.000Z","size":6753,"stargazers_count":380,"open_issues_count":1,"forks_count":22,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-01T20:23:02.743Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2508.21038","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-deepmind.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-29T00:37:49.000Z","updated_at":"2025-09-01T18:19:45.000Z","dependencies_parsed_at":"2025-09-01T20:33:18.917Z","dependency_job_id":null,"html_url":"https://github.com/google-deepmind/limit","commit_stats":null,"previous_names":["google-deepmind/limit"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/google-deepmind/limit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Flimit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Flimit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Flimit/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Flimit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-deepmind","download_url":"https://codeload.github.com/google-deepmind/limit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Flimit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274238016,"owners_count":25247101,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-09T02:00:10.223Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-09T03:09:22.289Z","updated_at":"2025-09-09T03:09:25.802Z","avatar_url":"https://github.com/google-deepmind.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# On the Theoretical Limitations of Embedding-based Retrieval\n\nThis repository contains the official resources for the paper \"[On the Theoretical Limitations of Embedding-based Retrieval](https://arxiv.org/abs/2508.21038)\".\nThis work introduces the **LIMIT** dataset,\ndesigned to stress-test embedding models based on theoretical principles.\nWe show that for any given embedding dimension `d`,\nthere exists a combination of documents that cannot be returned by any query.\nWe use this theory to instantiate the dataset 
LIMIT,\nfinding that even state-of-the-art models struggle, highlighting a fundamental\nlimitation of the current single-vector embedding paradigm.\n\n![LIMIT Dataset Concept](assets/LIMIT.png)\n\n## Overview\n\n* [Data](#data)\n* [Code](#code)\n* [Evaluation](#evaluation)\n* [Citation](#citation)\n* [License and disclaimer](#license-and-disclaimer)\n\n## Data\n\nThe datasets used in our experiments are available in the `data/` directory of this repository, formatted in [MTEB](https://github.com/embeddings-benchmark/mteb) style (i.e. JSON lines).\n\nEach dataset contains:\n\n- A `queries.json` file containing a line for each of the 1000 queries, each with an `_id` and a `text` field.\n- A `corpus.json` file containing a line for each of the 50k documents (or 46 if using the `small` version), each with an `_id`, a `text` field, and an empty `title` field.\n- A `qrels.json` file containing a row for each of the 2000 relevant query-\u003edoc mappings, mapping the `query-id` of a query to the `corpus-id` of the relevant document, with `score` indicating relevance.\n\n* **Full Dataset (`limit`):** The complete dataset, containing 50k documents.\n  * [Link to `data/limit`](./data/limit)\n\n* **Small Sample (`limit-small`):** A smaller version with only the 46 documents relevant to the queries.\n  * [Link to `data/limit-small`](./data/limit-small)\n\n## Code\n\nWe provide code to generate LIMIT-style datasets,\nas well as to run the free embedding experiment, in the `code/` folder.\n\n* **Dataset Generation:** To generate the dataset from scratch, you can use the Jupyter notebook located at `code/generate_limit_dataset.ipynb`. 
This contains all necessary steps and dependencies.\n  * [Link to `code/generate_limit_dataset.ipynb`](./code/generate_limit_dataset.ipynb)\n\n* **Free Embedding Experiments:** The script to run the free embedding experiments can be found in `code/free_embedding_experiment.py`.\n  * [Link to `code/free_embedding_experiment.py`](./code/free_embedding_experiment.py)\n\nIf you use the free embedding code,\nyou'll need to install the following requirements.\n\n### Installation\n\nWe recommend using the [`uv` package manager](https://docs.astral.sh/uv/getting-started/installation/).\n\n```bash\n# Create a virtual environment\nuv venv\nsource .venv/bin/activate\n\n# Install dependencies\nuv pip install -r https://raw.githubusercontent.com/google-deepmind/limit/refs/heads/main/code/requirements.txt\n```\n\n## Loading with Huggingface Datasets\nYou can also load the data using the `datasets` library from Huggingface ([LIMIT](https://huggingface.co/datasets/orionweller/LIMIT), [LIMIT-small](https://huggingface.co/datasets/orionweller/LIMIT-small)):\n```python\nfrom datasets import load_dataset\nds = load_dataset(\"orionweller/LIMIT-small\", \"corpus\") # also available: queries, test (contains qrels).\n```\n\n## Evaluation\n\nEvaluation was done using the [MTEB framework](https://github.com/embeddings-benchmark/mteb) on the [v2.0.0 branch](https://github.com/embeddings-benchmark/mteb/tree/v2.0.0) (soon to be `main`). 
An example is:\n\n```python\nimport mteb\nfrom sentence_transformers import SentenceTransformer\n\nmodel_name = \"sentence-transformers/all-MiniLM-L6-v2\"\n\n# load the model using MTEB\nmodel = mteb.get_model(model_name) # will default to SentenceTransformer(model_name) if not implemented in MTEB\n# or load it directly with Sentence Transformers\nmodel = SentenceTransformer(model_name)\n\n# select the desired tasks and evaluate\ntasks = mteb.get_tasks(tasks=[\"LIMITSmallRetrieval\"]) # or use LIMITRetrieval for the full dataset\nresults = mteb.evaluate(model, tasks=tasks)\n```\n\nPlease see the MTEB GitHub repository for more details.\n\n## Citation\n\nIf you use this work, please cite the paper as:\n\n```bibtex\n@article{weller2025theoretical,\n  title={On the Theoretical Limitations of Embedding-Based Retrieval},\n  author={Weller, Orion and Boratko, Michael and Naim, Iftekhar and Lee, Jinhyuk},\n  journal={arXiv preprint arXiv:2508.21038},\n  year={2025}\n}\n```\n\n## License and disclaimer\n\nCopyright 2025 Google LLC\n\nAll software is licensed under the Apache License, Version 2.0 (Apache 2.0);\nyou may not use this file except in compliance with the Apache 2.0 license.\nYou may obtain a copy of the Apache 2.0 license at:\nhttps://www.apache.org/licenses/LICENSE-2.0\n\nAll other materials are licensed under the Creative Commons Attribution 4.0\nInternational License (CC-BY). You may obtain a copy of the CC-BY license at:\nhttps://creativecommons.org/licenses/by/4.0/legalcode\n\nUnless required by applicable law or agreed to in writing, all software and\nmaterials distributed here under the Apache 2.0 or CC-BY licenses are\ndistributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,\neither express or implied. 
See the licenses for the specific language governing\npermissions and limitations under those licenses.\n\nThis is not an official Google product.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-deepmind%2Flimit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-deepmind%2Flimit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-deepmind%2Flimit/lists"}