{"id":15175440,"url":"https://github.com/cohere-ai/diskvectorindex","last_synced_at":"2025-04-07T05:10:29.058Z","repository":{"id":246790920,"uuid":"823299258","full_name":"cohere-ai/DiskVectorIndex","owner":"cohere-ai","description":null,"archived":false,"fork":false,"pushed_at":"2024-07-03T09:00:37.000Z","size":13,"stargazers_count":208,"open_issues_count":3,"forks_count":11,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-03-30T21:07:30.765Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cohere-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-02T19:08:30.000Z","updated_at":"2025-03-17T11:29:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"c924fd43-c24c-4bf5-8dbc-dba79325a0b5","html_url":"https://github.com/cohere-ai/DiskVectorIndex","commit_stats":null,"previous_names":["cohere-ai/diskvectorindex"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cohere-ai%2FDiskVectorIndex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cohere-ai%2FDiskVectorIndex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cohere-ai%2FDiskVectorIndex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cohere-ai%2FDiskVectorIndex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cohere-ai","download_url":"https://codeload.gi
thub.com/cohere-ai/DiskVectorIndex/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247595335,"owners_count":20963943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-27T12:37:45.103Z","updated_at":"2025-04-07T05:10:29.036Z","avatar_url":"https://github.com/cohere-ai.png","language":"Python","readme":"# DiskVectorIndex - Ultra-Low Memory Vector Search on Large Datasets\r\n\r\nIndexing large datasets (100M+ embeddings) requires a lot of memory in most vector databases: for 100M documents/embeddings, most vector databases need about **500GB of memory**, driving your server costs accordingly high.\r\n\r\nThis repository offers methods to search very large datasets (100M+) with just **300MB of memory**, making semantic search on such datasets accessible to memory-poor developers.\r\n\r\nWe provide various pre-built indices that can be used for semantic search and to power your RAG applications.\r\n\r\n## Pre-Built Indices\r\n\r\nBelow you find the different pre-built indices. The embeddings are downloaded on the first call; the size is specified under Index Size. Most of the embeddings are memory-mapped from disk, e.g. 
for the `Cohere/trec-rag-2024-index` corpus you need 15 GB of disk, but just 380 MB of memory to load the index.\r\n\r\n| Name | Description | #Docs | Index Size | Memory Needed |\r\n| --- | --- | :---: | :---: | :---: |\r\n| [Cohere/trec-rag-2024-index](https://huggingface.co/datasets/Cohere/trec-rag-2024-index) | Segmented corpus for [TREC RAG 2024](https://trec-rag.github.io/annoucements/2024-corpus-finalization/) | 113,520,750 | 15GB | 380MB |\r\n| fineweb-edu-10B-index (soon) | 10B token sample from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) embedded and indexed at the document level. | 9,267,429 | 1.4GB | 230MB |\r\n| fineweb-edu-100B-index (soon) | 100B token sample from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) embedded and indexed at the document level. | 69,672,066 | 9.2GB | 380MB |\r\n| fineweb-edu-350B-index (soon) | 350B token sample from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) embedded and indexed at the document level. | 160,198,578 | 21GB | 380MB |\r\n| fineweb-edu-index (soon) | Full 1.3T token dataset [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) embedded and indexed at the document level. | 324,322,256 | 42GB | 285MB |\r\n\r\nEach index comes with its respective corpus, which is chunked into smaller parts. These chunks are downloaded on demand and reused for further queries.\r\n\r\n## Getting Started\r\n\r\nGet your free **Cohere API key** from [cohere.com](https://cohere.com). 
You must set this API key as an environment variable:\r\n```\r\nexport COHERE_API_KEY=your_api_key\r\n```\r\n\r\nInstall the package:\r\n```\r\npip install DiskVectorIndex\r\n```\r\n\r\nYou can then search via:\r\n```python\r\nfrom DiskVectorIndex import DiskVectorIndex\r\n\r\nindex = DiskVectorIndex(\"Cohere/trec-rag-2024-index\")\r\n\r\nwhile True:\r\n    query = input(\"\\n\\nEnter a question: \")\r\n    docs = index.search(query, top_k=3)\r\n    for doc in docs:\r\n        print(doc)\r\n        print(\"=========\")\r\n```\r\n\r\nYou can also load a fully downloaded index from disk via:\r\n```python\r\nfrom DiskVectorIndex import DiskVectorIndex\r\n\r\nindex = DiskVectorIndex(\"path/to/index\")\r\n```\r\n\r\n## End2End RAG Example\r\n\r\nWe can use the excellent RAG capabilities of the [Cohere Command R+](https://docs.cohere.com/docs/retrieval-augmented-generation-rag) model to build an end-to-end RAG pipeline:\r\n\r\n```python\r\nimport os\r\n\r\nimport cohere\r\nfrom DiskVectorIndex import DiskVectorIndex\r\n\r\nco = cohere.Client(api_key=os.environ[\"COHERE_API_KEY\"])\r\nindex = DiskVectorIndex(\"Cohere/trec-rag-2024-index\")\r\n\r\nquestion = \"Which popular deep learning frameworks were developed by Facebook and Google? 
What are their differences?\"\r\nprompt = f\"Answer the following question with a detailed answer: {question}\"\r\n\r\nprint(\"Question:\", question)\r\n\r\n# Step 1 - Decompose the question into sub-queries\r\nres = co.chat(\r\n    model=\"command-r-plus\",\r\n    message=prompt,\r\n    search_queries_only=True\r\n)\r\n\r\nsub_queries = [r.text for r in res.search_queries]\r\nprint(\"Generated sub-queries:\", sub_queries)\r\n\r\n# Step 2 - Search for relevant documents for each sub-query\r\nprint(\"Start searching\")\r\ndocs = []\r\ndoc_id = 1\r\nfor query in sub_queries:\r\n    hits = index.search(query, top_k=3)\r\n    for hit in hits:\r\n        docs.append({\"id\": str(doc_id), \"title\": hit[\"doc\"][\"title\"], \"snippet\": hit[\"doc\"][\"segment\"]})\r\n        doc_id += 1\r\n\r\nprint(f\"Documents found: {len(docs)}\")\r\n\r\n# Step 3 - Generate the response\r\nprint(\"Start generating response\")\r\nprint(\"==============\")\r\n\r\nfor event in co.chat_stream(model=\"command-r-plus\", message=prompt, documents=docs, citation_quality=\"fast\"):\r\n    if event.event_type == \"text-generation\":\r\n        # Print a text chunk\r\n        print(event.text, end=\"\")\r\n    elif event.event_type == \"citation-generation\":\r\n        # Print the citations as inline citations\r\n        print(\"[\" + \", \".join(event.citations[0].document_ids) + \"]\", end=\"\")\r\n```\r\n\r\n## How does DiskVectorIndex work?\r\n\r\nThe Cohere embeddings have been optimized to work well in compressed vector space, as detailed in our [Cohere int8 \u0026 binary Embeddings blog post](https://cohere.com/blog/int8-binary-embeddings). 
The embeddings have been trained not only to work in float32, which requires a lot of memory, but also to operate well with int8, binary, and Product Quantization (PQ) compression.\r\n\r\nThe above indices use Product Quantization (PQ) to go from the original 1024*4=4096 bytes per embedding to just 128 bytes per embedding, reducing the memory requirement 32x.\r\n\r\nFurther, we use [faiss](https://github.com/facebookresearch/faiss) with a memory-mapped IVF index: in this case, only a small fraction of the embeddings (between 32,768 and 131,072) must be loaded into memory.\r\n\r\n## Need Semantic Search at Scale?\r\n\r\nAt [Cohere](https://cohere.com) we have helped customers run semantic search on tens of billions of embeddings, at a fraction of the cost. Feel free to reach out to [Nils Reimers](mailto:nils@cohere.com) if you need a solution that scales.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcohere-ai%2Fdiskvectorindex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcohere-ai%2Fdiskvectorindex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcohere-ai%2Fdiskvectorindex/lists"}