{"id":28373106,"url":"https://github.com/kagisearch/vectordb","last_synced_at":"2025-06-25T12:30:37.956Z","repository":{"id":155453470,"uuid":"632269609","full_name":"kagisearch/vectordb","owner":"kagisearch","description":"A minimal Python package for storing and retrieving text using chunking, embeddings, and vector search.","archived":false,"fork":false,"pushed_at":"2024-10-01T20:50:32.000Z","size":1130,"stargazers_count":722,"open_issues_count":8,"forks_count":38,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-05T18:06:55.492Z","etag":null,"topics":["ai","artificial-intelligence","llm","llms","machine-learning"],"latest_commit_sha":null,"homepage":"https://vectordb.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kagisearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-25T04:13:24.000Z","updated_at":"2025-05-31T15:56:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"b217d3ff-c598-4dfc-8030-bde73f8309ae","html_url":"https://github.com/kagisearch/vectordb","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kagisearch/vectordb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kagisearch%2Fvectordb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kagisearch%2Fvectordb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kagisearch%2Fvectordb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/kagisearch%2Fvectordb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kagisearch","download_url":"https://codeload.github.com/kagisearch/vectordb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kagisearch%2Fvectordb/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261874089,"owners_count":23223061,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","llm","llms","machine-learning"],"created_at":"2025-05-29T18:39:23.543Z","updated_at":"2025-06-25T12:30:37.944Z","avatar_url":"https://github.com/kagisearch.png","language":"Python","funding_links":[],"categories":["Sdks \u0026 Libraries","Python"],"sub_categories":[],"readme":"# VectorDB\n\n\nVectorDB is a simple, lightweight, fully local, end-to-end solution for using embeddings-based text retrieval.\n\nThanks to its low latency and small memory footprint, VectorDB is used to power AI features inside [Kagi Search](https://kagi.com).\n\nCheck an [example Colab notebook](https://colab.research.google.com/drive/1pecKGCCru_Jvx7v0WRNrW441EBlcS5qS#scrollTo=Eh6o8m7d8eOk) where this is used to filter the content of [Kagi Small Web](https://kagi.com/smallweb) RSS feed based on stated user interests.\n\n\n## Installation\n\nTo install VectorDB, use pip:\n\n```\npip install vectordb2\n```\n\n## Usage\n\nQuick example that loads data into memory, and runs retrieval. 
All data will be handled locally, including embeddings and vector search, completely transparent to the user with maximum possible performance. \n\n```python\nfrom vectordb import Memory\n\n# Memory is where all content you want to store/search goes.\nmemory = Memory()\n\nmemory.save(\n    [\"apples are green\", \"oranges are orange\"],  # save your text content. for long text we will automatically chunk it\n    [{\"url\": \"https://apples.com\"}, {\"url\": \"https://oranges.com\"}], # associate any kind of metadata with it (optional)\n)\n\n# Search for top n relevant results, automatically using embeddings\nquery = \"green\"\nresults = memory.search(query, top_n = 1)\n\nprint(results)\n```\n\nThis returns the chunks with the added metadata and the vector distance (where 0 is the exact match and higher means further apart)\n\n```json\n[\n  {\n    \"chunk\": \"apples are green\",\n    \"metadata\": {\"url\": \"https://apples.com\"},\n    \"distance\": 0.87\n  }\n]\n```\n\n## Options\n\n\n**Memory(memory_file=None, chunking_strategy={\"mode\":\"sliding_window\"},\nembeddings=\"normal\")**\n\n\n- `memory_file`: *Optional.* Path to the memory file. If provided, memory will be persisted to disk and loaded from/saved to this file. \n- `chunking_strategy`: *Optional.* Dictionary containing the chunking mode.\n  \n   Options:\\\n  `{'mode':'sliding_window', 'window_size': 240, 'overlap': 8}`   (default)\\\n  `{'mode':'paragraph'}`\n- `embeddings`: *Optional.* \n  \n   Options:\\\n   `fast` - Uses Universal Sentence Encoder 4\\\n   `normal` - Uses \"BAAI/bge-small-en-v1.5\" (default)\\\n   `best` - Uses \"BAAI/bge-base-en-v1.5\"\\\n   `multilingual` - Uses Universal Sentence Encoder Multilingual Large 3\n\n\n   You can also specify a custom HuggingFace model by name, e.g. `TaylorAI/bge-micro-v2`. 
See also [Pretrained models](https://www.sbert.net/docs/pretrained_models.html) and [MTEB](https://huggingface.co/spaces/mteb/leaderboard).\n\n**Memory.save(texts, metadata, memory_file=None)**\n\nSave content to memory. Metadata will be automatically optimized to use fewer resources.\n\n- `texts`: *Required.*  Text or list of texts to be saved.\n- `metadata`: *Optional.* Metadata or list of metadata associated with the texts.\n- `memory_file`: *Optional.* Path to persist the memory file. By default \n\n**Memory.search(query, top_n=5, unique=False, batch_results=\"flatten\")**\n\nSearch inside memory.\n\n- `query`: *Required.* Query text or list of queries (see `batch_results` option below for handling results for a list).\n- `top_n`:  *Optional.* Number of most similar chunks to return (default: 5).\n- `unique`:  *Optional.* Return only chunks from unique original texts (additional chunks coming from the same text will be ignored). Note this may return fewer chunks than requested (default: False).\n- `batch_results`:  *Optional.* When input is a list of queries, output algorithm can be \"flatten\" or \"diverse\". \"flatten\" returns true nearest neighbours across all input queries, meaning all results could come from just one query. \"diverse\" attempts to spread out the results, so that each query's nearest neighbours are equally added (nearest first across all queries, then 2nd nearest and so on). 
(default: \"flatten\")\n\n**Memory.clear()**\n\nClears the memory.\n\n\n**Memory.dump()**\n\nPrints the contents of the memory.\n\n\n## Example\n\n```python\nfrom vectordb import Memory\n\nmemory = Memory(\n    chunking_strategy={\"mode\": \"sliding_window\", \"window_size\": 128, \"overlap\": 16}, embeddings='TaylorAI/bge-micro-v2'\n)\n\ntexts = [\n    \"\"\"\nMachine learning is a method of data analysis that automates analytical model building.\n\nIt is a branch of artificial intelligence based on the idea that systems can learn from data,\nidentify patterns and make decisions with minimal human intervention.\n\nMachine learning algorithms are trained on data sets that contain examples of the desired output. For example, a machine learning algorithm that is used to classify images might be trained on a data set that contains images of cats and dogs.\nOnce an algorithm is trained, it can be used to make predictions on new data. For example, the machine learning algorithm that is used to classify images could be used to predict whether a new image contains a cat or a dog.\n\nMachine learning algorithms can be used to solve a wide variety of problems. Some common applications of machine learning include:\n\nClassification: Categorizing data into different groups. For example, a machine learning algorithm could be used to classify emails as spam or not spam.\n\nRegression: Predicting a continuous value. For example, a machine learning algorithm could be used to predict the price of a house.\n\nClustering: Finding groups of similar data points. For example, a machine learning algorithm could be used to find groups of customers with similar buying habits.\n\nAnomaly detection: Finding data points that are different from the rest of the data. For example, a machine learning algorithm could be used to find fraudulent credit card transactions.\n\nMachine learning is a powerful tool that can be used to solve a wide variety of problems. 
As the amount of data available continues to grow, machine learning is likely to become even more important in the future.\n\"\"\",\n    \"\"\"\nArtificial intelligence (AI) is the simulation of human intelligence in machines\nthat are programmed to think like humans and mimic their actions.\n\nThe term may also be applied to any machine that exhibits traits associated with\na human mind such as learning and problem-solving.\n\nAI research has been highly successful in developing effective techniques for solving a wide range of problems, from game playing to medical diagnosis.\n\nHowever, there is still a long way to go before AI can truly match the intelligence of humans. One of the main challenges is that human intelligence is incredibly complex and poorly understood.\n\nDespite the challenges, AI is a rapidly growing field with the potential to revolutionize many aspects of our lives. Some of the potential benefits of AI include:\n\nIncreased productivity: AI can be used to automate tasks that are currently performed by humans, freeing up our time for more creative and fulfilling activities.\n\nImproved decision-making: AI can be used to make more informed decisions, based on a wider range of data than humans can typically access.\n\nEnhanced creativity: AI can be used to generate new ideas and solutions, beyond what humans can imagine on their own.\nOf course, there are also potential risks associated with AI, such as:\n\nJob displacement: As AI becomes more capable, it is possible that it will displace some human workers.\n\nWeaponization: AI could be used to develop new weapons that are more powerful and destructive than anything we have today.\n\nLoss of control: If AI becomes too powerful, we may lose control over it, with potentially disastrous consequences.\n\nIt is important to weigh the potential benefits and risks of AI carefully as we continue to develop this technology. 
With careful planning and oversight, AI has the potential to make the world a better place. However, if we are not careful, it could also lead to serious problems.\n\"\"\",\n]\n\nmetadata_list = [\n    {\n        \"title\": \"Introduction to Machine Learning\",\n        \"url\": \"https://example.com/introduction-to-machine-learning\",\n    },\n    {\n        \"title\": \"Introduction to Artificial Intelligence\",\n        \"url\": \"https://example.com/introduction-to-artificial-intelligence\",\n    },\n]\n\nmemory.save(texts, metadata_list)\n\nquery = \"What is the relationship between AI and machine learning?\"\nresults = memory.search(query, top_n=3, unique=True)\nprint(results)\n\n# two results will be returned as unique param is set to True\n```\n\nOutput:\n```json\n[\n  {\n    \"chunk\": \"Artificial intelligence (AI) is the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving. AI research has been highly successful in developing effective techniques for solving a wide range of problems, from game playing to medical diagnosis. However, there is still a long way to go before AI can truly match the intelligence of humans. One of the main challenges is that human intelligence is incredibly complex and poorly understood. Despite the challenges, AI is a rapidly growing field with the potential to revolutionize many aspects of our lives. Some of the potential benefits of AI include: Increased\",\n    \"metadata\": {\n      \"title\": \"Introduction to Artificial Intelligence\",\n      \"url\": \"https://example.com/introduction-to-artificial-intelligence\"\n    },\n    \"distance\": 0.87\n  },\n  {\n    \"chunk\": \"Machine learning is a method of data analysis that automates analytical model building. 
It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. Machine learning algorithms are trained on data sets that contain examples of the desired output. For example, a machine learning algorithm that is used to classify images might be trained on a data set that contains images of cats and dogs. Once an algorithm is trained, it can be used to make predictions on new data. For example, the machine learning algorithm that is used to classify images could be used to predict whether a new image contains a cat or a dog. Machine learning algorithms can be used\",\n    \"metadata\": {\n      \"title\": \"Introduction to Machine Learning\",\n      \"url\": \"https://example.com/introduction-to-machine-learning\"\n    },\n    \"distance\": 0.83\n  }\n]\n\n```\n\n## Embeddings performance analysis\n\n\nWe constantly evaluate embedding models using standardized benchmarks (higher is better). Average latency is measured locally on CPU (lower is better). Benchmark data pulled from [MTEB](https://huggingface.co/spaces/mteb/leaderboard). 
\n\n\n\n| Model                                         | Latency  | Benchmark 1 | Benchmark 2 | Benchmark 3 | Benchmark 4 |\n|-----------------------------------------------|----------|-------------|-------------|-------------|-------------|\n| all-mpnet-base-v2                              | 6.12 s   | 80.28       | 65.07       | 43.69       | 83.04       |\n| all-MiniLM-L6-v2                               | 1.14 s   | 78.9        | 63.05       | 42.35       | 82.37       |\n| BAAI/bge-large-en-v1.5                         | 20.8 s   | 83.11       | 75.97       | 46.08       | 87.12       |\n| BAAI/bge-base-en-v1.5                          | 6.48 s   | 82.4        | 75.53       | 45.77       | 86.55       |\n| BAAI/bge-small-en-v1.5                         | 1.85 s   | 81.59       | 74.14       | 43.82       | 84.92       |\n| TaylorAI/bge-micro-v2                          | 0.671 s  | 78.65       | 68.04       | 39.18       | 82.81       |\n| TaylorAI/gte-tiny                              | 1.25 s   | 80.46       | 70.35       | 42.09       | 82.83       |\n| thenlper/gte-base                              | 6.28 s   | 82.3        | 73.01       | 46.2        | 84.57       |\n| thenlper/gte-small                             | 2.14 s   | 82.07       | 72.31       | 44.89       | 83.54       |\n| universal-sentence-encoder-large/5             | 0.769 s  | 74.05       | 67.9        | 37.82       | 79.53       |\n| universal-sentence-encoder-multilingual-large/3| 1.02 s   | 75.35       | 65.78       | 35.06       | 79.62       |\n| universal-sentence-encoder-multilingual/3      | 0.162 s  | 75.39       | 63.42       | 34.82       | 75.43       |\n| universal-sentence-encoder/4                   | 0.019 s  | 72.04       | 64.45       | 35.71       | 76.23       |\n\n*Relative embeddings latency on CPU*\n![Embeddings Latency on CPU](images/speed_cpu.png)\n\n*Relative embeddings latency on GPU*\n![Embeddings Latency on GPU](images/speed_gpu.png)\n\n\n![Embeddings 
Quality](images/quality.png)\n\n![Scatter of Embeddings](images/scatter.png)\n\n\n\n## Vector search performance analysis\n\nVectorDB is also optimized for speed of retrieval. It automatically uses [Faiss](https://github.com/facebookresearch/faiss) for a low number of chunks (\u003c4000) and [mrpt](https://github.com/vioshyvo/mrpt) for a high number of chunks to ensure maximum performance across the spectrum of use cases.\n\n![Vector search engine comparison](images/comparison.png)\n\n## License\n\nMIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkagisearch%2Fvectordb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkagisearch%2Fvectordb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkagisearch%2Fvectordb/lists"}