{"id":26302687,"url":"https://github.com/datastax/jvector","last_synced_at":"2026-04-03T23:02:03.221Z","repository":{"id":190541178,"uuid":"682834027","full_name":"datastax/jvector","owner":"datastax","description":"JVector: the most advanced embedded vector search engine","archived":false,"fork":false,"pushed_at":"2026-03-31T23:41:45.000Z","size":16318,"stargazers_count":1695,"open_issues_count":53,"forks_count":149,"subscribers_count":34,"default_branch":"main","last_synced_at":"2026-04-01T01:24:22.262Z","etag":null,"topics":["ann","java","knn","machine-learning","search-engine","similarity-search","vector-search"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datastax.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE.txt","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-08-25T01:45:20.000Z","updated_at":"2026-03-30T10:47:22.000Z","dependencies_parsed_at":"2025-08-26T23:44:40.637Z","dependency_job_id":"5c565306-0978-4248-9fb8-0b778a177ffd","html_url":"https://github.com/datastax/jvector","commit_stats":null,"previous_names":["jbellis/jvector","datastax/jvector"],"tags_count":55,"template":false,"template_full_name":null,"purl":"pkg:github/datastax/jvector","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastax%2Fjvector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastax%2Fjvector/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/datastax%2Fjvector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastax%2Fjvector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datastax","download_url":"https://codeload.github.com/datastax/jvector/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datastax%2Fjvector/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31355719,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-03T08:03:20.796Z","status":"ssl_error","status_checked_at":"2026-04-03T08:00:37.834Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ann","java","knn","machine-learning","search-engine","similarity-search","vector-search"],"created_at":"2025-03-15T08:00:55.916Z","updated_at":"2026-04-03T23:02:03.211Z","avatar_url":"https://github.com/datastax.png","language":"Java","readme":"## Introduction to approximate nearest neighbor search\n\nExact nearest neighbor search (k-nearest-neighbor or KNN) is prohibitively expensive in higher dimensions, because approaches to segmenting the search space that work in 2D or 3D, like quadtrees or k-d trees, devolve to linear scans as dimensionality grows.  
This is one aspect of what is called “[the curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality).”\n\nWith larger datasets, it is almost always more useful to get an approximate answer in logarithmic time than the exact answer in linear time.  This is abbreviated as ANN (approximate nearest neighbor) search.\n\nThere are two broad categories of ANN index:\n* Partition-based indexes, like [LSH or IVF](https://www.datastax.com/guides/what-is-a-vector-index) or [SCANN](https://github.com/google-research/google-research/tree/master/scann)\n* Graph indexes, like [HNSW](https://arxiv.org/abs/1603.09320) or [DiskANN](https://www.microsoft.com/en-us/research/project/project-akupara-approximate-nearest-neighbor-search-for-large-scale-semantic-search/)\n\nGraph-based indexes tend to be simpler to implement and faster, but more importantly they can be constructed and updated incrementally.  This makes them a much better fit for a general-purpose index than partitioning approaches that only work on static datasets that are completely specified up front.  That is why all the major commercial vector indexes use graph approaches.\n\nJVector is a graph index that merges the DiskANN and HNSW family trees.\nJVector borrows the hierarchical structure from HNSW, and uses Vamana (the algorithm behind DiskANN) within each layer.\n\n\n## JVector Architecture\n\nJVector is a graph-based index that builds on the HNSW and DiskANN designs with composable extensions.\n\nJVector implements a multi-layer graph with nonblocking concurrency control, allowing construction to scale linearly with the number of cores:\n![JVector scales linearly as thread count increases](https://github.com/jbellis/jvector/assets/42158/f0127bfc-6c45-48b9-96ea-95b2120da0d9)\n\nThe upper layers of the hierarchy are represented by an in-memory adjacency list per node. 
This allows for quick navigation with no I/O.\nThe bottom layer of the graph is represented by an on-disk adjacency list per node. JVector uses additional data stored inline to support two-pass searches, with the first pass powered by lossily compressed representations of the vectors kept in memory, and the second by a more accurate representation read from disk.  The first pass can be performed with:\n* Product quantization (PQ), optionally with [anisotropic weighting](https://arxiv.org/abs/1908.10396)\n* [Binary quantization](https://huggingface.co/blog/embedding-quantization) (BQ)\n* Fused PQ, where PQ codebooks are written inline with the graph adjacency list\n\nThe second pass can be performed with:\n* Full-resolution float32 vectors\n* NVQ, which uses a non-uniform technique to quantize vectors with high accuracy\n\n[This two-pass design reduces memory usage and latency while preserving accuracy](https://thenewstack.io/why-vector-size-matters/).  \n\nAdditionally, JVector is unique in offering the ability to construct the index itself using two-pass searches, allowing larger-than-memory indexes to be built:\n![Much larger indexes](https://github.com/jbellis/jvector/assets/42158/34cb8094-68fa-4dc3-b3ce-4582fdbd77e1)\n\nThis is important because it allows you to take advantage of logarithmic search within a single index, instead of spilling over to linear-time merging of results from multiple indexes.\n\n\n## Getting started with JVector\n\nIntroductory tutorials for JVector are available in [docs/tutorials](./docs/tutorials/). Start with the [basic tutorial](./docs/tutorials/1-intro-tutorial.md) or review [VectorIntro.java](./jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/VectorIntro.java) for a simple example using JVector.\n\nThe older step-by-step guide for JVector can be found [here](./docs/legacy/jvector-step-by-step.md). 
New users should start with the tutorials mentioned earlier, but the step-by-step guide contains useful commentary for advanced users.\n\n\n## The research behind the algorithms\n\n* Foundational work: [HNSW](https://ieeexplore.ieee.org/abstract/document/8594636) and [DiskANN](https://suhasjs.github.io/files/diskann_neurips19.pdf) papers, and [a higher level explainer](https://www.datastax.com/guides/hierarchical-navigable-small-worlds)\n* [Anisotropic PQ paper](https://arxiv.org/abs/1908.10396)\n* [Quicker ADC paper](https://arxiv.org/abs/1812.09162)\n* [NVQ paper](https://arxiv.org/abs/2509.18471)\n\n## Developing and Testing\nThis project is organized as a [multimodule Maven build](https://maven.apache.org/guides/mini/guide-multiple-modules.html). The intent is to produce a multirelease jar suitable for use as\na dependency from any Java 11 code. When run on a Java 20+ JVM with the Vector module enabled, optimized vector\nproviders will be used. In general, the project is structured to be built with JDK 20+, but when `JAVA_HOME` is set to\nJava 11 -\u003e Java 19, certain build features will still be available.\n\nBase code is in [jvector-base](./jvector-base) and will be built for Java 11 releases, restricting language features and APIs\nappropriately. Code in [jvector-twenty](./jvector-twenty) will be compiled for Java 20 language features/APIs and included in the final\nmultirelease jar targeting supported JVMs. [jvector-multirelease](./jvector-multirelease) packages [jvector-base](./jvector-base) and [jvector-twenty](./jvector-twenty) as a\nmultirelease jar for release. [jvector-examples](./jvector-examples) is an additional sibling module that uses the reactor-representation of\njvector-base/jvector-twenty to run example code. [jvector-tests](./jvector-tests) contains tests for the project, capable of running against\nboth Java 11 and Java 20+ JVMs.\n\nTo run tests against Java 20+, use `mvn test`. 
To run tests against Java 11, use `mvn -Pjdk11 test`.\nTo run a single test class, use the Maven Surefire test filtering capability, e.g.,\n`mvn -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestNeighborArray test`.\nYou may also use method-level filtering and patterns, e.g.,\n`mvn -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestNeighborArray#testRetain* test`.\n(The `failIfNoSpecifiedTests` option works around a quirk of surefire: it is happy to run `test` with submodules with empty test sets,\nbut as soon as you supply a filter, it wants at least one match in every submodule.)\n\nYou can run `SiftSmall` and `Bench` directly to get an idea of what all is going on here. `Bench` will automatically download required datasets to the `fvec` and `hdf5` directories.\nThe files used by `SiftSmall` can be found in the [siftsmall directory](./siftsmall) in the project root.\n\nTo run either class, you can use the Maven exec-plugin via the following incantations:\n\n\u003e `mvn compile exec:exec@bench`\n\nor for Sift:\n\n\u003e `mvn compile exec:exec@sift`\n\n`Bench` takes an optional `benchArgs` argument that can be set to a list of whitespace-separated regexes. If any of the\nprovided regexes match within a dataset name, that dataset will be included in the benchmark. For example, to run only the glove\nand nytimes datasets, you could use:\n\n\u003e `mvn compile exec:exec@bench -DbenchArgs=\"glove nytimes\"`\n\nTo run Sift/Bench without the JVM vector module available, you can use the following invocations:\n\n\u003e `mvn -Pjdk11 compile exec:exec@bench`\n\n\u003e `mvn -Pjdk11 compile exec:exec@sift`\n\nThe `... 
-Pjdk11` invocations will also work with `JAVA_HOME` pointing at a Java 11 installation.\n\nFor more information on running benchmarks, see [docs/benchmarking.md](./docs/benchmarking.md).\n\nTo release, configure `~/.m2/settings.xml` to point to OSSRH and run `mvn -Prelease clean deploy`.\n\n---\n","funding_links":[],"categories":["Awesome Vector Search Engine","Java","数据库"],"sub_categories":["Standalone Service"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatastax%2Fjvector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatastax%2Fjvector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatastax%2Fjvector/lists"}