{"id":13647643,"url":"https://github.com/timescale/pgvectorscale","last_synced_at":"2025-05-14T15:02:22.450Z","repository":{"id":242935643,"uuid":"660831021","full_name":"timescale/pgvectorscale","owner":"timescale","description":"A complement to pgvector for high performance, cost efficient vector search on large workloads.","archived":false,"fork":false,"pushed_at":"2025-05-06T22:34:27.000Z","size":745,"stargazers_count":1942,"open_issues_count":26,"forks_count":88,"subscribers_count":21,"default_branch":"main","last_synced_at":"2025-05-07T04:58:49.431Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"postgresql","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/timescale.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-07-01T01:05:05.000Z","updated_at":"2025-05-06T21:37:30.000Z","dependencies_parsed_at":"2024-08-12T22:21:38.187Z","dependency_job_id":"0e8a0635-ed30-4e7f-a92f-ebbea5180895","html_url":"https://github.com/timescale/pgvectorscale","commit_stats":null,"previous_names":["timescale/pgvectorscale"],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timescale%2Fpgvectorscale","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timescale%2Fpgvectorscale/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timescale%2Fpgvectorscale/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/timescale%2Fpgvectorscale/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/timescale","download_url":"https://codeload.github.com/timescale/pgvectorscale/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254168653,"owners_count":22026205,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T01:03:41.184Z","updated_at":"2025-05-14T15:02:22.135Z","avatar_url":"https://github.com/timescale.png","language":"Rust","readme":"\u003cp\u003e\u003c/p\u003e\n\u003cdiv align=center\u003e\n\n# pgvectorscale\n\n\u003ch3\u003epgvectorscale builds on pgvector with higher performance embedding search and cost-efficient storage for AI applications. 
\u003c/h3\u003e\n\n[![Discord](https://img.shields.io/badge/Join_us_on_Discord-black?style=for-the-badge\u0026logo=discord\u0026logoColor=white)](https://discord.gg/KRdHVXAmkp)\n[![Try Timescale for free](https://img.shields.io/badge/Try_Timescale_for_free-black?style=for-the-badge\u0026logo=timescale\u0026logoColor=white)](https://tsdb.co/gh-pgvector-signup)\n\u003c/div\u003e\n\npgvectorscale complements [pgvector][pgvector], the open-source vector data extension for PostgreSQL, and introduces the following key innovations for pgvector data:\n- A new index type called StreamingDiskANN, inspired by the [DiskANN](https://github.com/microsoft/DiskANN) algorithm, based on research from Microsoft.\n- Statistical Binary Quantization: developed by Timescale researchers, this compression method improves on standard Binary Quantization.\n- Label-based filtered vector search: based on Microsoft's Filtered DiskANN research, this allows you to combine vector similarity search with label filtering for more precise and efficient results.\n\nOn a benchmark dataset of 50 million Cohere embeddings with 768 dimensions\neach, PostgreSQL with `pgvector` and `pgvectorscale` achieves **28x lower p95\nlatency** and **16x higher query throughput** compared to Pinecone's\nstorage-optimized (s1) index for approximate nearest neighbor queries at 99% recall,\nall at 75% less cost when self-hosted on AWS EC2.\n\n\u003cdiv align=center\u003e\n\n![Benchmarks](https://assets.timescale.com/docs/images/benchmark-comparison-pgvectorscale-pinecone.png)\n\n\u003c/div\u003e\n\nTo learn more about the performance impact of pgvectorscale, and details about benchmark methodology and results, see the [pgvector vs Pinecone comparison blog post](http://www.timescale.com/blog/pgvector-vs-pinecone).\n\nIn contrast to pgvector, which is written in C, pgvectorscale is developed in [Rust][rust-language] using the [PGRX framework](https://github.com/pgcentralfoundation/pgrx),\noffering the PostgreSQL community a 
new avenue for contributing to vector support.\n\n**Application developers or DBAs** can use pgvectorscale with their PostgreSQL databases.\n   * [Install pgvectorscale](#installation)\n   * [Get started using pgvectorscale](#get-started-with-pgvectorscale)\n\nIf you **want to contribute** to this extension, see how to [build pgvectorscale from source in a developer environment](./DEVELOPMENT.md).\n\nFor production vector workloads, get **private beta access to vector-optimized databases** with pgvector and pgvectorscale on Timescale. [Sign up here for priority access](https://timescale.typeform.com/to/H7lQ10eQ).\n\n## Installation\n\nThe fastest ways to run PostgreSQL with pgvectorscale are:\n\n* [Using a pre-built Docker container](#using-a-pre-built-docker-container)\n* [Installing from source](#installing-from-source)\n* [Enable pgvectorscale in a Timescale Cloud service](#enable-pgvectorscale-in-a-timescale-cloud-service)\n\n### Using a pre-built Docker container\n\n1. [Run the TimescaleDB Docker image](https://docs.timescale.com/self-hosted/latest/install/installation-docker/).\n\n1. Connect to your database:\n   ```bash\n   psql -d \"postgres://\u003cusername\u003e:\u003cpassword\u003e@\u003chost\u003e:\u003cport\u003e/\u003cdatabase-name\u003e\"\n   ```\n\n1. Create the pgvectorscale extension:\n\n    ```sql\n    CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;\n    ```\n\n   The `CASCADE` automatically installs `pgvector`.\n\n### Installing from source\n\nYou can build pgvectorscale from source and install it in an existing PostgreSQL server.\n\n\u003e [!WARNING]\n\u003e Building pgvectorscale on macOS x86 (Intel) machines is currently not\n\u003e supported due to an [open issue][macos-x86-issue]. As alternatives, you can:\n\u003e\n\u003e - Use an ARM-based Mac.\n\u003e - Build using Linux.\n\u003e - Use our pre-built Docker containers.\n\u003e\n\u003e We welcome community contributions to resolve this limitation. 
If you're\n\u003e interested in helping, please check the issue for details.\n\n1. Compile and install the extension\n\n    ```bash\n    # install rust\n    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n\n    # download pgvectorscale\n    cd /tmp\n    git clone --branch \u003cversion\u003e https://github.com/timescale/pgvectorscale\n    cd pgvectorscale/pgvectorscale\n    # install cargo-pgrx with the same version as pgrx\n    cargo install --locked cargo-pgrx --version $(cargo metadata --format-version 1 | jq -r '.packages[] | select(.name == \"pgrx\") | .version')\n    cargo pgrx init --pg17 pg_config\n    # build and install pgvectorscale\n    cargo pgrx install --release\n    ```\n\n    You can also take a look at our [documentation for extension developers](./DEVELOPMENT.md) for more complete instructions.\n\n1. Connect to your database:\n   ```bash\n   psql -d \"postgres://\u003cusername\u003e:\u003cpassword\u003e@\u003chost\u003e:\u003cport\u003e/\u003cdatabase-name\u003e\"\n   ```\n\n1. Ensure the pgvector extension is available:\n\n   ```sql\n   SELECT * FROM pg_available_extensions WHERE name = 'vector';\n   ```\n\n   If pgvector is not available, install it using the [pgvector installation\n   instructions][pgvector-install].\n\n\n1. Create the pgvectorscale extension:\n\n    ```sql\n    CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;\n    ```\n\n   The `CASCADE` automatically installs `pgvector`.\n\n### Enable pgvectorscale in a Timescale Cloud service\n\nNote: the instructions below are for Timescale's standard compute instance. For production vector workloads, we’re offering **private beta access to vector-optimized databases** with pgvector and pgvectorscale on Timescale. [Sign up here for priority access](https://timescale.typeform.com/to/H7lQ10eQ).\n\nTo enable pgvectorscale:\n\n1. 
Create a new [Timescale Service](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch).\n\n   If you want to use an existing service, pgvectorscale is added as an available extension in the first maintenance window\n   after the pgvectorscale release date.\n\n1. Connect to your Timescale service:\n   ```bash\n   psql -d \"postgres://\u003cusername\u003e:\u003cpassword\u003e@\u003chost\u003e:\u003cport\u003e/\u003cdatabase-name\u003e\"\n   ```\n\n1. Create the pgvectorscale extension:\n\n    ```postgresql\n    CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;\n    ```\n\n   The `CASCADE` automatically installs `pgvector`.\n\n\n## Get started with pgvectorscale\n\n\n1. Create a table with an embedding column. For example:\n\n    ```postgresql\n    CREATE TABLE IF NOT EXISTS document_embedding (\n        id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,\n        metadata JSONB,\n        contents TEXT,\n        embedding VECTOR(1536)\n    );\n    ```\n\n1. Populate the table.\n\n   For more information, see the [pgvector instructions](https://github.com/pgvector/pgvector/blob/master/README.md#storing) and [list of clients](https://github.com/pgvector/pgvector/blob/master/README.md#languages).\n\n1. Create a StreamingDiskANN index on the embedding column:\n\n    ```postgresql\n    CREATE INDEX document_embedding_idx ON document_embedding\n    USING diskann (embedding vector_cosine_ops);\n    ```\n\n1. Find the 10 closest embeddings using the index:\n\n    ```postgresql\n    SELECT *\n    FROM document_embedding\n    ORDER BY embedding \u003c=\u003e $1\n    LIMIT 10;\n    ```\n\n    Note: pgvectorscale currently supports cosine distance (`\u003c=\u003e`) queries, for indices created with `vector_cosine_ops`; L2 distance (`\u003c-\u003e`) queries, for indices created with `vector_l2_ops`; and inner product (`\u003c#\u003e`) queries, for indices created with `vector_ip_ops`. This is the same syntax used by `pgvector`. 
If you would like additional distance types,\n    [create an issue](https://github.com/timescale/pgvectorscale/issues).  (Note: inner product indices are not compatible with plain storage.)\n\n## Filtered Vector Search\n\npgvectorscale supports combining vector similarity search with metadata filtering. There are two basic kinds of filtering, which can be combined in a single query:\n\n1. **Label-based filtering with the diskann index**: This provides optimized performance for filtering by labels.\n2. **Arbitrary WHERE clause filtering**: This uses post-filtering after the vector search.\n\nThe label-based filtering implementation is based on the [Filtered DiskANN](https://dl.acm.org/doi/10.1145/3543507.3583552) approach developed by Microsoft researchers, which enables efficient filtered vector search while maintaining high recall.\n\nThe post-filtering implementation, while slower, is streaming and correct, ensuring accurate results without requiring the entire result set to be loaded into memory.\n\n### Label-based Filtering with diskann\n\nFor optimal performance with label filtering, you must specify the label column directly in the index creation:\n\n1. Create a table with an embedding column and a labels array:\n\n    ```postgresql\n    CREATE TABLE documents (\n        id SERIAL PRIMARY KEY,\n        embedding VECTOR(1536),\n        labels SMALLINT[],  -- Array of category labels\n        status TEXT,\n        created_at TIMESTAMPTZ\n    );\n    ```\n\n2. Create a StreamingDiskANN index on the embedding column, including the labels column:\n\n    ```postgresql\n    CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops, labels);\n    ```\n\n\u003e **Note**: Label values must be within the PostgreSQL `smallint` range (-32768 to 32767). 
Using `smallint[]` for labels ensures that PostgreSQL's type system will automatically enforce these bounds.\n\u003e \n\u003e pgvectorscale includes an implementation of the `\u0026\u0026` overlap operator for `smallint[]` arrays, which is used for efficient label-based filtering.\n\n3. Perform label-filtered vector searches using the `\u0026\u0026` operator (array overlap):\n\n    ```postgresql\n    -- Find similar documents with specific labels\n    SELECT * FROM documents\n    WHERE labels \u0026\u0026 ARRAY[1, 3]  -- Documents with label 1 OR 3\n    ORDER BY embedding \u003c=\u003e '[...]'\n    LIMIT 10;\n    ```\n\n    The index directly supports this type of filtering, providing significantly lower latency results compared to post-filtering.\n\n#### Giving Semantic Meaning to Labels\n\nWhile the labels must be stored as integers in the array for the index to work efficiently, you can give them semantic meaning by relating them to a separate labels table:\n\n1. Create a labels table with meaningful descriptions:\n\n    ```postgresql\n    CREATE TABLE label_definitions (\n        id INTEGER PRIMARY KEY,\n        name TEXT,\n        description TEXT,\n        attributes JSONB  -- Can store additional metadata about the label\n    );\n\n    -- Insert some label definitions\n    INSERT INTO label_definitions (id, name, description, attributes) VALUES\n    (1, 'science', 'Scientific content', '{\"domain\": \"academic\", \"confidence\": 0.95}'),\n    (2, 'technology', 'Technology-related content', '{\"domain\": \"technical\", \"confidence\": 0.92}'),\n    (3, 'business', 'Business and finance content', '{\"domain\": \"commercial\", \"confidence\": 0.88}');\n    ```\n\n2. When inserting documents, use the appropriate label IDs:\n\n    ```postgresql\n    -- Insert a document with science and technology labels\n    INSERT INTO documents (embedding, labels)\n    VALUES ('[...]', ARRAY[1, 2]);\n    ```\n\n3. 
When querying, you can join with the labels table to work with meaningful names:\n\n    ```postgresql\n    -- Find similar science documents and include label information\n    SELECT d.*, array_agg(l.name) as label_names\n    FROM documents d\n    JOIN label_definitions l ON l.id = ANY(d.labels)\n    WHERE d.labels \u0026\u0026 ARRAY[1]  -- Science label\n    GROUP BY d.id, d.embedding, d.labels, d.status, d.created_at\n    ORDER BY d.embedding \u003c=\u003e '[...]'\n    LIMIT 10;\n    ```\n\n4. You can also convert between label names and IDs when filtering:\n\n    ```postgresql\n    -- Find documents with specific label names\n    SELECT d.*\n    FROM documents d\n    WHERE d.labels \u0026\u0026 (\n        SELECT array_agg(id)\n        FROM label_definitions\n        WHERE name IN ('science', 'business')\n    )\n    ORDER BY d.embedding \u003c=\u003e '[...]'\n    LIMIT 10;\n    ```\n\nThis approach gives you the performance benefits of integer-based label filtering while still allowing you to work with semantically meaningful labels in your application.\n\n### Arbitrary WHERE Clause Filtering\n\nYou can also use any PostgreSQL WHERE clause with vector search, but these conditions will be applied as post-filtering:\n\n```postgresql\n-- Find similar documents with specific status and date range\nSELECT * FROM documents\nWHERE status = 'active' AND created_at \u003e '2024-01-01'\nORDER BY embedding \u003c=\u003e '[...]'\nLIMIT 10;\n```\n\nFor these arbitrary conditions, the vector search happens first, and then the WHERE conditions are applied to the results. For best performance with frequently used filters, consider using the label-based approach described above.\n\n## Tuning\n\nThe StreamingDiskANN index comes with **smart defaults** but also the ability to customize its behavior. 
There are two types of parameters: index build-time parameters, which are specified when an index is created, and query-time parameters, which can be tuned when querying an index.\n\nWe suggest setting the index build-time parameters for major changes to index operations, while query-time parameters can be used to tune the accuracy/performance tradeoff for individual queries.\n\nWe expect most people to tune the query-time parameters (if any) and leave the index build-time parameters set to default.\n\n### StreamingDiskANN index build-time parameters\n\nThese parameters can be set when an index is created.\n\n| Parameter name | Description | Default value |\n|----------------|-------------|---------------|\n| `storage_layout` | `memory_optimized`, which uses SBQ to compress vector data, or `plain`, which stores data uncompressed | `memory_optimized` |\n| `num_neighbors` | Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower. | 50 |\n| `search_list_size` | The S parameter used in the greedy search algorithm during construction. Higher values improve graph quality at the cost of slower index builds. | 100 |\n| `max_alpha` | The alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds. | 1.2 |\n| `num_dimensions` | The number of dimensions to index. By default, all dimensions are indexed. 
But you can also index fewer dimensions to make use of [Matryoshka embeddings](https://huggingface.co/blog/matryoshka) | 0 (all dimensions) |\n| `num_bits_per_dimension` | Number of bits used to encode each dimension when using SBQ | 2 for fewer than 900 dimensions, 1 otherwise |\n\nAn example of how to set the `num_neighbors` parameter is:\n\n```sql\nCREATE INDEX document_embedding_idx ON document_embedding\nUSING diskann (embedding) WITH (num_neighbors = 50);\n```\n\nAn example of creating an index with label-based filtering:\n\n```sql\nCREATE INDEX document_embedding_idx ON document_embedding\nUSING diskann (embedding vector_cosine_ops, labels);\n```\n\n### StreamingDiskANN query-time parameters\n\nYou can also set two parameters to control the accuracy vs. query speed trade-off at query time. We suggest adjusting `diskann.query_rescore` to fine-tune accuracy.\n\n| Parameter name | Description | Default value |\n|----------------|-------------|---------------|\n| `diskann.query_search_list_size` | The number of additional candidates considered during the graph search. | 100 |\n| `diskann.query_rescore` | The number of elements rescored (0 to disable rescoring). | 50 |\n\nYou can set the value by using `SET` before executing a query. For example:\n\n```sql\nSET diskann.query_rescore = 400;\n```\n\nNote that the [SET command](https://www.postgresql.org/docs/current/sql-set.html) applies to the entire session (database connection) from the point of execution. 
You can use the transaction-local variant, `SET LOCAL`, which is\nreset at the end of the transaction:\n\n```sql\nBEGIN;\nSET LOCAL diskann.query_search_list_size = 10;\nSELECT * FROM document_embedding ORDER BY embedding \u003c=\u003e $1 LIMIT 10;\nCOMMIT;\n```\n\n## Null Value Handling\n\n* Null vectors are not indexed\n* Null labels are treated as empty arrays\n* Null values in label arrays are ignored\n\n## ORDER BY vector distance\n\npgvectorscale's diskann index uses relaxed ordering, which allows results to be\nslightly out of order by distance. This is analogous to using\n[`iterative scan with relaxed ordering`][pgvector-iterative-index-scan] with\npgvector's ivfflat or hnsw indexes.\n\nIf you need strict ordering, you can use a [materialized CTE][materialized-cte]:\n\n```sql\nWITH relaxed_results AS MATERIALIZED (\n    SELECT id, embedding \u003c=\u003e '[1,2,3]' AS distance\n    FROM items\n    WHERE category_id = 123\n    ORDER BY distance\n    LIMIT 5\n) SELECT * FROM relaxed_results ORDER BY distance;\n```\n\n## Index on an UNLOGGED table\n\nCreating an index on an UNLOGGED table is currently not supported.\nTrying will yield the error:\n\n```\nERROR:  ambuildempty: not yet implemented\n```\n\n## Get involved\n\npgvectorscale is still at an early stage. Now is a great time to help shape the\ndirection of this project; we are currently deciding priorities. Have a look at the\nlist of features we're thinking of working on. Feel free to comment, expand\nthe list, or hop on the Discussions forum.\n\n## About Timescale\n\nTimescale is a PostgreSQL cloud company. To learn more, visit [timescale.com](https://www.timescale.com).\n\n[Timescale Cloud](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch) is a high-performance, developer-focused cloud platform that provides PostgreSQL services for the most demanding AI, time-series, analytics, and event workloads. 
Timescale Cloud is ideal for production applications and provides high availability, streaming backups, upgrades over time, roles and permissions, and great security.\n\n[pgvector]: https://github.com/pgvector/pgvector/blob/master/README.md\n[rust-language]: https://www.rust-lang.org/\n[pgvector-install]: https://github.com/pgvector/pgvector?tab=readme-ov-file#installation\n[pgvector-iterative-index-scan]: https://github.com/pgvector/pgvector?tab=readme-ov-file#iterative-index-scans\n[materialized-cte]: https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-CTE-MATERIALIZATION\n[macos-x86-issue]: https://github.com/timescale/pgvectorscale/issues/155\n","funding_links":[],"categories":["Rust","Vector Database Extensions"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimescale%2Fpgvectorscale","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftimescale%2Fpgvectorscale","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimescale%2Fpgvectorscale/lists"}