{"id":19747966,"url":"https://github.com/tensorchord/pg_bestmatch.rs","last_synced_at":"2025-04-09T13:11:27.509Z","repository":{"id":242614589,"uuid":"806407751","full_name":"tensorchord/pg_bestmatch.rs","owner":"tensorchord","description":"Generate BM25 sparse vector inside PostgreSQL","archived":false,"fork":false,"pushed_at":"2024-11-05T04:02:56.000Z","size":1567,"stargazers_count":63,"open_issues_count":10,"forks_count":11,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-02T08:48:10.353Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tensorchord.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-27T06:29:54.000Z","updated_at":"2025-03-07T13:01:54.000Z","dependencies_parsed_at":"2024-06-04T02:40:27.010Z","dependency_job_id":"8d3b1c1f-4c8d-464b-a9bf-a09aa2cfe1f5","html_url":"https://github.com/tensorchord/pg_bestmatch.rs","commit_stats":null,"previous_names":["tensorchord/pg_bestmatch.rs"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorchord%2Fpg_bestmatch.rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorchord%2Fpg_bestmatch.rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorchord%2Fpg_bestmatch.rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorchord%2Fpg_bestmatch.rs/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/tensorchord","download_url":"https://codeload.github.com/tensorchord/pg_bestmatch.rs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248045266,"owners_count":21038555,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T02:19:38.281Z","updated_at":"2025-04-09T13:11:27.493Z","avatar_url":"https://github.com/tensorchord.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"# pg_bestmatch.rs\n\nThis PostgreSQL extension provides BM25 text query functionality by generating BM25 statistic sparse vectors for text. BM25 outperforms dense vector-based retrieval methods in many [RAG benchmark tasks](https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval).\n\nUsers can use vector search extensions such as `pgvecto.rs` or `pgvector` for efficient searches in PostgreSQL.\n\n\u003e [!IMPORTANT]  \n\u003e Based on our initial tests, HNSW indexing does not support the sparse vectors generated by BM25 very well. 
The high sparsity prevents effective navigation within the graph.\n\n\n* [Installation](#installation)\n* [How does it work?](#how-does-it-work)\n* [Usage](#usage)\n* [Build from source](#build-from-source)\n* [Comparison with pg_search](#comparison-with-pg_search)\n* [Reference](#reference)\n\n## Installation\n\n```sql\nCREATE EXTENSION pg_bestmatch;\nSET search_path TO public, bm_catalog;\n```\n\n## How does it work?\n- Create BM25 statistics for your document set with `bm25_create(table_name, column_name, statistic_name);`. This creates a materialized view that records the statistics.\n- Generate the document sparse vector with `bm25_document_to_svector(statistic_name, passage)`.\n- For a query, generate the query sparse vector with `bm25_query_to_svector(statistic_name, query)`.\n- Calculate the score as the dot product of the query sparse vector and the document sparse vector.\n- Currently we use the Hugging Face tokenizer with the `bert-base-uncased` vocabulary to tokenize words. More tokenizer configuration may be supported in the future.\n\n## Usage\n\nHere is an example workflow that demonstrates this extension using the [Stanford LoCo benchmark](https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval).\n\n0. Load the dataset. 
Here is a script for you if you want to experience `pg_bestmatch` with the dataset.\n\n```sh\nwget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Documents/parquet/default/test/0.parquet -O documents.parquet\nwget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Queries/parquet/default/test/0.parquet -O queries.parquet\n```\n\n```python\nimport pandas as pd\nfrom sqlalchemy import create_engine\nimport numpy as np\nfrom psycopg2.extensions import register_adapter, AsIs\n\ndef adapter_numpy_float64(numpy_float64):\n    return AsIs(numpy_float64)\n\ndef adapter_numpy_int64(numpy_int64):\n    return AsIs(numpy_int64)\n\ndef adapter_numpy_float32(numpy_float32):\n    return AsIs(numpy_float32)\n\ndef adapter_numpy_int32(numpy_int32):\n    return AsIs(numpy_int32)\n\ndef adapter_numpy_array(numpy_array):\n    return AsIs(tuple(numpy_array))\n\nregister_adapter(np.float64, adapter_numpy_float64)\nregister_adapter(np.int64, adapter_numpy_int64)\nregister_adapter(np.float32, adapter_numpy_float32)\nregister_adapter(np.int32, adapter_numpy_int32)\nregister_adapter(np.ndarray, adapter_numpy_array)\n\ndb_url = \"postgresql://localhost:5432/pg_bestmatch_test\"\nengine = create_engine(db_url)\n\ndef load_documents():\n    df = pd.read_parquet(\"documents.parquet\")\n    df.to_sql(\"documents\", engine, if_exists='replace', index=False)\n\ndef load_queries():\n    df = pd.read_parquet(\"queries.parquet\")\n    df['answer_pids'] = df['answer_pids'].apply(lambda x: str(x[0]))    \n    df.to_sql(\"queries\", engine, if_exists='replace', index=False)\n\nload_documents()\nload_queries()\n```\n\n1. Create BM25 statistics for the `documents` table.\n\n```sql\nSELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);\n```\n\n2. 
Add an embedding column to the `documents` and `queries` tables and update the embeddings for documents and queries.\n\n```sql\nALTER TABLE documents ADD COLUMN embedding svector; -- for pgvecto.rs users\nALTER TABLE documents ADD COLUMN embedding sparsevec; -- for pgvector users\n\nUPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage)::svector; -- for pgvecto.rs users\nUPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- for pgvector users\n```\n\n3. (Optional) Create a vector index on the sparse vector column.\n\n```sql\nCREATE INDEX ON documents USING vectors (embedding svector_dot_ops); -- for pgvecto.rs users\nCREATE INDEX ON documents USING ivfflat (embedding sparsevec_ip_ops); -- for pgvector users\n```\n\n4. Perform a vector search to find the most relevant documents for each query.\n\n```sql\nALTER TABLE queries ADD COLUMN embedding svector; -- for pgvecto.rs users\nALTER TABLE queries ADD COLUMN embedding sparsevec; -- for pgvector users\n\nUPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query)::svector; -- for pgvecto.rs users\nUPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query, 'pgvector')::sparsevec; -- for pgvector users\n\nSELECT sum((array[answer_pids] = array(SELECT pid FROM documents WHERE queries.dataset = documents.dataset ORDER BY queries.embedding \u003c#\u003e documents.embedding LIMIT 1))::int) FROM queries;\n```\n\nThis workflow showcases how to leverage BM25 text queries and vector search in PostgreSQL using this extension. The Top 1 recall of BM25 on this dataset is `0.77`. If you reproduce the result, your operations are correct.\n\n\n## Build from source\n\nBefore building, you should have `PostgreSQL`, `Rust` and `Cargo` installed on your system.\n\n1. Install `cargo-pgrx`.\n\n```sh\ncargo install cargo-pgrx --version v0.12.0-alpha.1\n```\n\n2. 
Initialize `cargo-pgrx`.\n\n```sh\ncargo pgrx init --pg16=$(which pg_config)   # assuming that you have PostgreSQL 16 installed\n```\n\n3. Build.\n\n```sh\ncargo pgrx install --release    # if you want to install it on your machine\ncargo pgrx package  # if you want to package `pg_bestmatch`\n```\n\n## Comparison with pg_search \n- `pg_bestmatch.rs` only provides methods for generating sparse vectors and does not support index-based search (which can be achieved by pgvecto.rs or pgvector). \n- `pg_search` performs BM25 retrieval via the external `tantivy` engine, which may have limitations when combined with transactions, filters, or JOIN operations. Since `pg_bestmatch.rs` is entirely native to Postgres, it offers full compatibility with these operations inside postgres.\n\n## Reference\n\n- `tokenize`\n  - Description: Tokenizes an input string into individual tokens.\n  - Example:\n    ```sql\n    SELECT tokenize('i have an apple'); -- result: {i,have,an,apple}\n    ```\n- `bm25_create`\n  - Description: Creates BM25 statistics for a specified table and column.\n  - Usage: \n    ```sql\n    SELECT bm25_create('documents', 'passage', 'documents_passage_bm25');\n    ```\n  - Parameters:\n    - `table_name`: Name of the table.\n    - `column_name`: Name of the column.\n    - `stat_name`: Name of the BM25 statistics.\n    - `b`: BM25 parameter (default 0.75).\n    - `k`: BM25 parameter (default 1.2).\n- `bm25_refresh`\n  - Description: Updates the BM25 statistics to reflect any changes in the underlying data.\n  - Usage:\n    ```sql\n    SELECT bm25_refresh('documents_passage_bm25');\n    ```\n  - Parameters:\n    - `stat_name`: Name of the BM25 statistics to update.\n- `bm25_drop`\n  - Description: Deletes the BM25 statistics for a specified table and column.\n  - Usage:\n    ```sql\n    SELECT bm25_drop('documents_passage_bm25');\n    ```\n  - Parameters:\n    - `stat_name`: Name of the BM25 statistics to delete.\n- `bm25_document_to_svector`\n  - Description: 
Converts document text into a sparse vector representation.\n  - Usage:\n    ```sql\n    SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text');\n    ```\n  - Parameters:\n    - `stat_name`: Name of the BM25 statistics.\n    - `document_text`: The text of the document.\n    - `style`: Emits `pgvecto.rs`-style sparse vector or `pgvector`-style sparse vector.\n- `bm25_query_to_svector`\n  - Description: Converts query text into a sparse vector representation.\n  - Usage:\n    ```sql\n    SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.');\n    ```\n  - Parameters:\n    - `stat_name`: Name of the BM25 statistics.\n    - `query_text`: The text of the query.\n    - `style`: Emits `pgvecto.rs`-style sparse vector or `pgvector`-style sparse vector.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorchord%2Fpg_bestmatch.rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftensorchord%2Fpg_bestmatch.rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorchord%2Fpg_bestmatch.rs/lists"}