https://github.com/tensorchord/pg_bestmatch.rs
Generate BM25 sparse vector inside PostgreSQL
- Host: GitHub
- URL: https://github.com/tensorchord/pg_bestmatch.rs
- Owner: tensorchord
- License: apache-2.0
- Created: 2024-05-27T06:29:54.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-05T04:02:56.000Z (over 1 year ago)
- Last Synced: 2025-04-02T08:48:10.353Z (over 1 year ago)
- Language: Rust
- Homepage:
- Size: 1.49 MB
- Stars: 63
- Watchers: 4
- Forks: 11
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# pg_bestmatch.rs
This PostgreSQL extension supports BM25 text retrieval by generating BM25 sparse vectors from text. BM25 outperforms dense vector-based retrieval methods in many [RAG benchmark tasks](https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval).
You can then use a vector search extension such as `pgvecto.rs` or `pgvector` to search these vectors efficiently in PostgreSQL.
> [!IMPORTANT]
> Based on our initial tests, HNSW indexing does not support the sparse vectors generated by BM25 very well. The high sparsity prevents effective navigation within the graph.
* [Installation](#installation)
* [How does it work?](#how-does-it-work)
* [Usage](#usage)
* [Build from source](#build-from-source)
* [Comparison with pg_search](#comparison-with-pg_search)
* [Reference](#reference)
## Installation
```sql
CREATE EXTENSION pg_bestmatch;
SET search_path TO public, bm_catalog;
```
## How does it work?
- Create BM25 statistics for your document set with `bm25_create(table_name, column_name, statistic_name);`. This creates a materialized view that records the statistics.
- Generate a sparse vector for each document with `bm25_document_to_svector(statistic_name, passage)`.
- Generate a sparse vector for each query with `bm25_query_to_svector(statistic_name, query)`.
- Compute the score as the dot product of the query sparse vector and the document sparse vector.
- Text is currently tokenized with the Hugging Face tokenizer using the `bert-base-uncased` vocabulary. More tokenizer configuration options may be supported in the future.
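The steps above can be sketched in plain Python. This is a minimal illustration, not the extension's implementation. It assumes the standard Okapi BM25 formula with a Lucene-style IDF, puts the term-frequency part in the document vector and the IDF part in the query vector, and uses whitespace tokenization in place of the `bert-base-uncased` tokenizer. The class `Bm25Stats` and the helper `dot` are hypothetical names, and vectors are keyed by term rather than by token id.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization stands in for the real tokenizer.
    return text.lower().split()


class Bm25Stats:
    """Corpus statistics, playing the role of what bm25_create records."""

    def __init__(self, documents, b=0.75, k=1.2):
        self.b, self.k = b, k
        tokenized = [tokenize(d) for d in documents]
        self.n_docs = len(tokenized)
        self.avg_len = sum(len(t) for t in tokenized) / self.n_docs
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for toks in tokenized for term in set(toks))

    def idf(self, term):
        df = self.df.get(term, 0)
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))


def document_to_svector(stats, passage):
    # Document side: saturated, length-normalized term frequency.
    toks = tokenize(passage)
    norm = stats.k * (1 - stats.b + stats.b * len(toks) / stats.avg_len)
    return {t: tf * (stats.k + 1) / (tf + norm) for t, tf in Counter(toks).items()}


def query_to_svector(stats, query):
    # Query side: IDF weight for each query term known to the corpus.
    counts = Counter(tokenize(query))
    return {t: n * stats.idf(t) for t, n in counts.items() if t in stats.df}


def dot(u, v):
    # Sparse dot product over the terms both vectors share.
    return sum(w * v[t] for t, w in u.items() if t in v)


docs = ["the cat sat on the mat", "dogs chase cats", "the stock market fell today"]
stats = Bm25Stats(docs)
doc_vectors = [document_to_svector(stats, d) for d in docs]
query_vector = query_to_svector(stats, "stock market")
scores = [dot(query_vector, dv) for dv in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # the stock market fell today
```

Because the query vector is non-zero only for the query's terms, the dot product sums one BM25 term score per shared term, which is why a plain inner-product search can rank documents.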
## Usage
The following example workflow demonstrates this extension on the [Stanford LoCo benchmark](https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval).
0. Load the dataset. The following script loads it for you if you want to try `pg_bestmatch` with this dataset.
```sh
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Documents/parquet/default/test/0.parquet -O documents.parquet
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Queries/parquet/default/test/0.parquet -O queries.parquet
```
```python
import numpy as np
import pandas as pd
from psycopg2.extensions import register_adapter, AsIs
from sqlalchemy import create_engine


# Adapters so psycopg2 can serialize NumPy scalars and arrays.
def adapter_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)


def adapter_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)


def adapter_numpy_float32(numpy_float32):
    return AsIs(numpy_float32)


def adapter_numpy_int32(numpy_int32):
    return AsIs(numpy_int32)


def adapter_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))


register_adapter(np.float64, adapter_numpy_float64)
register_adapter(np.int64, adapter_numpy_int64)
register_adapter(np.float32, adapter_numpy_float32)
register_adapter(np.int32, adapter_numpy_int32)
register_adapter(np.ndarray, adapter_numpy_array)

db_url = "postgresql://localhost:5432/pg_bestmatch_test"
engine = create_engine(db_url)


def load_documents():
    df = pd.read_parquet("documents.parquet")
    df.to_sql("documents", engine, if_exists='replace', index=False)


def load_queries():
    df = pd.read_parquet("queries.parquet")
    # Keep only the first answer pid of each query, stored as text.
    df['answer_pids'] = df['answer_pids'].apply(lambda x: str(x[0]))
    df.to_sql("queries", engine, if_exists='replace', index=False)


load_documents()
load_queries()
```
1. Create BM25 statistics for the `documents` table.
```sql
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);
```
2. Add an embedding column to the `documents` table and fill it with the document sparse vectors. Run only the statements that match your vector extension.
```sql
ALTER TABLE documents ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE documents ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage)::svector; -- for pgvecto.rs users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- for pgvector users
```
3. (Optional) Create a vector index on the sparse vector column.
```sql
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops); -- for pgvecto.rs users
CREATE INDEX ON documents USING ivfflat (embedding sparsevec_ip_ops); -- for pgvector users
```
4. Add an embedding column to the `queries` table, fill it with the query sparse vectors, and perform a vector search to find the most relevant document for each query.
```sql
ALTER TABLE queries ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE queries ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query)::svector; -- for pgvecto.rs users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query, 'pgvector')::sparsevec; -- for pgvector users
SELECT sum((array[answer_pids] = array(SELECT pid FROM documents WHERE queries.dataset = documents.dataset ORDER BY queries.embedding <#> documents.embedding LIMIT 1))::int) FROM queries;
```
This workflow shows how to combine BM25 text queries with vector search in PostgreSQL using this extension. The top-1 recall of BM25 on this dataset is `0.77`; if you reproduce this result, your setup is working correctly.
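The final query in step 4 appears to return the number of queries whose top-ranked document matches `answer_pids`; dividing that count by the number of rows in `queries` gives the top-1 recall. The sketch below shows the same calculation in Python. The function name `top1_recall` and the sample pids are hypothetical and are not taken from the dataset.

```python
def top1_recall(expected_pids, retrieved_pids):
    """Fraction of queries whose top-ranked document is the expected one."""
    if len(expected_pids) != len(retrieved_pids):
        raise ValueError("one retrieved pid is required per query")
    if not expected_pids:
        return 0.0
    hits = sum(e == r for e, r in zip(expected_pids, retrieved_pids))
    return hits / len(expected_pids)


# Hypothetical pids for four queries; the second query misses.
expected = ["d1", "d2", "d3", "d4"]
retrieved = ["d1", "d9", "d3", "d4"]
recall = top1_recall(expected, retrieved)
print(recall)  # 0.75
```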
## Build from source
Before building, you should have `PostgreSQL`, `Rust` and `Cargo` installed on your system.
1. Install `cargo-pgrx`.
```sh
cargo install cargo-pgrx --version 0.12.0-alpha.1
```
2. Initialize `cargo-pgrx`.
```sh
cargo pgrx init --pg16=$(which pg_config) # assuming that you have PostgreSQL 16 installed
```
3. Build.
```sh
cargo pgrx install --release # if you want to install it on your machine
cargo pgrx package # if you want to package `pg_bestmatch`
```
## Comparison with pg_search
- `pg_bestmatch.rs` only generates sparse vectors and does not itself provide index-based search; that can be done with pgvecto.rs or pgvector.
- `pg_search` performs BM25 retrieval via the external `tantivy` engine, which may have limitations when combined with transactions, filters, or JOIN operations. Since `pg_bestmatch.rs` is entirely native to PostgreSQL, it is fully compatible with these operations.
## Reference
- `tokenize`
  - Description: Tokenizes an input string into individual tokens.
  - Example:
    ```sql
    SELECT tokenize('i have an apple'); -- result: {i,have,an,apple}
    ```
- `bm25_create`
  - Description: Creates BM25 statistics for a specified table and column.
  - Usage:
    ```sql
    SELECT bm25_create('documents', 'passage', 'documents_passage_bm25');
    ```
  - Parameters:
    - `table_name`: Name of the table.
    - `column_name`: Name of the column.
    - `stat_name`: Name of the BM25 statistics.
    - `b`: BM25 parameter (default 0.75).
    - `k`: BM25 parameter (default 1.2).
- `bm25_refresh`
  - Description: Updates the BM25 statistics to reflect any changes in the underlying data.
  - Usage:
    ```sql
    SELECT bm25_refresh('documents_passage_bm25');
    ```
  - Parameters:
    - `stat_name`: Name of the BM25 statistics to update.
- `bm25_drop`
  - Description: Deletes the BM25 statistics for a specified table and column.
  - Usage:
    ```sql
    SELECT bm25_drop('documents_passage_bm25');
    ```
  - Parameters:
    - `stat_name`: Name of the BM25 statistics to delete.
- `bm25_document_to_svector`
  - Description: Converts document text into a sparse vector representation.
  - Usage:
    ```sql
    SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text');
    ```
  - Parameters:
    - `stat_name`: Name of the BM25 statistics.
    - `document_text`: The text of the document.
    - `style`: Emits `pgvecto.rs`-style sparse vector or `pgvector`-style sparse vector.
- `bm25_query_to_svector`
  - Description: Converts query text into a sparse vector representation.
  - Usage:
    ```sql
    SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.');
    ```
  - Parameters:
    - `stat_name`: Name of the BM25 statistics.
    - `query_text`: The text of the query.
    - `style`: Emits `pgvecto.rs`-style sparse vector or `pgvector`-style sparse vector.
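To give a feel for the `b` and `k` parameters of `bm25_create`, the sketch below evaluates the term-frequency component of the standard Okapi BM25 formula. This README does not state the exact formula the extension uses, so treat this as an assumption; the function name `tf_weight` is hypothetical.

```python
def tf_weight(tf, doc_len, avg_len, b=0.75, k=1.2):
    """Term-frequency component of standard Okapi BM25 (assumed formula)."""
    # b controls how strongly document length is normalized (0 disables it).
    length_norm = 1 - b + b * doc_len / avg_len
    # k controls how quickly repeated occurrences of a term saturate.
    return tf * (k + 1) / (tf + k * length_norm)


# Saturation: repeated occurrences add less and less, bounded above by k + 1.
for tf in (1, 2, 5, 20):
    print(tf, round(tf_weight(tf, doc_len=100, avg_len=100), 3))

# Length normalization: with b > 0, a longer document gets a lower weight.
print(tf_weight(3, doc_len=200, avg_len=100) < tf_weight(3, doc_len=100, avg_len=100))
```

Under this formula, a larger `k` lets repeated terms keep contributing for longer, and a larger `b` penalizes long documents more.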