https://github.com/tensorchord/pg_bestmatch.rs

Generate BM25 sparse vector inside PostgreSQL
https://github.com/tensorchord/pg_bestmatch.rs
Last synced: about 1 year ago
JSON representation
Generate BM25 sparse vector inside PostgreSQL
Host: GitHub
URL: https://github.com/tensorchord/pg_bestmatch.rs
Owner: tensorchord
License: apache-2.0
Created: 2024-05-27T06:29:54.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-11-05T04:02:56.000Z (over 1 year ago)
Last Synced: 2025-04-02T08:48:10.353Z (over 1 year ago)
Language: Rust
Homepage:
Size: 1.49 MB
Stars: 63
Watchers: 4
Forks: 11
Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # pg_bestmatch.rs

This PostgreSQL extension provides functionalities for BM25 text queries, generateing BM25 statistic sparse vectors for text. BM25 outperforms dense vector-based retrieval methods in many [RAG benchmark tasks](https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval).

User can use vector search extensions such as `pgvecto.rs` or `pgvector` for efficient searches in postgres.

> [!IMPORTANT]  

> Based on our initial tests, HNSW indexing does not support the sparse vectors generated by BM25 very well. The high sparsity prevents effective navigation within the graph.

* [Installation](#installation)

* [How does it work?](#how-does-it-work)

* [Usage](#usage)

* [Build from source](#build-from-source)

* [Comparison with pg_search](#comparison-with-pg_search)

* [Reference](#Reference)

## Installation

```sql

CREATE EXTENSION pg_bestmatch;

SET search_path TO public, bm_catalog;

```

## How does it work?

- Create an BM25 statistics based on your document set by `bm25_create(table_name, column_name, statistic_name);`. It will create a materilized view to record the stats. 

- Generate document sparse vector by `bm25_document_to_svector(statistic_name, passage)`

- For query, generate query sparse vector `bm25_query_to_svector(statistic_name, query)`

- Calculate the score by dot product between the query sparse vector and the document sparse vector

- Currently we use huggingface tokenizer with `bert-base-uncased` vocabulary set to tokenize words. Might support more configuration on tokenizer in the future.

## Usage

Here is an example workflow demonstrating the usage of this extension with the example of [Stanford LoCo benchmark](https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval).

0. Load the dataset. Here is a script for you if you want to experience `pg_bestmatch` with the dataset.

```sh

wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Documents/parquet/default/test/0.parquet -O documents.parquet

wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Queries/parquet/default/test/0.parquet -O queries.parquet

```

```python

import pandas as pd

from sqlalchemy import create_engine

import numpy as np

from psycopg2.extensions import register_adapter, AsIs

def adapter_numpy_float64(numpy_float64):

    return AsIs(numpy_float64)

def adapter_numpy_int64(numpy_int64):

    return AsIs(numpy_int64)

def adapter_numpy_float32(numpy_float32):

    return AsIs(numpy_float32)

def adapter_numpy_int32(numpy_int32):

    return AsIs(numpy_int32)

def adapter_numpy_array(numpy_array):

    return AsIs(tuple(numpy_array))

register_adapter(np.float64, adapter_numpy_float64)

register_adapter(np.int64, adapter_numpy_int64)

register_adapter(np.float32, adapter_numpy_float32)

register_adapter(np.int32, adapter_numpy_int32)

register_adapter(np.ndarray, adapter_numpy_array)

db_url = "postgresql://localhost:5432/pg_bestmatch_test"

engine = create_engine(db_url)

def load_documents():

    df = pd.read_parquet("documents.parquet")

    df.to_sql("documents", engine, if_exists='replace', index=False)

def load_queries():

    df = pd.read_parquet("queries.parquet")

    df['answer_pids'] = df['answer_pids'].apply(lambda x: str(x[0]))    

    df.to_sql("queries", engine, if_exists='replace', index=False)

load_documents()

load_queries()

```

1. Create BM25 statistics for the `documents` table.

```sql

SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);

```

2. Add an embedding column to the `documents` and `queries` tables and update the embeddings for documents and queries.

```sql

ALTER TABLE documents ADD COLUMN embedding svector; -- for pgvecto.rs users

ALTER TABLE documents ADD COLUMN embedding sparsevec; -- for pgvector users

UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage)::svector; -- for pgvecto.rs users

UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- for pgvector users

```

3. (Optional) Create a vector index on the sparse vector column.

```sql

CREATE INDEX ON documents USING vectors (embedding svector_dot_ops); -- for pgvecto.rs users

CREATE INDEX ON documents USING ivfflat (embedding sparsevec_ip_ops); -- for pgvector users

```

4. Perform a vector search to find the most relevant documents for each query.

```sql

ALTER TABLE queries ADD COLUMN embedding svector; -- for pgvecto.rs users

ALTER TABLE queries ADD COLUMN embedding sparsevec; -- for pgvector users

UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query)::svector; -- for pgvecto.rs users

UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query, 'pgvector')::sparsevec; -- for pgvector users

SELECT sum((array[answer_pids] = array(SELECT pid FROM documents WHERE queries.dataset = documents.dataset ORDER BY queries.embedding <#> documents.embedding LIMIT 1))::int) FROM queries;

```

This workflow showcases how to leverage BM25 text queries and vector search in PostgreSQL using this extension. The Top 1 recall of BM25 on this dataset is `0.77`. If you reproduce the result, your operations are correct.

## Build from source

Before building, you should have `PostgreSQL`, `Rust` and `Cargo` installed on your system.

1. Install `cargo-pgrx`.

```sh

cargo install cargo-pgrx --version v0.12.0-alpha.1

```

2. Initialize `cargo-pgrx`.

```sh

cargo pgrx init --pg16=$(which pg_config)   # assuming that you have PostgreSQL 16 installed

```

3. Build.

```sh

cargo pgrx install --release    # if you want to install it on your machine

cargo pgrx package  # if you want to package `pg_bestmatch`

```

## Comparison with pg_search 

- `pg_bestmatch.rs` only provides methods for generating sparse vectors and does not support index-based search (which can be achieved by pgvecto.rs or pgvector). 

- `pg_search` performs BM25 retrieval via the external `tantivy` engine, which may have limitations when combined with transactions, filters, or JOIN operations. Since `pg_bestmatch.rs` is entirely native to Postgres, it offers full compatibility with these operations inside postgres.

## Reference

- `tokenize`

  - Description: Tokenizes an input string into individual tokens.

  - Example:

    ```sql

    SELECT tokenize('i have an apple'); -- result: {i,have,an,apple}

    ```

- `bm25_create`

  - Description: Creates BM25 statistics for a specified table and column.

  - Usage: 

    ```sql

    SELECT bm25_create('documents', 'passage', 'documents_passage_bm25');

    ```

  - Parameters:

    - `table_name`: Name of the table.

    - `column_name`: Name of the column.

    - `stat_name`: Name of the BM25 statistics.

    - `b`: BM25 parameter (default 0.75).

    - `k`: BM25 parameter (default 1.2).

- `bm25_refresh`

  - Description: Updates the BM25 statistics to reflect any changes in the underlying data.

  - Usage:

    ```sql

    SELECT bm25_refresh('documents_passage_bm25');

    ```

  - Parameters:

    - `stat_name`: Name of the BM25 statistics to update.

- `bm25_drop`

  - Description: Deletes the BM25 statistics for a specified table and column.

  - Usage:

    ```sql

    SELECT bm25_drop('documents_passage_bm25');

    ```

  - Parameters:

    - `stat_name`: Name of the BM25 statistics to delete.

- `bm25_document_to_svector`

  - Description: Converts document text into a sparse vector representation.

  - Usage:

    ```sql

    SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text');

    ```

  - Parameters:

    - `stat_name`: Name of the BM25 statistics.

    - `document_text`: The text of the document.

    - `style`: Emits `pgvecto.rs`-style sparse vector or `pgvector`-style sparse vector.

- `bm25_query_to_svector`

  - Description: Converts query text into a sparse vector representation.

  - Usage:

    ```sql

    SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.');

    ```

  - Parameters:

    - `stat_name`: Name of the BM25 statistics.

    - `query_text`: The text of the query.

    - `style`: Emits `pgvecto.rs`-style sparse vector or `pgvector`-style sparse vector.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/tensorchord/pg_bestmatch.rs

Awesome Lists containing this project

README