https://github.com/jankovicsandras/bm25opt
faster BM25 search algorithms in Python
- Host: GitHub
- URL: https://github.com/jankovicsandras/bm25opt
- Owner: jankovicsandras
- License: apache-2.0
- Created: 2024-10-24T08:30:32.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-11-14T13:26:45.000Z (6 months ago)
- Last Synced: 2025-03-28T00:44:14.644Z (about 2 months ago)
- Topics: bag-of-words, bm25, bm25l, bm25okapi, bm25plus, document-search, python3, text-search
- Language: Jupyter Notebook
- Homepage:
- Size: 69.3 KB
- Stars: 19
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# BM25opt
## faster BM25 search algorithms in Python
#### based on https://github.com/dorianbrown/rank_bm25 by Dorian Brown
#### Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0
----
## News:
- 1.1.0 supports updating the index with the new ```add_documents()```, ```delete_documents()``` and ```update_documents()``` functions, see Example 4
----
## Usage:
#### Input:
- ```corpus``` is a list of strings, e.g. ```[ 'bla bla bla', 'this is document two', ... ]```
- ```question``` is a string, e.g. ```'which text contains the word two?'```
- optional arguments (a combined illustrative call follows this list):
  - ```algo``` : the BM25 algorithm; the default is ```'okapi'```, with ```'l'``` and ```'plus'``` also available
  - ```tokenizer_function``` : the default is ```tokenizer_default```, which splits on whitespace, lowercases and removes common punctuation
  - ```idf_algo``` : the default uses the same IDF as ```rank_bm25```; the values ```'okapi'```, ```'l'``` and ```'plus'``` can override this to fix https://github.com/dorianbrown/rank_bm25/issues/35
  - ```k1```, ```b```, ```epsilon```, ```delta``` : constants with standard default values, see https://en.wikipedia.org/wiki/Okapi_BM25
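A hypothetical call that sets several of these options at once (it is assumed here, per the list above, that all of them are accepted as keyword arguments; the values are only illustrative):
```python
bm25opt_index = BM25opt(
    corpus,
    algo='plus',                           # BM25+ scoring
    tokenizer_function=tokenizer_default,  # or any string -> token-list callable
    idf_algo='plus',                       # override the IDF to match the algorithm
    k1=1.5, b=0.75, delta=1.0              # standard BM25 constants
)
```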
#### Example 1:
This example uses the default tokenizer and the default BM25Okapi algorithm, and returns the top 5 highest-scoring document ids and scores.
```python
# creating the index
bm25opt_index = BM25opt( corpus )
# search
results = bm25opt_index.topk( question, 5 )
print( 'results[0] id', results[0][0], 'results[0] score', results[0][1], 'results[0] document', corpus[ results[0][0] ] )
```
#### Example 2:
This example returns the list of document scores (in the same order as the documents in ```corpus```) and shows algorithm selection and a custom tokenizer function.
```python
bm25opt_index = BM25opt( corpus, algo='plus', tokenizer_function=some_tokenizer_function )
doc_scores = bm25opt_index.get_scores( question )
```
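```some_tokenizer_function``` above stands for any callable that maps a string to a list of tokens; a hypothetical example:
```python
import re

def some_tokenizer_function( text ):
    # hypothetical tokenizer: lowercase, keep only alphanumeric runs
    return re.findall( r'[a-z0-9]+', text.lower() )
```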
#### Example 3: comparison with rank_bm25
This example shows the score lists and their similarity with [```rank_bm25```](https://github.com/dorianbrown/rank_bm25); NOTE: BM25opt input is not tokenized beforehand.
```python
from rank_bm25 import BM25Okapi

corpus = [ ... ]
question = '...'
tokenized_corpus = [ tokenizer_default(document) for document in corpus ]
tokenized_question = tokenizer_default( question )
rank_bm25_index = BM25Okapi( tokenized_corpus )
bm25opt_index = BM25opt( corpus, algo='okapi' )
rank_bm25_scores = rank_bm25_index.get_scores( tokenized_question )
bm25opt_scores = bm25opt_index.get_scores( question )
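# the two score lists should agree closely (up to floating-point tolerance);
# numpy is assumed to be available here for the comparison
import numpy as np
assert np.allclose( rank_bm25_scores, bm25opt_scores )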
```
#### Example 4: updating the index
```python
# creating the index
bm25opt_index = BM25opt( corpus )
# add new documents
bm25opt_index.add_documents( corpus2 )
# delete from the index
delete_ids = [ 1, 3, 5 ] # list of document ids (indices in corpus) to delete from the index
bm25opt_index.delete_documents( delete_ids )
# in-place update changed documents in the index
update_ids = [ 1, 3, 5 ] # list of document ids (indices in corpus) to change
updated_documents = [ 'first changed document', 'second changed document', ... ]
bm25opt_index.update_documents( update_ids, updated_documents )
```
----
### Notes:
This is an optimized variant of rank_bm25. The key insight is that almost everything can be calculated at index creation time in ```__init__()```, resulting in a dict that maps each word to its list of per-document scores, e.g.
```python
wsmap = {
'word1': [ word1_doc1_score, word1_doc2_score, ... ],
'word2': [ word2_doc1_score, word2_doc2_score, ... ],
...
}
```
then the query function simply adds up the score lists for each word in the question, e.g.
```python
question = 'word1 word2'
doc_scores = [ wsmap['word1'][0] + wsmap['word2'][0], wsmap['word1'][1] + wsmap['word2'][1], ... ]
```
Another important change is that inputs are not pre-tokenized: the tokenizer function is registered with the index instead, which avoids situations where the corpus would be tokenized with a different function than the queries later. A simple ```tokenizer_default()``` function is provided as the default.
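To make the idea concrete, here is a minimal self-contained sketch of this precompute-then-sum approach (not the actual BM25opt implementation; it assumes a plain lowercase/whitespace tokenizer and one common Okapi-style IDF variant):
```python
import math
from collections import Counter

def simple_tokenize( text ):
    # stand-in for tokenizer_default: lowercase, split on whitespace
    return text.lower().split()

def build_wsmap( corpus, k1=1.5, b=0.75 ):
    # index time: precompute a word -> per-document BM25 score list
    docs = [ simple_tokenize( d ) for d in corpus ]
    N = len( docs )
    avgdl = sum( len( d ) for d in docs ) / N
    df = Counter( w for d in docs for w in set( d ) )  # document frequencies
    wsmap = { w : [ 0.0 ] * N for w in df }
    for i, d in enumerate( docs ):
        for w, f in Counter( d ).items():
            idf = math.log( ( N - df[w] + 0.5 ) / ( df[w] + 0.5 ) + 1.0 )
            wsmap[w][i] = idf * f * ( k1 + 1 ) / ( f + k1 * ( 1 - b + b * len( d ) / avgdl ) )
    return wsmap

def get_scores( wsmap, question, n_docs ):
    # query time: just sum the precomputed score lists of the question words
    scores = [ 0.0 ] * n_docs
    for w in simple_tokenize( question ):
        for i, s in enumerate( wsmap.get( w, [] ) ):
            scores[i] += s
    return scores

corpus = [ 'bla bla bla', 'this is document two' ]
wsmap = build_wsmap( corpus )
print( get_scores( wsmap, 'which text contains the word two?', len( corpus ) ) )
```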