# BM25 Example


I based this example on the YouTube video - [How to Create a BM25 Index in Python with Rank BM25 (Search Engine)](https://www.youtube.com/watch?v=ysvpxiPAHLg). The example uses the [rank_bm25](https://github.com/dorianbrown/rank_bm25) Python library.

## What is BM25

BM25 is a ranking function used by search engines to determine which documents are most relevant to a search query. It scores documents based on how often the query terms appear in a document (term frequency), how common those terms are across all documents (inverse document frequency), and how long the document is relative to the rest of the corpus.
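
Concretely, the Okapi BM25 variant (the one the ```BM25Okapi``` class used below implements) scores a document *D* against a query *Q* with terms *q<sub>1</sub> … q<sub>n</sub>* as:

```math
\text{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

where *f(q<sub>i</sub>, D)* is how many times term *q<sub>i</sub>* appears in *D*, *|D|* is the document length, *avgdl* is the average document length across the corpus, IDF(*q<sub>i</sub>*) gives rare terms more weight, and *k<sub>1</sub>* and *b* are tuning parameters (commonly around 1.5 and 0.75).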

## Example in Python

### Install rank_bm25 library

The [rank_bm25](https://github.com/dorianbrown/rank_bm25) Python library makes it easy to use BM25 ranking algorithms in Python.

```pip install rank_bm25``` or, in a Jupyter notebook, run ```!pip install rank_bm25```

To confirm that the installation was successful, import the ```BM25Okapi``` class from the rank_bm25 library:

```from rank_bm25 import BM25Okapi```

### Create corpus of documents

In this example we have a corpus of three documents:

```
corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]
```

### Tokenize each document

Tokenization breaks sentences down into individual words. Each word is a token. Tokenization is an important preprocessing step because BM25 operates on individual tokens.

The code below tokenizes each document by splitting each sentence into a list of words:

```
tokenized_corpus = []
for doc in corpus:
    # Split each document on whitespace; each word becomes a token
    doc_tokens = doc.split()
    tokenized_corpus.append(doc_tokens)
```

If you print ```tokenized_corpus```, it looks like this:

```
print(tokenized_corpus)

---- result below ----

[['Hello', 'there', 'good', 'man!'],
['It', 'is', 'quite', 'windy', 'in', 'London'],
['How', 'is', 'the', 'weather', 'today?']]
```
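
Note that a plain ```.split()``` keeps capitalization and punctuation, so *man!* and *man* would be treated as different tokens. A common refinement, sketched below with a hypothetical ```tokenize``` helper (not part of rank_bm25), is to lowercase and strip punctuation before splitting:

```
import string

def tokenize(text):
    # Lowercase and drop punctuation so "London" and "london," become the same token
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

tokenized_corpus = [tokenize(doc) for doc in corpus]
```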

### Create a BM25 index from the tokenized document corpus

```bm25 = BM25Okapi(tokenized_corpus)```
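
```BM25Okapi``` also exposes the standard BM25 tuning parameters. In the version of [rank_bm25](https://github.com/dorianbrown/rank_bm25) I looked at, ```k1``` (term frequency saturation) and ```b``` (document length normalization) default to 1.5 and 0.75; double check the library source for the exact signature. A sketch of setting them explicitly:

```
# k1 controls how quickly repeated terms stop adding to the score
# b controls how strongly longer documents are penalized
# Defaults shown here are assumptions based on the rank_bm25 source - verify locally
bm25 = BM25Okapi(tokenized_corpus, k1=1.5, b=0.75)
```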

### Query the BM25 index

We can search for *windy London*. The document *It is quite windy in London* should be returned as the most relevant match to our search.

```
query = "windy London"
tokenized_query = query.split(" ")

# Score every document in the corpus against the query
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)

---- result below ----

[0, 0.937, 0]
```

The ```get_scores``` method returns a relevance score for each document in the corpus. Higher scores indicate greater relevance, and a score of 0 means none of the query terms appear in the document. Notice that the second document has the highest score. This makes sense given the second document in our corpus is *It is quite windy in London* and our query is *windy London*.
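
If you want a full ranking rather than just the raw score array, one simple approach (plain Python, nothing specific to rank_bm25) is to pair each score with its document and sort:

```
# Sort documents by descending BM25 score
ranked = sorted(zip(doc_scores, corpus), key=lambda pair: pair[0], reverse=True)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```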

We can use the ```get_top_n``` method to return the most relevant documents themselves, rather than just the relevance scores that ```get_scores``` returns:

```
# Retrieve the single best-matching document (n=1)
doc = bm25.get_top_n(tokenized_query, corpus, n=1)
print(doc)

---- result below ----

['It is quite windy in London']
```
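
For reference, here is the whole example as a single runnable script, combining the steps above:

```
from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

# Tokenize each document into a list of words
tokenized_corpus = [doc.split() for doc in corpus]

# Build the BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Score the corpus against a query and fetch the best match
query = "windy London"
tokenized_query = query.split(" ")
print(bm25.get_scores(tokenized_query))
print(bm25.get_top_n(tokenized_query, corpus, n=1))
```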