https://github.com/stephanj/bm25
A BM25 Java implementation using streams, stop words and stemming.
bm25 llm nlp rerank stemming
- Host: GitHub
- URL: https://github.com/stephanj/bm25
- Owner: stephanj
- License: mit
- Created: 2024-02-17T10:46:48.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-02-17T17:11:44.000Z (over 1 year ago)
- Last Synced: 2024-12-02T17:30:00.805Z (6 months ago)
- Topics: bm25, llm, nlp, rerank, stemming
- Language: Java
- Homepage:
- Size: 34.2 KB
- Stars: 28
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# BM25 Java Implementation
BM25 (Best Matching 25) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query.
See also https://en.wikipedia.org/wiki/Okapi_BM25
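To make the ranking function concrete, here is a minimal sketch of a BM25 scorer. It is not this library's code: the class `Bm25Sketch`, the constants `k1 = 1.5` and `b = 0.75`, the IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))`, and the lowercase whitespace tokenization are all assumptions, chosen because they appear to reproduce the sample scores in the usage section of this README.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Illustrative BM25 scorer (hypothetical, not the library's implementation). */
final class Bm25Sketch {
    // Assumed parameters: term-frequency saturation and length normalization.
    static final double K1 = 1.5;
    static final double B = 0.75;

    /** Assumed tokenization: lowercase, then split on whitespace. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    /** BM25 score of the document at docIndex for the given query. */
    static double score(List<String> corpus, int docIndex, String query) {
        List<List<String>> docs = corpus.stream()
                .map(Bm25Sketch::tokenize)
                .collect(Collectors.toList());
        double avgLen = docs.stream().mapToInt(List::size).average().orElse(1.0);
        List<String> doc = docs.get(docIndex);

        double total = 0.0;
        for (String term : tokenize(query)) {
            // Number of documents containing the term at least once.
            long docFreq = docs.stream().filter(d -> d.contains(term)).count();
            // Occurrences of the term in the scored document.
            long termFreq = doc.stream().filter(term::equals).count();
            double idf = Math.log(1.0 + (docs.size() - docFreq + 0.5) / (docFreq + 0.5));
            // Longer-than-average documents are penalized through this factor.
            double lengthNorm = K1 * (1.0 - B + B * doc.size() / avgLen);
            total += idf * termFreq * (K1 + 1.0) / (termFreq + lengthNorm);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
                "I love programming",
                "Java is my favorite programming language",
                "I enjoy writing code in Java",
                "Java is another popular programming language",
                "I find programming fascinating",
                "I love Java",
                "I prefer Java over Python");
        // Under the assumptions above this prints a value close to 0.6879.
        System.out.println(score(corpus, 0, "programming"));
    }
}
```

The sketch recomputes document statistics on every call to keep the formula visible; a real implementation would precompute term and document frequencies once per corpus.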
# Simple usage
```java
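import java.util.List;
import java.util.Map;

// Each string is one document; results refer to documents by their list index.
List<String> corpus = List.of(
    "I love programming",
    "Java is my favorite programming language",
    "I enjoy writing code in Java",
    "Java is another popular programming language",
    "I find programming fascinating",
    "I love Java",
    "I prefer Java over Python"
);

BM25 bm25 = new BM25(corpus);

// Each entry maps a document index to its BM25 score, highest score first.
List<Map.Entry<Integer, Double>> results = bm25.search("I love java");
for (Map.Entry<Integer, Double> entry : results) {
    System.out.println("Sentence " + entry.getKey() + " : Score = " + entry.getValue() + " - [" + corpus.get(entry.getKey()) + "]");
}
```

```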
Sentence 5 : Score = 2.286729869084079 - [I love Java]
Sentence 0 : Score = 1.8387268317084793 - [I love programming]
Sentence 6 : Score = 0.7294916714788526 - [I prefer Java over Python]
Sentence 2 : Score = 0.6674701123652661 - [I enjoy writing code in Java]
Sentence 4 : Score = 0.40211004330297734 - [I find programming fascinating]
Sentence 1 : Score = 0.33373505618263305 - [Java is my favorite programming language]
Sentence 3 : Score = 0.33373505618263305 - [Java is another popular programming language]
```

```java
bm25.search("programming");
```

```
Sentence 0 : Score = 0.687935390645563 - [I love programming]
Sentence 4 : Score = 0.6174639603843102 - [I find programming fascinating]
Sentence 1 : Score = 0.5124700885780712 - [Java is my favorite programming language]
Sentence 3 : Score = 0.5124700885780712 - [Java is another popular programming language]
Sentence 2 : Score = 0.0 - [I enjoy writing code in Java]
Sentence 5 : Score = 0.0 - [I love Java]
Sentence 6 : Score = 0.0 - [I prefer Java over Python]
```

# With stop words
Get better results by removing language-specific stop words.
The stop word lists are based on the ISO lists provided at https://github.com/stopwords-iso
The current implementation supports English, French, German, Dutch, Italian and Spanish stop words.
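As a rough illustration of what stop word removal does before scoring, the sketch below filters tokens against a small set using streams. The class `StopWordSketch` and its five-word set are hypothetical and only stand in for the much longer ISO lists; the library's own filtering may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative stop word filter (hypothetical, not the library's code). */
final class StopWordSketch {
    // Tiny assumed stop word set; the real English list is far longer.
    static final Set<String> STOP_WORDS = Set.of("i", "is", "my", "in", "over");

    /** Lowercases the text, splits on whitespace and drops stop words. */
    static List<String> removeStopWords(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [love, java]
        System.out.println(removeStopWords("I love Java"));
    }
}
```

With words like "I" removed, a query such as "I love java" no longer rewards documents merely for containing "I". The library applies its own lists when a `StopWords` value is passed to the constructor: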
```Java
BM25 bm25 = new BM25(corpus, StopWords.ENGLISH);
```

# With Stemming
Get better results by using stemming.
Stemming maps different forms of the same word to a common "stem".
For example, the English stemmer maps running, run and runs to run,
so a search for 'running' also finds documents that contain only the other forms.

```java
BM25 bm25 = new BM25(corpus, StopWords.ENGLISH, new EnglishStemmer());
```

The default implementation uses the Porter2 stemmer from Snowball.
You can add other Stemmer implementations, for example, CoreNLP or Lucene.
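To show the idea behind the running, run, runs example, here is a deliberately naive suffix-stripping sketch. It is not Porter2 and not this library's `Stemmer` API: the class `NaiveStemmerSketch` and its two rules are assumptions made only for illustration.

```java
import java.util.Locale;

/** Toy stemmer for illustration only; real stemmers such as Porter2 use many more rules. */
final class NaiveStemmerSketch {
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 5 && w.endsWith("ing")) {
            // Strip "ing", then undo a doubled final letter ("runn" -> "run").
            w = w.substring(0, w.length() - 3);
            int n = w.length();
            if (n >= 2 && w.charAt(n - 1) == w.charAt(n - 2)) {
                w = w.substring(0, n - 1);
            }
        } else if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            // Strip a plural or third-person "s" ("runs" -> "run").
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        // prints run run run
        System.out.println(stem("running") + " " + stem("runs") + " " + stem("run"));
    }
}
```

The doubled-letter rule is too crude for general use (it would turn "falling" into "fal"), which is one reason to rely on an established stemmer instead.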