https://github.com/basementuniverse/bm25
Search for terms in an array of documents using BM25
- Host: GitHub
- URL: https://github.com/basementuniverse/bm25
- Owner: basementuniverse
- License: mit
- Created: 2024-02-22T23:39:41.000Z (11 months ago)
- Default Branch: master
- Last Pushed: 2024-02-22T23:40:09.000Z (11 months ago)
- Last Synced: 2024-04-14T04:43:09.170Z (9 months ago)
- Language: TypeScript
- Size: 36.1 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Okapi BM25
Search for terms in an array of documents using [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
## Installation
```
npm install @basementuniverse/bm25
```

## Usage

```typescript
import { Corpus } from '@basementuniverse/bm25';

const corpus = new Corpus([
  'This is a document',
  'Here is another document',
]);

const results = corpus.search('document');
```

`results` will look something like:

```json
[
  {
    "document": "This is a document",
    "score": 0.5
  },
  {
    "document": "Here is another document",
    "score": 0.5
  }
]
```

The documents passed into the `Corpus` constructor are treated as strings by default: each one is converted to lowercase and split on non-word characters.
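
As a rough illustration of that default processing, the sketch below lowercases a string and splits it on runs of non-word characters. `defaultTokenize` is a hypothetical name used only for this example; the library's actual implementation may differ (for instance in how it treats empty tokens or non-ASCII text).

```typescript
// Hypothetical helper, not part of the library's API.
// Lowercase the text, split on runs of non-word characters (\W+),
// and drop any empty strings left at the edges.
function defaultTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\W+/)
    .filter(token => token.length > 0);
}

const tokens = defaultTokenize('This is a document!');
// tokens: ['this', 'is', 'a', 'document']
```

Note that `\W` in JavaScript regular expressions only treats ASCII letters, digits and underscores as word characters, so accented or non-Latin text would be split apart under this assumption.
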
However, it is possible to pass in values of any type here, as long as you provide a function to convert each value to an array of strings. For example:
```typescript
const corpus = new Corpus(
  [
    {
      id: '1234',
      name: 'John Doe',
    },
    {
      id: '2345',
      name: 'Jane Doe',
    },
  ],
  {
    processor: document => [document.id, ...document.name.toLowerCase().split(' ')],
  },
);
```

Partial term matching can be enabled by passing `true` as the second argument to `search()`:

```typescript
const results = corpus.search('doe', true);
```

## Options

The second argument to the `Corpus` constructor is an options object, which can contain the following properties:
- `processor` (function) - A function to convert each document to an array of strings.
- `k1` (number between 1.2 and 2, default: 1.5) - Controls the impact of term frequency saturation.
- `b` (number between 0 and 1, default: 0.75) - Controls how much the document length affects the term frequency score.
- `gamma` (number, default: 1) - Addresses a deficiency of BM25 in which term frequency normalization by document length is not properly lower-bounded.
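
To make the roles of `k1`, `b` and `gamma` concrete, here is a minimal sketch of a BM25+ style score for a single document. It assumes that `gamma` plays the role of the lower-bound term (usually written δ) in BM25+ and that a common non-negative IDF variant is used; neither assumption is confirmed by this README. `Bm25Options`, `defaults` and `bm25PlusScore` are hypothetical names for illustration, not part of the library's API.

```typescript
// Hypothetical sketch; this is not the library's source code.
interface Bm25Options {
  k1: number;    // term frequency saturation
  b: number;     // strength of document length normalization
  gamma: number; // assumed to be the BM25+ lower bound added per matching term
}

const defaults: Bm25Options = { k1: 1.5, b: 0.75, gamma: 1 };

function bm25PlusScore(
  queryTerms: string[],
  doc: string[],
  corpus: string[][],
  options: Bm25Options = defaults,
): number {
  const { k1, b, gamma } = options;
  const averageLength =
    corpus.reduce((sum, d) => sum + d.length, 0) / corpus.length;

  let score = 0;
  for (const term of queryTerms) {
    // Term frequency: how often the term occurs in this document.
    const tf = doc.filter(t => t === term).length;
    if (tf === 0) continue;

    // Number of documents in the corpus that contain the term.
    const containing = corpus.filter(d => d.some(t => t === term)).length;

    // Non-negative IDF variant (an assumption; other variants exist).
    const idf = Math.log(
      1 + (corpus.length - containing + 0.5) / (containing + 0.5),
    );

    // Longer-than-average documents are penalized in proportion to b.
    const lengthNorm = 1 - b + b * (doc.length / averageLength);

    score += idf * ((tf * (k1 + 1)) / (tf + k1 * lengthNorm) + gamma);
  }
  return score;
}

const docs = [
  ['this', 'is', 'a', 'document'],
  ['here', 'is', 'another', 'document'],
];
const scoreA = bm25PlusScore(['document'], docs[0], docs);
const scoreB = bm25PlusScore(['document'], docs[1], docs);
```

Under these assumptions both documents receive the same score for `document`, since the term occurs once in each and the documents have equal length. Raising `k1` lets repeated occurrences keep adding to the score for longer, setting `b` to 0 disables length normalization, and setting `gamma` to 0 reduces the formula to plain BM25.
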