https://github.com/mayank-02/boolean-retrieval-model

Implementation of Boolean Model of Information Retrieval

boolean-model information-retrieval python unranked-retrieval

Host: GitHub
URL: https://github.com/mayank-02/boolean-retrieval-model
Owner: mayank-02
License: mit
Created: 2020-12-09T12:32:52.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2024-09-03T12:49:46.000Z (almost 2 years ago)
Last Synced: 2025-03-29T09:51:15.914Z (over 1 year ago)
Topics: boolean-model, information-retrieval, python, unranked-retrieval
Language: Python
Homepage:
Size: 20.5 KB
Stars: 3
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Boolean Retrieval Model

The `BooleanModel` class in `BooleanModel.py` implements a toy search engine that illustrates the boolean retrieval model for text documents.

The program asks you to enter a search query, and then returns all documents matching the query (exact match), in no particular order (unranked retrieval).

The corpus consists of short stories downloaded from [rong-chang.com](https://www.rong-chang.com/qa2/).

## Getting Started

- Install Python 3.6+

- Install all pip requirements from the `requirements.txt`:

```bash
$ python3 -m pip install -r requirements.txt
```

- To download the stopwords used by the model, open your terminal or command prompt and enter the following commands:

```bash
$ python3
>>> import nltk
>>> nltk.download('stopwords')
```

## Usage

```python
# Import the boolean model
from BooleanModel import BooleanModel

# Create a model on your corpus of documents by passing its path as an argument
model = BooleanModel("./corpus/*")

# Query it as many times as you like
results = model.query("book")
# results = ['Freeway Chase Ends at Newsstand.txt', 'A Festival of Books.txt']

# Querying a word that is not in the corpus
results = model.query("pikachu")
# Warning: pikachu was not found in the corpus!
# results = []
```

### Queries

#### Supported Queries

- Single term => `ash`

- AND => `ash & may`

- OR => `ash | may & brown` (evaluated as `ash | ( may & brown )`)

- Parentheses => `( ash | may ) & brown`

- NOT => `( ~ash | may ) & brown`

> Precedence: NOT (~) > AND (&) > OR (|)
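
To make the precedence rule concrete, here is a minimal sketch of a shunting-yard parser and evaluator over a tiny hand-built index. It is not the code in `BooleanModel.py`: the `INDEX`, `ALL_DOCS`, `tokenize`, `to_postfix`, and `evaluate` names are hypothetical, the sketch does no stemming, it assumes well-formed queries, and it is more permissive than the model described here (for example, it accepts NOT in front of a parenthesized group).

```python
# Hypothetical toy index: term -> set of document names containing it.
INDEX = {
    "ash": {"d1", "d2"},
    "may": {"d2", "d3"},
    "brown": {"d3", "d4"},
}
ALL_DOCS = {"d1", "d2", "d3", "d4"}

# Higher number binds tighter: NOT > AND > OR.
PRECEDENCE = {"~": 3, "&": 2, "|": 1}


def tokenize(query):
    """Split a query into terms, operators, and parentheses."""
    for ch in "()&|~":
        query = query.replace(ch, f" {ch} ")
    return query.split()


def to_postfix(tokens):
    """Shunting-yard: reorder infix tokens into postfix (RPN) order."""
    output, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the matching "("
        elif tok == "~":
            stack.append(tok)  # prefix operator: nothing to pop yet
        elif tok in PRECEDENCE:
            # Binary operator: first emit operators that bind at least as tightly.
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # plain term
    while stack:
        output.append(stack.pop())
    return output


def evaluate(query):
    """Evaluate a boolean query and return the set of matching documents."""
    stack = []
    for tok in to_postfix(tokenize(query)):
        if tok == "~":
            stack.append(ALL_DOCS - stack.pop())
        elif tok == "&":
            right, left = stack.pop(), stack.pop()
            stack.append(left & right)
        elif tok == "|":
            right, left = stack.pop(), stack.pop()
            stack.append(left | right)
        else:
            stack.append(INDEX.get(tok, set()))
    return stack.pop()


print(sorted(evaluate("ash | may & brown")))      # ['d1', 'd2', 'd3']
print(sorted(evaluate("( ash | may ) & brown")))  # ['d3']
```

With this toy index the two queries return different results, which shows that the parentheses change the grouping.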

#### Unsupported Queries

- NOT operator applied to an intermediate result => `~( ash | may ) & brown`

- A space between the NOT operator and its operand => `~ ash & may`
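
One possible workaround for the first case is to push the NOT inward with De Morgan's laws, so that it only ever applies to single terms: `~( ash | may ) & brown` is logically equivalent to `~ash & ~may & brown`. Whether the model accepts the rewritten query is an assumption based on the supported-query list above. The sketch below only checks the equivalence on hypothetical posting sets; the document names are made up.

```python
# Hypothetical posting sets over a four-document corpus.
all_docs = {"d1", "d2", "d3", "d4"}
ash = {"d1", "d2"}
may = {"d2", "d3"}
brown = {"d3", "d4"}

# Unsupported form: NOT applied to an intermediate result.
negated_group = (all_docs - (ash | may)) & brown

# Rewritten form: NOT applied only to single terms (De Morgan's law).
negated_terms = (all_docs - ash) & (all_docs - may) & brown

assert negated_group == negated_terms
print(sorted(negated_terms))  # ['d4']
```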

## Methodology

1. Preprocessing to build a standard inverted index

   - Remove special characters

   - Remove digits

   - Tokenize

   - Lowercase

   - Stem using `PorterStemmer`

   - Add unique words and their postings to the index

2. See [An example information retrieval problem](https://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html) from *Introduction to Information Retrieval* for the internals of the boolean model and query evaluation
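
The preprocessing steps in item 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `BooleanModel.py`: the `build_index` function and its `stem` parameter are hypothetical, the corpus is an in-memory dict rather than files on disk, and stemming defaults to a no-op so the snippet runs without NLTK.

```python
import re
from collections import defaultdict


def build_index(docs, stem=lambda word: word):
    """Map each preprocessed term to the sorted list of documents containing it.

    docs: dict of document name -> raw text (hypothetical in-memory corpus).
    stem: callable applied to each lowercased token; identity by default.
    """
    postings = defaultdict(set)
    for name, text in docs.items():
        # Replace anything that is not an ASCII letter or whitespace
        # (special characters, digits) with a space.
        cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
        # Tokenize on whitespace, lowercase, then stem.
        for token in cleaned.split():
            postings[stem(token.lower())].add(name)
    return {term: sorted(names) for term, names in postings.items()}


docs = {
    "a.txt": "The 3 books were SOLD!",
    "b.txt": "A book, sold twice.",
}
index = build_index(docs)
print(index["sold"])   # ['a.txt', 'b.txt']
print(index["books"])  # ['a.txt']
```

Without stemming, `books` and `book` stay separate terms. Passing `nltk.stem.PorterStemmer().stem` as `stem` (which requires NLTK to be installed) would conflate them into `book`, which is the purpose of the stemming step.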

## Note

- If you get a `UnicodeDecodeError` mentioning an `invalid start byte`, check the character encoding of the documents in the corpus. (Currently, `utf-8` is used.)
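
As a rough illustration of one way to track down the offending files, the sketch below tries to decode every file matched by a glob pattern as UTF-8 and reports the ones that fail. The `find_non_utf8` helper is hypothetical and not part of the repository.

```python
import glob
import os


def find_non_utf8(pattern):
    """Return the files matching `pattern` whose bytes are not valid UTF-8."""
    bad = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip directories and other non-files
        try:
            with open(path, encoding="utf-8") as handle:
                handle.read()
        except UnicodeDecodeError:
            bad.append(path)
    return bad


# Prints an empty list if every file decodes cleanly (or nothing matches).
print(find_non_utf8("./corpus/*"))
```

Files reported here can then be re-saved as UTF-8 in a text editor or converted with a tool of your choice.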

## Authors

[Mayank Jain](https://github.com/mayank-02)

## License

MIT
