https://github.com/mayank-02/boolean-retrieval-model
Implementation of Boolean Model of Information Retrieval
- Host: GitHub
- URL: https://github.com/mayank-02/boolean-retrieval-model
- Owner: mayank-02
- License: mit
- Created: 2020-12-09T12:32:52.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2024-09-03T12:49:46.000Z (over 1 year ago)
- Last Synced: 2025-03-29T09:51:15.914Z (about 1 year ago)
- Topics: boolean-model, information-retrieval, python, unranked-retrieval
- Language: Python
- Size: 20.5 KB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Boolean Retrieval Model
The `BooleanModel` class in `BooleanModel.py` implements a toy search engine that illustrates the Boolean retrieval model for text documents.
The program asks you to enter a search query and then returns every document that matches it exactly, in no particular order (unranked retrieval).
The corpus consists of short stories downloaded from [rong-chang.com](https://www.rong-chang.com/qa2/).
## Getting Started
- Install Python 3.6+
- Install the pip requirements listed in `requirements.txt`:
```bash
$ python3 -m pip install -r requirements.txt
```
- To download the stopwords used by the model, open a terminal or command prompt and enter the following commands:
```bash
$ python3
>>> import nltk
>>> nltk.download('stopwords')
```
## Usage
```python
# Import boolean model
from BooleanModel import BooleanModel
# Create a model on your corpus of documents by passing its path as an argument
model = BooleanModel("./corpus/*")
# Query on it as many times as you like
results = model.query("book")
# results = ['Freeway Chase Ends at Newsstand.txt', 'A Festival of Books.txt']
# Querying on a word which is not in the corpus
results = model.query("pikachu")
# Warning: pikachu was not found in the corpus!
# results = []
```
### Queries
#### Supported Queries
- Single term => `ash`
- AND => `ash & may`
- OR => `ash | may & brown`
- Parentheses => `( ash | may ) & brown`
- NOT => `( ~ash | may ) & brown`
> Precedence: NOT (~) > AND (&) > OR (|)
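
One common way to apply this precedence is to convert the query to postfix form with the shunting-yard algorithm. The sketch below is an illustration only and is not taken from `BooleanModel.py`; the names `to_postfix` and `PRECEDENCE` are hypothetical. It assumes tokens are separated by spaces and that `~` is attached directly to its term, so NOT binds tightest simply because `~ash` is a single token.

```python
# Hypothetical helper, not part of BooleanModel.py.
# AND binds tighter than OR; NOT is part of the term token itself ("~ash").
PRECEDENCE = {"&": 2, "|": 1}


def to_postfix(query):
    """Convert a space-separated infix query to a postfix token list."""
    output, stack = [], []
    for token in query.split():
        if token in PRECEDENCE:
            # Pop operators of higher or equal precedence (left-associative).
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[token]):
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        else:
            # A term, possibly negated, e.g. "ash" or "~ash".
            output.append(token)
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


print(to_postfix("ash | may & brown"))
# ['ash', 'may', 'brown', '&', '|']
print(to_postfix("( ~ash | may ) & brown"))
# ['~ash', 'may', '|', 'brown', '&']
```

In the first query `&` is emitted before `|`, which matches the stated rule that AND is evaluated before OR.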
#### Unsupported Queries
- NOT operator on an intermediate result => `~( ash | may ) & brown`
- Spaces between NOT operator and operand => `~ ash & may`
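
A negated group can usually be rewritten with De Morgan's laws so that NOT applies only to single terms: `~( ash | may ) & brown` is logically equivalent to `~ash & ~may & brown`. Whether the rewritten form is accepted by this project's parser is an assumption based on the supported-query examples. The sketch below only checks the equivalence on plain Python sets; the posting lists and the `negate` helper are made up for illustration.

```python
# Made-up posting lists: term -> set of document ids.
all_docs = {"d1", "d2", "d3", "d4"}
postings = {
    "ash": {"d1", "d2"},
    "may": {"d2", "d3"},
    "brown": {"d3", "d4"},
}


def negate(docs):
    """NOT: every document in the corpus that is not in `docs`."""
    return all_docs - docs


# ~( ash | may ) & brown   (NOT on an intermediate result)
unsupported = negate(postings["ash"] | postings["may"]) & postings["brown"]

# ~ash & ~may & brown      (NOT on single terms only)
rewritten = negate(postings["ash"]) & negate(postings["may"]) & postings["brown"]

assert unsupported == rewritten
print(sorted(rewritten))
# ['d4']
```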
## Methodology
1. Preprocess the documents to build a standard inverted index:
   - Remove special characters
   - Remove digits
   - Tokenize
   - Lowercase
   - Stem using `PorterStemmer`
   - Add unique words and their postings to the index
2. For the internals of the Boolean model and query evaluation, refer to [An example information retrieval problem](https://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html) in *Introduction to Information Retrieval*.
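
The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in `BooleanModel.py`: `naive_stem` is a crude stand-in for NLTK's `PorterStemmer`, `STOPWORDS` is a tiny hard-coded list instead of the NLTK stopword list, and all function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a tiny stand-in for the NLTK stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def naive_stem(word):
    """Crude suffix stripper; a stand-in for PorterStemmer, NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=naive_stem):
    """Strip special characters and digits, tokenize, lowercase, drop stopwords, stem."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # special characters and digits
    tokens = [t.lower() for t in text.split()]   # whitespace tokenization, lowercase
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Map each unique term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in preprocess(text):
            index[term].add(name)
    return dict(index)


# Made-up sample corpus.
docs = {
    "a.txt": "Ash reads 2 books.",
    "b.txt": "May bought a book!",
    "c.txt": "Brown is reading.",
}
index = build_index(docs)
all_docs = set(docs)

# Boolean operators map onto set operations over the postings.
book, ash = index["book"], index["ash"]
print(sorted(book & ash))        # AND -> ['a.txt']
print(sorted(ash | index["may"]))  # OR  -> ['a.txt', 'b.txt']
print(sorted(all_docs - ash))    # NOT -> ['b.txt', 'c.txt']
```

Storing postings as sets keeps the example short; the linked chapter describes sorted posting lists that are merged in linear time, which scales better for large corpora.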
## Note
- If you see a `UnicodeDecodeError` mentioning an invalid start byte, check the character encoding of the documents in the corpus. (Currently, `utf-8` is used.)
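
If the corpus mixes encodings, one option is to try several encodings in turn when reading each file. The helper below is a hypothetical sketch and is not part of `BooleanModel.py`; the choice of `latin-1` as the fallback is an assumption and may decode some files to the wrong characters.

```python
from pathlib import Path


def read_text(path, encodings=("utf-8", "latin-1")):
    """Return the file's text, trying each encoding in order.

    Hypothetical helper. If every encoding fails, undecodable bytes are
    replaced with U+FFFD so that indexing can still proceed.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")
```

Re-saving the offending documents as UTF-8 is the cleaner fix when you control the corpus; a fallback like this only hides the mismatch.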
## Authors
[Mayank Jain](https://github.com/mayank-02)
## License
MIT