https://github.com/louislefevre/information-retrieval-models

Ranks passages against queries using various models and techniques.

bm25 dirichlet-smoothing information-retrieval laplace-smoothing lidstone-smoothing query-likelihood tfidf vectorspace

Last synced: 7 months ago

Host: GitHub
URL: https://github.com/louislefevre/information-retrieval-models
Owner: louislefevre
Created: 2021-03-12T03:15:23.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2021-09-20T00:54:58.000Z (about 4 years ago)
Last Synced: 2025-02-05T18:27:37.645Z (8 months ago)
Topics: bm25, dirichlet-smoothing, information-retrieval, laplace-smoothing, lidstone-smoothing, query-likelihood, tfidf, vectorspace
Language: Python
Homepage:
Size: 162 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Information Retrieval Models
Ranks passages against queries using various models and techniques.

## Models
- BM25 - Probabilistic retrieval model for estimating the relevance of a passage.
- VectorSpace - Algebraic model for representing passages as vectors.
- QueryLikelihood - Language model for calculating the likelihood of a document being relevant to a given query.
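
To make the first two models concrete, here is a minimal sketch of BM25 scoring and of TF-IDF cosine similarity over tokenised passages. It is not taken from this repository: the function names, the `k1` and `b` defaults, and the "+1 inside the logarithm" IDF variant are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query, passage, passages, k1=1.2, b=0.75):
    """Score one tokenised passage against a tokenised query with BM25.

    k1 and b are common default values, not values taken from the repository.
    """
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    tf = Counter(passage)
    score = 0.0
    for term in set(query):
        df = sum(1 for p in passages if term in p)  # passages containing the term
        # IDF variant with +1 inside the log, which keeps the weight non-negative.
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        norm = tf[term] + k1 * (1 - b + b * len(passage) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def tfidf_vector(tokens, passages):
    """Map each term to tf * log(N / df); terms found in no passage are skipped."""
    n = len(passages)
    vector = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for p in passages if term in p)
        if df:
            vector[term] = count * math.log(n / df)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


passages = [["cat", "sat", "mat"], ["dog", "barked"], ["cat", "chased", "dog"]]
query = ["cat", "mat"]
bm25_ranking = sorted(passages, key=lambda p: bm25_score(query, p, passages), reverse=True)
```

In both cases a passage that shares no terms with the query scores zero, and passages are ranked by sorting on the score in descending order.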

## How to Run
Run *start.py* with parameters in the following format:
`start.py <dataset> <model> [-s <smoothing>]`

### Parameters
#### Dataset
- The `<dataset>` parameter is required and is the path to the dataset to be parsed.
- Expects a TSV file in the format `<qid pid query passage>`, where qid is the query ID, pid is the ID of the passage retrieved, query is the query text, and passage is the passage text.
- Each column must be tab separated.
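
As an illustration of the expected input, the sketch below reads such a file with Python's standard `csv` module. The function name `read_dataset` is hypothetical, the four-column order (qid, pid, query, passage) follows the description above, and IDs are kept as strings because the README does not state their type.

```python
import csv
import io


def read_dataset(handle):
    """Yield (qid, pid, query, passage) tuples from a tab-separated file object."""
    # QUOTE_NONE treats quote characters inside queries and passages as plain text.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        if len(row) != 4:
            raise ValueError(f"expected 4 tab-separated columns, got {len(row)}")
        yield tuple(row)


# An in-memory sample stands in for a real dataset file.
sample = io.StringIO("1\t10\twhat is bm25\tBM25 is a ranking function.\n")
rows = list(read_dataset(sample))
```

For a file on disk, the same function can be given `open(path, newline="", encoding="utf-8")`.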

#### Model
- The `<model>` parameter is required and is the name of the model used to rank passages.
- Expects either `bm25` for the BM25 model, `vs` for the Vector Space model, or `lm` for the Query Likelihood model.
- Any other input will be deemed invalid, and an exception will be raised.

#### Smoothing
- The `-s <smoothing>` parameter is required only when using the Query Likelihood model, and names the smoothing technique to apply.
- Expects either `laplace` for Laplace smoothing, `lidstone` for Lidstone smoothing, or `dirichlet` for Dirichlet smoothing.
- This parameter can only be used when the Query Likelihood model (`lm`) is selected for the `<model>` parameter; an exception is raised if it is combined with any other model.
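
For reference, the three smoothing techniques are usually written as the estimators below. This is a sketch using the textbook formulas, not the repository's code; the function names and the default values for `epsilon` and `mu` are assumptions.

```python
import math
from collections import Counter


def laplace(tf, doc_len, vocab_size):
    """Add-one smoothing: (tf + 1) / (|D| + |V|)."""
    return (tf + 1) / (doc_len + vocab_size)


def lidstone(tf, doc_len, vocab_size, epsilon=0.5):
    """Add-epsilon smoothing: (tf + eps) / (|D| + eps * |V|)."""
    return (tf + epsilon) / (doc_len + epsilon * vocab_size)


def dirichlet(tf, doc_len, collection_prob, mu=2000):
    """Dirichlet smoothing: (tf + mu * P(w|C)) / (|D| + mu)."""
    return (tf + mu * collection_prob) / (doc_len + mu)


def query_log_likelihood(query, passage, estimate):
    """Sum log P(term | passage) over the query terms.

    `estimate` is called as estimate(term, tf, doc_len) and returns a probability.
    """
    tf = Counter(passage)
    return sum(math.log(estimate(term, tf[term], len(passage))) for term in query)


# Laplace-smoothed score for a toy passage, assuming a vocabulary of 3 terms.
score = query_log_likelihood(
    ["cat"], ["cat", "sat"], lambda term, tf, n: laplace(tf, n, vocab_size=3)
)
```

Laplace is the special case of Lidstone with `epsilon = 1`. Dirichlet smoothing needs the term's probability in the whole collection, which is why the callback also receives the term itself.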

### Examples
- `start.py dataset.tsv bm25`
- `start.py dataset.tsv vs`
- `start.py dataset.tsv lm -s laplace`
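
The invocations above could be parsed and validated roughly as follows. This is a sketch built on the standard `argparse` module, not the repository's actual parser; the function name `parse_args` is hypothetical, and `parser.error` exits with status 2 instead of raising the exception the README describes.

```python
import argparse


def parse_args(argv):
    """Parse `<dataset> <model> [-s <smoothing>]` and check the -s rules."""
    parser = argparse.ArgumentParser(prog="start.py")
    parser.add_argument("dataset", help="path to the TSV dataset")
    parser.add_argument("model", choices=["bm25", "vs", "lm"])
    parser.add_argument(
        "-s", dest="smoothing", choices=["laplace", "lidstone", "dirichlet"]
    )
    args = parser.parse_args(argv)
    # -s is mandatory for the query likelihood model and rejected for the others.
    if args.model == "lm" and args.smoothing is None:
        parser.error("-s is required with the 'lm' model")
    if args.model != "lm" and args.smoothing is not None:
        parser.error("-s can only be used with the 'lm' model")
    return args


args = parse_args(["dataset.tsv", "lm", "-s", "laplace"])
```

Passing `sys.argv[1:]` in place of the literal list would read the real command line.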

## Dependencies
- [numpy](https://pypi.org/project/numpy/)
- [matplotlib](https://pypi.org/project/matplotlib/)
- [nltk](https://pypi.org/project/nltk/)
- [num2words](https://pypi.org/project/num2words/)
- [tabulate](https://pypi.org/project/tabulate/)
- [punkt (nltk module)](http://www.nltk.org/api/nltk.tokenize.html?highlight=punkt)
- [stopwords (nltk module)](https://www.nltk.org/api/nltk.corpus.html)

*NLTK modules are downloaded automatically at runtime.*
