https://github.com/mc-cat-tty/placerank
Final assignment for "Gestione dell'Informazione" ("Search Engines") course @ UniMoRe
airbnb benchmarking bert-embeddings datasets huggingface huggingface-transformers information-retrieval insideairbnb-data masked-language-models ncurses ranking-algorithm search-engine urwid whoosh
- Host: GitHub
- URL: https://github.com/mc-cat-tty/placerank
- Owner: mc-cat-tty
- Created: 2024-01-17T20:03:57.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-08T08:22:05.000Z (over 1 year ago)
- Last Synced: 2024-10-28T00:21:28.889Z (12 months ago)
- Topics: airbnb, benchmarking, bert-embeddings, datasets, huggingface, huggingface-transformers, information-retrieval, insideairbnb-data, masked-language-models, ncurses, ranking-algorithm, search-engine, urwid, whoosh
- Language: Jupyter Notebook
- Homepage:
- Size: 45.3 MB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# PlaceRank
Search engine for AirBnB listings.
Final assignment of the "Gestione dell'informazione" course at University of Modena and Reggio Emilia. Academic year 2023-2024.
## Bringup
**Python 3.11** or later is required.
To try out our not-so-SOTA search engine, run the following commands in a shell where the Python interpreter is available:
```bash
# INSTALL DEPENDENCIES
python3 -m pip install -r requirements.txt

# DOWNLOAD DATASET, CREATE INDEX, DOWNLOAD WORDNET AND BERT MODEL
python3 -m setup
```
Please be aware that `bert-large-uncased-whole-word-masking` can take up to 1.5 GB of disk space and 30 minutes to download.
By default, the model is stored in the _hf\_cache_ folder.
Experienced users may prefer to create a virtual environment first, so that all packages are installed inside it, and then follow the procedure above:
```bash
python3 -m venv venv
source venv/bin/activate
```
## Usage
The PlaceRank project comprises several modules, each with a specific and usually self-explanatory purpose. The most significant ones are:
- `ir_model`, `models`, `sentiment` and `query_expansion` modules: contain the models and services that the user can experiment with through the components described below
- `tui` package: contains the view, presenter, event dispatcher and all the logic under the UI's hood
- `benchmark` module: contains the implementation of some popular benchmarking metrics
- `preprocessing`, `dataset`, `views`, `config` modules: contain the building blocks and convenience functions/classes for the entire project
### TUI
The TUI (Terminal User Interface) is the front end of our project. Run the following command in a sufficiently large terminal window:
```bash
python3 -m placerank
```
If in doubt about the interface, see the [help page](HELP.txt).
Note that the application can take a few seconds to load, especially on the first run.
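Because the TUI needs a sufficiently large terminal window, a quick size check before launching can be handy. The sketch below uses only the standard library; the minimum size is a hypothetical value chosen for illustration, since the project does not document the real threshold, and `terminal_is_big_enough` is not part of PlaceRank.

```python
import shutil

# Hypothetical minimum size: the real threshold required by the TUI is not documented.
MIN_COLUMNS = 100
MIN_LINES = 30


def terminal_is_big_enough(min_columns=MIN_COLUMNS, min_lines=MIN_LINES):
    """Return True if the current terminal is at least min_columns x min_lines."""
    # The fallback is used when the size cannot be detected (e.g. output is piped).
    size = shutil.get_terminal_size(fallback=(80, 24))
    return size.columns >= min_columns and size.lines >= min_lines


if __name__ == "__main__":
    if not terminal_is_big_enough():
        print("The terminal window may be too small: enlarge it before starting the TUI.")
```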
#### Common Exceptions
`urwid.widget.widget.WidgetError: ... canvas when passed size ...`: this class of errors usually means that the terminal **window** is **too small** for the TUI to be rendered.
### Benchmarks
The Benchmark module tests the performance of an index against predefined queries. It can load a benchmark dataset, run its queries against an index, and compute evaluation metrics such as recall, precision, precision at rank r, average precision, mean average precision, F1 score, and the E-measure.
To use the Benchmark module, follow these steps:
Set up the benchmarks:
```bash
python3 -m setup_benchmarks
```
Create a Benchmark object:
```python
from placerank.benchmark import Benchmark  # import path assumed from the module name
bench = Benchmark()
```
Open the index:
```python
from whoosh.index import open_dir
ix = open_dir("index/benchmark")
```
Test the benchmark against the index. This step is required before any metric can be computed:
```python
bench.test_against(ix)
```
Print or use the computed metrics through the object's methods:
```python
print(bench.precision())
print(bench.recall())
print(bench.precision_at_r())
print(bench.precision_at_recall_levels())
print(bench.average_precision())
print(bench.mean_average_precision())
print(bench.f1())
print(bench.e())
```
Running the `placerank.benchmark` module from the command line computes all of the metrics above for the "index/benchmark" index, which is an inverted index built on the InsideAirbnb Cambridge listings.
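As a rough illustration of what these metrics measure, the sketch below computes them for a single ranked result list. It is not the project's `Benchmark` implementation: the function names and signatures are hypothetical, and documents are assumed to be identified by unique, hashable IDs. The E-measure is written in the form E = 1 - (1 + b²)PR / (b²P + R), which reduces to 1 - F1 when b = 1.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


def e_measure(p, r, b=1.0):
    """E-measure with weight b; equals 1 - F1 when b == 1."""
    denominator = b * b * p + r
    return 1.0 - (1 + b * b) * p * r / denominator if denominator else 1.0


def average_precision(retrieved, relevant):
    """Mean of the precision values taken at the rank of each relevant hit."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average precision averaged over (retrieved, relevant) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)
```

For example, with the ranked results `["a", "b", "c", "d"]` and the relevant set `{"a", "c", "e"}`, precision is 2/4 and recall is 2/3.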
### Reviews
The reviews dataset is used to compute the sentiment metric for each listing. Recent reviews weigh more on the score than older ones.
To compute the sentiment of each review, first build the reviews dataset with the `build_reviews_index` function of `placerank.dataset`.
The function initializes a defaultdict whose keys are listing IDs and whose values are lists of tuples containing review information.
The dataset is saved in a `reviews.pickle` file; to load it, call the `load_reviews_index` function.
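The data structure and the recency weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `placerank.dataset`: the function names, the tuple layout `(review_date, text)`, and the exponential half-life weighting are all hypothetical choices, since the README does not specify how older reviews are discounted.

```python
import pickle
from collections import defaultdict
from datetime import date


def build_index(rows):
    """Group (listing_id, review_date, text) rows by listing ID."""
    index = defaultdict(list)
    for listing_id, review_date, text in rows:
        index[listing_id].append((review_date, text))
    return index


def save_index(index, path):
    """Serialize the index to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(index, f)


def load_index(path):
    """Load an index previously written by save_index."""
    with open(path, "rb") as f:
        return pickle.load(f)


def recency_weighted_score(scored_reviews, today, half_life_days=365.0):
    """Weighted mean of (review_date, score) pairs.

    Assumption: a review's weight halves every half_life_days.
    """
    weighted_sum = weight_total = 0.0
    for review_date, score in scored_reviews:
        age_days = (today - review_date).days
        weight = 0.5 ** (age_days / half_life_days)
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0
```

With a one-year half-life, a review written today counts twice as much as one written 365 days ago.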
## Contributors
- Corradini Giulio
- Mecatti Francesco
- Stano Antonio