https://github.com/thongnt99/learned-sparse-retrieval

Unified Learned Sparse Retrieval Framework

Host: GitHub
URL: https://github.com/thongnt99/learned-sparse-retrieval
Owner: thongnt99
License: apache-2.0
Created: 2023-01-09T09:21:37.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2024-05-13T00:41:14.000Z (about 2 years ago)
Last Synced: 2024-11-11T18:43:46.113Z (over 1 year ago)
Topics: learned-sparse-retrieval, lsr, neural-ir, sparse-retrieval, transformers
Language: Python
Homepage:
Size: 275 KB
Stars: 58
Watchers: 4
Forks: 6
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

![](https://badgen.net/badge/lsr/instructions/red?icon=github) ![](https://badgen.net/badge/python/3.9.12/green?icon=python)
[![DOI](https://zenodo.org/badge/586806249.svg)](https://zenodo.org/doi/10.5281/zenodo.10659499)

# LSR: A unified framework for efficient and effective learned sparse retrieval

The framework provides a simple yet effective toolkit for defining, training, and evaluating learned sparse retrieval methods. The framework is composed of standalone modules, allowing for easy mixing and matching of different modules or integration with your own implementation. This provides flexibility to experiment and customize the retrieval model to meet your specific needs.

The structure of the `lsr` package is as follows:

```
.
├── configs          #configuration of different components
│   ├── dataset
│   ├── experiment   #define exp details: dataset, loss, model, hp
│   ├── loss
│   ├── model
│   └── wandb
├── datasets         #implementations of dataset loading & collator
├── losses           #implementations of different losses + regularizer
├── models           #implementations of different models
├── tokenizer        #a wrapper of HF's tokenizers
├── trainer          #trainer for training
└── utils            #common utilities used in different places
```

* The list of all configurations used in the paper can be found [here](#list-of-configurations-used-in-the-paper).

* Instructions for running experiments can be found [here](#training-and-inference-instructions).

## Training and inference instructions

### 1. Create conda environment and install dependencies:

Create the `conda` environment:
```
conda create --name lsr python=3.9.12
conda activate lsr
```
Install dependencies with `pip`:
```
pip install -r requirements.txt
```

### 2. Download/Prepare datasets
We have included all pre-defined dataset configurations under `lsr/configs/dataset`. Before starting training, ensure that you have the `ir_datasets` and (huggingface) `datasets` libraries installed, as the framework will automatically download and store the necessary data to the correct directories.

For datasets from `ir_datasets`, the downloaded files are saved by default at `~/.ir_datasets/`. You can modify this path by changing the `IR_DATASETS_HOME` environment variable.

Similarly, for datasets from the HuggingFace's `datasets`, the downloaded files are stored at `~/.cache/huggingface/datasets` by default. To specify a different cache directory, set the `HF_DATASETS_CACHE` environment variable.
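As a minimal sketch, both cache locations can also be set from Python. Setting them before `ir_datasets` or `datasets` is imported is the safe order. The base directory `/data/lsr_cache` and the helper `configure_dataset_caches` are hypothetical names chosen for illustration; only the two environment variable names come from the text above.

```python
import os
from pathlib import Path

# Hypothetical base directory; replace it with a disk that has enough space.
CACHE_ROOT = Path("/data/lsr_cache")


def configure_dataset_caches(root: Path) -> dict:
    """Point ir_datasets and HuggingFace datasets at sub-directories of `root`.

    Call this before importing either library so that the variables are
    already in place when the libraries look them up.
    """
    paths = {
        "IR_DATASETS_HOME": str(root / "ir_datasets"),
        "HF_DATASETS_CACHE": str(root / "huggingface" / "datasets"),
    }
    os.environ.update(paths)
    return paths


paths = configure_dataset_caches(CACHE_ROOT)
for name, value in paths.items():
    print(f"{name}={value}")
```

The same effect can be achieved with `export` in the shell before launching training; the Python form is mainly convenient in notebooks.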

To train a custom model on your own dataset, please use the sample configurations under `lsr/configs/dataset` as templates. Overall, you need three important files (see `lsr/dataset_utils` for the file format):
- document collection: maps `document_id` to `document_text`
- queries: maps `query_id` to `query_text`
- train triplets or scored pairs:
  - train triplets, used for contrastive learning, contain a list of <`query_id`, `positive_document_id`, `negative_document_id`> triplets.
  - scored pairs, used for distillation training, contain pairs of <`query`, `document_id`> with a relevance score.
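The three files listed above can be sketched with toy data as follows. This is only an illustration: the tab-separated layout, the file names, and the helper functions are assumptions, and the authoritative format is whatever `lsr/dataset_utils` expects. The check at the end verifies that every id referenced by a triplet is actually defined.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path, rows):
    """Write rows as tab-separated values (an assumed layout, see above)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


def read_tsv(path):
    """Read a tab-separated file back into a list of tuples."""
    with open(path, newline="", encoding="utf-8") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]


def missing_ids(triplets, query_ids, doc_ids):
    """Return ids referenced by the triplets that are not defined anywhere."""
    missing = set()
    for qid, pos_id, neg_id in triplets:
        if qid not in query_ids:
            missing.add(qid)
        missing.update(d for d in (pos_id, neg_id) if d not in doc_ids)
    return missing


# Hypothetical file names inside a temporary directory.
data_dir = Path(tempfile.mkdtemp())
write_tsv(data_dir / "collection.tsv", [
    ("d1", "sparse retrieval with transformers"),
    ("d2", "cooking pasta at home"),
])
write_tsv(data_dir / "queries.tsv", [("q1", "what is learned sparse retrieval")])
write_tsv(data_dir / "triplets.tsv", [("q1", "d1", "d2")])

docs = dict(read_tsv(data_dir / "collection.tsv"))
queries = dict(read_tsv(data_dir / "queries.tsv"))
triplets = read_tsv(data_dir / "triplets.tsv")

unknown = missing_ids(triplets, set(queries), set(docs))
print("undefined ids:", sorted(unknown))
```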

### 3. Train a model

To train an LSR model, run the following command:

```bash
python -m lsr.train +experiment=sparta_msmarco_distil \
training_arguments.fp16=True
```
Please note that:
- In this command, `sparta_msmarco_distil` refers to the experiment configuration file located at `lsr/configs/experiment/sparta_msmarco_distil.yaml`. If you wish to use a different experiment, simply change this value to the name of the desired configuration file under `lsr/configs/experiment`.
- You may notice a `+` before `experiment=sparta_msmarco_distil`. This is a Hydra convention for adding a new configuration key (in this case, `experiment`) that is not yet defined in *lsr/configs/config.yaml*. If you want to override an existing key (e.g., `training_arguments.fp16`), you don't need the `+` symbol.
- We trained some models using *NVIDIA A100 80GB*, allowing us to use large batch sizes (e.g., *128*). To replicate our experiments on smaller GPUs, reduce the batch size and increase the gradient accumulation steps (e.g., add `training_arguments.per_device_train_batch_size=64 +training_arguments.gradient_accumulation_steps=2` to your training command). Note: With models (e.g., Splade) using sparse regularizers during training, the results may still differ slightly since we don't take accumulation steps into account for adjusting regularization weights.
- We use `wandb` (by default) to monitor the training process, including loss, regularization, query length, and document length. If you wish to disable this feature, you can do so by adding `training_arguments.report_to='none'` to the above command. Alternatively, you can follow the instructions [here](https://docs.wandb.ai/ref/cli/wandb-login) to set up wandb.
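The batch-size note above comes down to simple arithmetic: the effective batch size is the per-device batch size times the number of devices times the accumulation steps. The sketch below computes the accumulation steps needed to reach a target batch size and formats the two Hydra overrides quoted in the note. The function names are hypothetical helpers, not part of the `lsr` package.

```python
import math


def accumulation_steps(target_batch_size, per_device_batch_size, num_devices=1):
    """Smallest number of accumulation steps that reaches the target batch size."""
    per_step = per_device_batch_size * num_devices
    return math.ceil(target_batch_size / per_step)


def hydra_overrides(per_device_batch_size, steps):
    """Format the overrides from the note above.

    The first key is overridden without `+`; the second is written with `+`,
    exactly as in the note.
    """
    return [
        f"training_arguments.per_device_train_batch_size={per_device_batch_size}",
        f"+training_arguments.gradient_accumulation_steps={steps}",
    ]


steps = accumulation_steps(target_batch_size=128, per_device_batch_size=64)
command = " ".join(
    ["python -m lsr.train", "+experiment=sparta_msmarco_distil"]
    + hydra_overrides(64, steps)
)
print(command)
```

As the note says, regularization weights are not adjusted for accumulation steps, so matching the effective batch size this way does not guarantee identical results for regularized models.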

### 4. Run inference on MSMARCO dataset

When training has finished, you can use our inference scripts to generate new queries and documents as follows:

#### 4.1 Generate queries
```
input_path=data/msmarco/dev_queries/raw.tsv
output_file_name=raw.tsv
batch_size=256
type='query'
python -m lsr.inference \
inference_arguments.input_path=$input_path \
inference_arguments.output_file=$output_file_name \
inference_arguments.type=$type \
inference_arguments.batch_size=$batch_size \
inference_arguments.scale_factor=100 \
+experiment=sparta_msmarco_distil
```
#### 4.2 Generate documents
```
input_path=data/msmarco/full_collection/split/part01
output_file_name=part01
batch_size=256
type='doc'
python -m lsr.inference \
inference_arguments.input_path=$input_path \
inference_arguments.output_file=$output_file_name \
inference_arguments.type=$type \
inference_arguments.batch_size=$batch_size \
inference_arguments.scale_factor=100 \
inference_arguments.top_k=-400 \
+experiment=sparta_msmarco_distil
```
Note:
- The `top_k` argument is the number of terms you want to keep; negative `top_k` means no pruning (all positive terms are kept).
- `scale_factor` is used for weight quantization; float weights are multiplied by this `scale_factor` and rounded to the nearest integer.
- Inference over the document collection takes a long time. Therefore, it is better to split the collection into multiple partitions and run inference on multiple GPUs.
- All the generated queries and documents are stored in the `output/{exp_name}/inference/` directory by default, where the `exp_name` parameter is defined in the experiment configuration file. You can change it as you like.
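The effect of `scale_factor` and `top_k` described in the notes above can be sketched as follows. This is an illustration of the described behaviour, not the code in `lsr.inference`: the function name, the order of operations (quantize first, then prune), and the treatment of `top_k=0` as "no pruning" are assumptions.

```python
def quantize_and_prune(weights, scale_factor=100, top_k=-1):
    """Scale float term weights to integers and optionally keep the top_k terms.

    weights: dict mapping term -> float weight.
    A non-positive top_k keeps every term whose quantized weight is positive.
    """
    quantized = {term: int(round(w * scale_factor)) for term, w in weights.items()}
    positive = {term: w for term, w in quantized.items() if w > 0}
    if top_k > 0:
        ranked = sorted(positive.items(), key=lambda item: item[1], reverse=True)
        return dict(ranked[:top_k])
    return positive


term_weights = {"hello": 0.031, "world": 1.257, "foo": 0.004}
unpruned = quantize_and_prune(term_weights, scale_factor=100, top_k=-1)
pruned = quantize_and_prune(term_weights, scale_factor=100, top_k=1)
print(unpruned)
print(pruned)
```

Note that a term with a small weight (here `foo`) disappears after rounding, so a larger `scale_factor` preserves more of the low-weight terms at the cost of larger integer impacts.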

### 5. Index generated documents
#### 5.1 Download and install our modified Anserini indexing software:
We made simple changes to the indexing procedure in Anserini to improve the indexing speed (by `10x`).
In the old method, Anserini first creates fake documents from JSON weight files (e.g., `{"hello": 3}`) by repeating each term according to its weight (e.g., `"hello hello hello"`) and then indexes these documents as regular documents. Creating these fake documents can substantially delay indexing for LSR, where the number of terms and their weights are usually large. To avoid this issue, we leverage the [FeatureField](https://lucene.apache.org/core/9_3_0/core/org/apache/lucene/document/FeatureField.html) in Lucene to inject the (term, weight) pairs directly into the index. The change is simple but quite effective, especially when you have to index multiple times (as in the paper).
You can download the modified Anserini version [here](https://github.com/thongnt99/anserini-lsr), then follow the instructions in the [README](https://github.com/thongnt99/anserini-lsr#readme) for installation. If the tests fail, you can skip them by adding `-Dmaven.test.skip=true`.
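To see why the repeated-term approach is costly, the following sketch reproduces the fake-document construction in plain Python and compares the number of tokens it materializes with the number of (term, weight) pairs that a direct injection needs. It only illustrates the cost argument; it is not Anserini's Java implementation, and the function names are hypothetical.

```python
def fake_document(term_weights):
    """Old approach: repeat each term `weight` times to form a pseudo-document."""
    parts = [" ".join([term] * weight) for term, weight in term_weights.items() if weight > 0]
    return " ".join(parts)


def token_counts(term_weights):
    """Return (tokens in the fake document, pairs needed for direct injection)."""
    repeated = sum(w for w in term_weights.values() if w > 0)
    direct = sum(1 for w in term_weights.values() if w > 0)
    return repeated, direct


small = {"hello": 3}
print(fake_document(small))

# With weights scaled by 100, a handful of terms already expands to
# hundreds of tokens that must be written and tokenized again.
large = {"hello": 300, "world": 150}
repeated, direct = token_counts(large)
print(f"fake document tokens: {repeated}, (term, weight) pairs: {direct}")
```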

When the installation is done, you can continue with the next steps.
#### 5.2 Index with Anserini
```
./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonSparseVectorCollection \
-input outputs/sparta_distil_sentence_transformers/inference/doc/ \
-index outputs/sparta_distil_sentence_transformers/index \
-generator SparseVectorDocumentGenerator \
-threads 60 -impact -pretokenized
```
Note that you have to change `sparta_distil_sentence_transformers` to the output defined in your experiment configuration file (here: `lsr/configs/experiment/sparta_msmarco_distil.yaml`).
### 6. Search on the Inverted Index
```
./anserini-lsr/target/appassembler/bin/SearchCollection \
-index outputs/sparta_distil_sentence_transformers/index/ \
-topics outputs/sparta_distil_sentence_transformers/inference/query/raw.tsv \
-topicreader TsvString \
-output outputs/sparta_distil_sentence_transformers/run.trec \
-impact -pretokenized -hits 1000 -parallelism 60
```
Here, you may need to change the output directory as in 5.2.
### 7. Evaluate the run file
```
ir_measures qrels.msmarco-passage.dev-subset.txt outputs/sparta_distil_sentence_transformers/run.trec MRR@10 R@1000 NDCG@10
```
`qrels.msmarco-passage.dev-subset.txt` is the qrels file for MSMARCO-dev in TREC format. You can find it on the MSMARCO or TREC DL (19, 20) website. Note that for TREC DL (19, 20), you have to change `R@1000` to `"R(rel=2)@1000"` (with the quotes).
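For readers who want to see what two of these measures compute, here is a small pure-Python sketch of MRR@10 and recall with a minimum relevance level (the role played by `rel=2` above). It assumes the usual TREC layouts (`qid iter docid rel` for qrels, `qid Q0 docid rank score tag` for runs) and may differ from `ir_measures` in edge cases such as queries without relevant documents; use `ir_measures` for reported numbers.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse 'qid iter docid rel' lines into {qid: {docid: rel}}."""
    qrels = defaultdict(dict)
    for line in lines:
        qid, _, doc_id, rel = line.split()
        qrels[qid][doc_id] = int(rel)
    return dict(qrels)


def parse_run(lines):
    """Parse 'qid Q0 docid rank score tag' lines into {qid: [docid, ...]} sorted by score."""
    run = defaultdict(list)
    for line in lines:
        qid, _, doc_id, _rank, score, _tag = line.split()
        run[qid].append((float(score), doc_id))
    return {q: [d for _, d in sorted(docs, key=lambda x: -x[0])] for q, docs in run.items()}


def mrr_at_k(qrels, run, k=10, min_rel=1):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, judged in qrels.items():
        for rank, doc_id in enumerate(run.get(qid, [])[:k], start=1):
            if judged.get(doc_id, 0) >= min_rel:
                total += 1.0 / rank
                break
    return total / len(qrels)


def recall_at_k(qrels, run, k=1000, min_rel=1):
    """Mean fraction of relevant documents (rel >= min_rel) found in the top k."""
    scores = []
    for qid, judged in qrels.items():
        relevant = {d for d, r in judged.items() if r >= min_rel}
        if relevant:  # skip queries with nothing relevant at this level
            scores.append(len(relevant & set(run.get(qid, [])[:k])) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


qrels = parse_qrels(["q1 0 d1 1", "q2 0 d5 2", "q2 0 d6 1"])
run = parse_run([
    "q1 Q0 d3 1 9.0 lsr", "q1 Q0 d1 2 8.0 lsr",
    "q2 Q0 d6 1 7.5 lsr", "q2 Q0 d5 2 7.0 lsr",
])
mrr = mrr_at_k(qrels, run, k=10)
recall_rel2 = recall_at_k(qrels, run, k=1000, min_rel=2)
print(f"MRR@10={mrr:.4f} R(rel=2)@1000={recall_rel2:.4f}")
```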

## List of configurations used in the paper
* **RQ1: Are the results from recent LSR papers reproducible?**

Results in Table 3 are the outputs of the following experiments:

| Method | Configuration |
| :-------- | :--------------|
| DeepCT | `lsr/configs/experiment/deepct_msmarco_term_level.yaml` |
| uniCOIL| `lsr/configs/experiment/unicoil_msmarco_multiple_negative.yaml` |
| uniCOIL_dT5q| `lsr/configs/experiment/unicoil_doct5query_msmarco_multiple_negative.yaml` |
| uniCOIL_tilde| `lsr/configs/experiment/unicoil_tilde_msmarco_multiple_negative.yaml` |
| EPIC | `lsr/configs/experiment/epic_original.yaml`|
| DeepImpact | `lsr/configs/experiment/deep_impact_original.yaml` |
| TILDE_v2| `lsr/configs/experiment/tildev2_msmarco_multiple_negative.yaml` |
| Sparta | `lsr/configs/experiment/sparta_original.yaml` |
| Splade_max| `lsr/configs/experiment/splade_msmarco_multiple_negative.yaml` |
| distilSplade_max|`lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml`|

* **RQ2: How do LSR methods perform with recent advanced training techniques?**

Results in Table 4 are the outputs of the following experiments:

* **RQ3: How does the choice of encoder architecture and regularization affect results?**

Results in Table 5 are the outputs of the following experiments:
- MSMARCO Passage

| Effect | Row | Configuration |
| :-------- | :---- | :-------------- |
| Doc weighting | 1a | Before: `lsr/configs/experiment/splade_asm_dbin_msmarco_distil.yaml`<br>After: `lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml` |
| | 1b | Before: `lsr/configs/experiment/unicoil_dbin_tilde_msmarco_distil.yaml`<br>After: `lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml` |
| Query weighting | 2a | Before: `lsr/configs/experiment/tildev2_msmarco_distil.yaml`<br>After: `lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml` |
| | 2b | Before: `lsr/configs/experiment/epic_qbin_msmarco_distil.yaml`<br>After: `lsr/configs/experiment/epic_msmarco_distil.yaml` |
| Doc expansion | 3a | Before: `lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml`<br>After: `lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml` |
| | 3b | Before: `lsr/configs/experiment/unicoil_msmarco_distil.yaml`<br>After: `lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml` |
| Query expansion | 4a | Before: `lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml`<br>After: `lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml` |
| | 4b | Before: `lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml`<br>After: `lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml` |
| Regularization | 5a | Before: `lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml`<br>After: `lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.00.yaml` |

- Tripclick

| Effect | Row | Configuration |
| :-------- | :---- | :-------------- |
| Doc weighting | 1a | Before: `lsr/configs/experiment/qmlp_dbin_tripclick_multiple_negative.yaml`<br>After: `lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml` |
| | 1b | Before: `lsr/configs/experiment/qmlp_dexpbin_tripclick_multiple_negative.yaml`<br>After: `lsr/configs/experiment/unicoil_tilde_tripclick_multiple_negative.yaml` |
| Query weighting | 2a | Before: `lsr/configs/experiment/sparta_tripclick_multiple_negative.yaml`<br>After: `lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_0.0_0.0.yaml` |
| | 2b | Before: `lsr/configs/experiment/qbin_dmlp_tripclick_multiple_negative.yaml`<br>After: `lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml` |
| Doc expansion | 3a | Before: `lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml`<br>After: `lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml` |
| | 3b | Before: `lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml`<br>After: `lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml` |
| Query expansion | 4a | Before: `lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml`<br>After: `lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml` |
| | 4b | Before: `lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml`<br>After: `lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml` |
| Regularization | 5a | Before: `lsr/configs/experiment/epic_tripclick_multiple_negative.yaml`<br>After: `lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml` |

## Citing and Authors
If you find this repository helpful, feel free to cite our paper [A Unified Framework for Learned Sparse Retrieval](https://link.springer.com/chapter/10.1007/978-3-031-28241-6_7):

```bibtex
@inproceedings{nguyen2023unified,
title={A Unified Framework for Learned Sparse Retrieval},
author={Nguyen, Thong and MacAvaney, Sean and Yates, Andrew},
booktitle={Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2--6, 2023, Proceedings, Part III},
pages={101--116},
year={2023},
organization={Springer}
}
```
