{"id":13676909,"url":"https://github.com/thongnt99/learned-sparse-retrieval","last_synced_at":"2026-01-17T07:11:39.128Z","repository":{"id":65585376,"uuid":"586806249","full_name":"thongnt99/learned-sparse-retrieval","owner":"thongnt99","description":"Unified Learned Sparse Retrieval Framework","archived":false,"fork":false,"pushed_at":"2024-05-13T00:41:14.000Z","size":282,"stargazers_count":58,"open_issues_count":4,"forks_count":6,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-11-11T18:43:46.113Z","etag":null,"topics":["learned-sparse-retrieval","lsr","neural-ir","sparse-retrieval","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thongnt99.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-09T09:21:37.000Z","updated_at":"2024-10-03T18:05:04.000Z","dependencies_parsed_at":"2023-02-16T19:30:41.107Z","dependency_job_id":"b7d5adde-f2e2-4340-a581-a5a73b2d9e57","html_url":"https://github.com/thongnt99/learned-sparse-retrieval","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thongnt99%2Flearned-sparse-retrieval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thongnt99%2Flearned-sparse-retrieval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thongnt99%2Flearned-sparse-retrieval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/thongnt99%2Flearned-sparse-retrieval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thongnt99","download_url":"https://codeload.github.com/thongnt99/learned-sparse-retrieval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251456070,"owners_count":21592288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["learned-sparse-retrieval","lsr","neural-ir","sparse-retrieval","transformers"],"created_at":"2024-08-02T13:00:34.600Z","updated_at":"2026-01-17T07:11:39.113Z","avatar_url":"https://github.com/thongnt99.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cimg src=\"images/logo.png\" width=6%\u003e ![](https://badgen.net/badge/lsr/instructions/red?icon=github) ![](https://badgen.net/badge/python/3.9.12/green?icon=python)\n[![DOI](https://zenodo.org/badge/586806249.svg)](https://zenodo.org/doi/10.5281/zenodo.10659499)\n\n# LSR: A unified framework for efficient and effective learned sparse retrieval\n\nThe framework provides a simple yet effective toolkit for defining, training, and evaluating learned sparse retrieval methods. The framework is composed of standalone modules, allowing for easy mixing and matching of different modules or integration with your own implementation. 
This provides flexibility to experiment and customize the retrieval model to meet your specific needs.\n\nThe structure of the `lsr` package is as follows: \n\n```.\n├── configs  #configuration of different components\n│   ├── dataset \n│   ├── experiment #define exp details: dataset, loss, model, hp \n│   ├── loss \n│   ├── model\n│   └── wandb\n├── datasets    #implementations of dataset loading \u0026 collator\n├── losses  #implementations of different losses + regularizer\n├── models  #implementations of different models\n├── tokenizer   #a wrapper of HF's tokenizers\n├── trainer     #trainer for training \n└── utils   #common utilities used in different places\n```\n\n* The list of all configurations used in the paper can be found [here](#list-of-configurations-used-in-the-paper)\n\n* The instructions for running experiments can be found [here](#training-and-inference-instructions)\n\n## Training and inference instructions \n\n### 1. Create conda environment and install dependencies: \n\nCreate `conda` environment:\n```\nconda create --name lsr python=3.9.12\nconda activate lsr\n```\nInstall dependencies with `pip`:\n```\npip install -r requirements.txt\n```\n\n### 2. Download/Prepare datasets\nWe have included all pre-defined dataset configurations under `lsr/configs/dataset`. Before starting training, ensure that you have the `ir_datasets` and (huggingface) `datasets` libraries installed, as the framework will automatically download the necessary data and store it in the correct directories.\n\nFor datasets from `ir_datasets`, the downloaded files are saved by default at `~/.ir_datasets/`. You can modify this path by changing the `IR_DATASETS_HOME` environment variable.\n\nSimilarly, for datasets from HuggingFace's `datasets`, the downloaded files are stored at `~/.cache/huggingface/datasets` by default. To specify a different cache directory, set the `HF_DATASETS_CACHE` environment variable. 
\n\nTo train a custom model on your own dataset, please use the sample configurations under `lsr/configs/dataset` as templates. Overall, you need three important files (see `lsr/dataset_utils` for the file format): \n- document collection: maps `document_id` to `document_text` \n- queries: maps `query_id` to `query_text`\n- train triplets or scored pairs:\n    - train triplets, used for contrastive learning, contain a list of \u003c`query_id`, `positive_document_id`, `negative_document_id`\u003e triplets.\n    - scored pairs, used for distillation training, contain pairs of \u003c`query`, `document_id`\u003e with a relevance score.  \n\n\n\n\u003c!-- #### 2.1 Hard negatives and  CE's scores for distillation\nThe full dataset consisting of hard negatives and CE's scores can be downloaded from [here](https://download.europe.naverlabs.com/splade/sigir22/data.tar.gz).\n\nTo use this dataset with our code, you need to place it in the directories specified in the corresponding data configuration file at `lsr/configs/dataset/msmarco_distil_nils.yaml`\n\n#### 2.2 BM25 negatives \n\nThe pretokenized BM25 negatives (+queries + positives) can be downloaded from [here](http://boston.lti.cs.cmu.edu/luyug/coil/msmarco-psg/).\n\nSimilar to 2.1, you need to place this data in the directory specified in `lsr/configs/dataset/coil_pretokenized.yaml`\n\n#### 2.3 Pre-expanding the passages \nTo expand the passages with an external model (docT5query or TILDE), you can use resources like [here](https://huggingface.co/doc2query/msmarco-t5-base-v1) or [here](https://github.com/ielab/TILDE/blob/main/create_psg_train_with_tilde.py). 
We provide scripts for expanding passages with TILDE in: `lsr/preprocess`\n\n#### 2.4 Term-recall datasets: \nFor training the DeepCT model, the term-recall dataset derived from MSMARCO relevant query-passage pairs can be downloaded [here](http://boston.lti.cs.cmu.edu/appendices/arXiv2019-DeepCT-Zhuyun-Dai/data/myalltrain.relevant.docterm_recall) --\u003e\n\n### 3. Train a model \n\nTo train an LSR model, simply run the following command:\n\n```bash\npython -m lsr.train +experiment=sparta_msmarco_distil \\\ntraining_arguments.fp16=True \n```\nPlease note that:\n- In this command, `sparta_msmarco_distil` refers to the experiment configuration file located at `lsr/configs/experiment/sparta_msmarco_distil.yaml`. If you wish to use a different experiment, simply change this value to the name of the desired configuration file under `lsr/configs/experiment`.\n- You may notice a `+` before `experiment=sparta_msmarco_distil`. This is a convention in Hydra to add a new configuration key (in this case, `experiment`) that is not yet defined in *lsr/configs/config.yaml*. If you want to override an existing key (e.g., `training_arguments.fp16`), you don't need to use the `+` symbol.\n- We trained some models on *NVIDIA A100 80GB* GPUs, allowing us to use large batch sizes (e.g., *128*). To replicate our experiments on smaller GPUs, reduce the batch size and increase the gradient accumulation steps (e.g., add `training_arguments.per_device_train_batch_size=64 +training_arguments.gradient_accumulation_steps=2` to your training command). Note: With models (e.g., Splade) using sparse regularizers during training, the results may still differ slightly since we don't take accumulation steps into account when adjusting regularization weights.   \n- We use `wandb` (by default) to monitor the training process, including loss, regularization, query length, and document length. 
If you wish to disable this feature, you can do so by adding `training_arguments.report_to='none'` to the above command. Alternatively, you can follow the instructions [here](https://docs.wandb.ai/ref/cli/wandb-login) to set up wandb.\n\n\n### 4. Run inference on the MSMARCO dataset \n\nWhen training has finished, you can use our inference scripts to generate new queries and documents as follows: \n\n#### 4.1 Generate queries\n```\ninput_path=data/msmarco/dev_queries/raw.tsv\noutput_file_name=raw.tsv\nbatch_size=256\ntype='query'\npython -m lsr.inference \\\ninference_arguments.input_path=$input_path \\\ninference_arguments.output_file=$output_file_name \\\ninference_arguments.type=$type \\\ninference_arguments.batch_size=$batch_size \\\ninference_arguments.scale_factor=100 \\\n+experiment=sparta_msmarco_distil \n```\n#### 4.2 Generate documents \n```\ninput_path=data/msmarco/full_collection/split/part01\noutput_file_name=part01\nbatch_size=256\ntype='doc'\npython -m lsr.inference \\\ninference_arguments.input_path=$input_path \\\ninference_arguments.output_file=$output_file_name \\\ninference_arguments.type=$type \\\ninference_arguments.batch_size=$batch_size \\\ninference_arguments.scale_factor=100 \\\ninference_arguments.top_k=-400  \\\n+experiment=sparta_msmarco_distil\n```\nNote: \n- The `top_k` argument is the number of terms you want to keep; negative `top_k` means no pruning (all positive terms are kept).   \n- `scale_factor` is used for weight quantization; float weights are multiplied by this `scale_factor` and rounded to the nearest integer. \n- Inference on the document collection takes a long time. Therefore, it is better to split the collection into multiple partitions and run inference using multiple GPUs. \n- All the generated queries and documents are stored in the `output/{exp_name}/inference/` directory by default, where the `exp_name` parameter is defined in the experiment configuration file. You can change it as you like. \n\n### 5. 
Index generated documents \n#### 5.1 Download and install our modified Anserini indexing software:\nWe made simple changes to the indexing procedure in Anserini to improve the indexing speed (by `10x`). \nIn the old method, Anserini first creates fake documents from JSON weight files (e.g., `{\"hello\": 3}`) by repeating the term (e.g., `\"hello hello hello\"`) and then indexes these documents as regular documents. The process of creating these fake documents can cause a substantial delay when indexing LSR output, where the number of terms and the weights are usually large. To avoid this issue, we leverage the [FeatureField](https://lucene.apache.org/core/9_3_0/core/org/apache/lucene/document/FeatureField.html) in Lucene to inject the (term, weight) pairs directly into the index. The change is simple but quite effective, especially when you have to index multiple times (as in the paper).   \nYou can download the modified Anserini version [here](https://github.com/thongnt99/anserini-lsr), then follow the instructions in the [README](https://github.com/thongnt99/anserini-lsr#readme) for installation. If the tests fail, you can skip them by adding `-Dmaven.test.skip=true`.\n\nWhen the installation is done, you can continue with the next steps. \n#### 5.2 Index with Anserini\n```\n./anserini-lsr/target/appassembler/bin/IndexCollection \\\n-collection JsonSparseVectorCollection \\\n-input outputs/sparta_distil_sentence_transformers/inference/doc/  \\\n-index outputs/sparta_distil_sentence_transformers/index \\\n-generator SparseVectorDocumentGenerator \\\n-threads 60 -impact -pretokenized\n```\nNote that you have to change `sparta_distil_sentence_transformers` to the output defined in your experiment configuration file (here: `lsr/configs/experiment/sparta_msmarco_distil.yaml`).\n### 6. 
Search on the Inverted Index\n```\n./anserini-lsr/target/appassembler/bin/SearchCollection \\\n-index outputs/sparta_distil_sentence_transformers/index/  \\\n-topics outputs/sparta_distil_sentence_transformers/inference/query/raw.tsv \\\n-topicreader TsvString \\\n-output outputs/sparta_distil_sentence_transformers/run.trec \\\n-impact -pretokenized -hits 1000 -parallelism 60\n```\nHere, you may need to change the output directory as in 5.2. \n### 7. Evaluate the run file\n```\nir_measures qrels.msmarco-passage.dev-subset.txt outputs/sparta_distil_sentence_transformers/run.trec MRR@10 R@1000 NDCG@10\n```\n`qrels.msmarco-passage.dev-subset.txt` is the qrels file for MSMARCO-dev in TREC format. You can find it on the MSMARCO or TREC DL (19,20) website. Note that for TREC DL (19,20), you have to change `R@1000` to `\"R(rel=2)@1000\"` (with the quotes). \n\n## List of configurations used in the paper \n* **RQ1: Are the results from recent LSR papers reproducible?**\n\nResults in Table 3 are the outputs of the following experiments: \n\n|  Method  | Configuration  |\n| :-------- | :--------------|\n| DeepCT | `lsr/configs/experiment/deepct_msmarco_term_level.yaml` |\n| uniCOIL| `lsr/configs/experiment/unicoil_msmarco_multiple_negative.yaml` |\n| uniCOIL\u003csub\u003edT5q\u003c/sub\u003e| `lsr/configs/experiment/unicoil_doct5query_msmarco_multiple_negative.yaml` | \n| uniCOIL\u003csub\u003etilde\u003c/sub\u003e| `lsr/configs/experiment/unicoil_tilde_msmarco_multiple_negative.yaml` | \n| EPIC | `lsr/configs/experiment/epic_original.yaml`| \n| DeepImpact | `lsr/configs/experiment/deep_impact_original.yaml` | \n| TILDE\u003csub\u003ev2\u003c/sub\u003e| `lsr/configs/experiment/tildev2_msmarco_multiple_negative.yaml` |\n| Sparta | `lsr/configs/experiment/sparta_original.yaml` |\n| Splade\u003csub\u003emax\u003c/sub\u003e| `lsr/configs/experiment/splade_msmarco_multiple_negative.yaml` |\n| 
distilSplade\u003csub\u003emax\u003c/sub\u003e|`lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml`|\n\n\n* **RQ2: How do LSR methods perform with recent advanced training\ntechniques?**\n\nResults in Table 4 are the outputs of the following experiments: \n\n|  Method  | Configuration  |\n| :-------- | :-------------- |\n| uniCOIL| `lsr/configs/experiment/unicoil_msmarco_distil.yaml` |\n| uniCOIL\u003csub\u003edT5q\u003c/sub\u003e| `lsr/configs/experiment/unicoil_doct5query_msmarco_distil.yaml`| \n| uniCOIL\u003csub\u003etilde\u003c/sub\u003e| `lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml` | \n| EPIC | `lsr/configs/experiment/epic_msmarco_distil.yaml` | \n| DeepImpact | `lsr/configs/experiment/deep_impact_msmarco_distil.yaml` | \n| TILDE\u003csub\u003ev2\u003c/sub\u003e| `lsr/configs/experiment/tildev2_msmarco_distil.yaml` |\n| Sparta | `lsr/configs/experiment/sparta_msmarco_distil.yaml` |\n| distilSplade\u003csub\u003emax\u003c/sub\u003e|`lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml` |\n| distilSplade\u003csub\u003esep\u003c/sub\u003e| `lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml`|\n\n* **RQ3: How does the choice of encoder architecture and regularization\naffect results?**\n\nResults in Table 5 are the outputs of the following experiments: \n\n- MSMARCO Passage\n\n|  Effect  |  Row | Configuration  |\n| :-------- | :---- | :-------------- |\n| Doc weighting | 1a | Before: `lsr/configs/experiment/splade_asm_dbin_msmarco_distil.yaml` \u003cbr\u003e After: `lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml`  |\n|  | 1b | Before: `lsr/configs/experiment/unicoil_dbin_tilde_msmarco_distil.yaml` \u003cbr\u003e After: `lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml` |\n| Query weighting | 2a | Before: `lsr/configs/experiment/tildev2_msmarco_distil.yaml` \u003cbr\u003e After: `lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml`|\n|  | 2b | Before: 
`lsr/configs/experiment/epic_qbin_msmarco_distil.yaml` \u003cbr\u003e After: `lsr/configs/experiment/epic_msmarco_distil.yaml`|\n| Doc expansion | 3a | Before: `lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml` \u003cbr\u003e After: `lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml`|\n|  | 3b | Before: `lsr/configs/experiment/unicoil_msmarco_distil.yaml` \u003cbr\u003e After: `lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml` |\n| Query expansion | 4a | Before: `splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml` \u003cbr\u003e After: `lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml`|\n|  | 4b | Before: `lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml` \u003cbr\u003e After: `lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml`|\n| Regularization | 5a | Before: `lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml` \u003cbr\u003e After: `lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.00.yaml`|\n\n- Tripclick\n\n|  Effect  |  Row | Configuration  |\n| :-------- | :---- | :-------------- |\n| Doc weighting | 1a | Before: `lsr/configs/experiment/qmlp_dbin_tripclick_multiple_negative.yaml` \u003cbr\u003e After: `lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml`  |\n|  | 1b | Before: `lsr/configs/experiment/qmlp_dexpbin_tripclick_multiple_negative.yaml` \u003cbr\u003e After: `lsr/configs/experiment/unicoil_tilde_tripclick_multiple_negative.yaml` |\n| Query weighting | 2a | Before: `lsr/configs/experiment/sparta_tripclick_multiple_negative.yaml` \u003cbr\u003e After: `lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_0.0_0.0.yaml`|\n|  | 2b | Before: `lsr/configs/experiment/qbin_dmlp_tripclick_multiple_negative.yaml` \u003cbr\u003e After: `lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml`|\n| Doc expansion | 3a | Before: `lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml` 
\u003cbr\u003e After: `lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml`|\n|  | 3b | Before: `lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml` \u003cbr\u003e After: `lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml` |\n| Query expansion | 4a | Before: `lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml` \u003cbr\u003e After: `lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml`|\n|  | 4b | Before: `lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml` \u003cbr\u003e After: `lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml`|\n| Regularization | 5a | Before: `lsr/configs/experiment/epic_tripclick_multiple_negative.yaml` \u003cbr\u003e After: `lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml`|\n\n## Citing and Authors\nIf you find this repository helpful, feel free to cite our paper [A Unified Framework for Learned Sparse Retrieval](https://link.springer.com/chapter/10.1007/978-3-031-28241-6_7)\n\n```bibtex\n@inproceedings{nguyen2023unified,\n  title={A Unified Framework for Learned Sparse Retrieval},\n  author={Nguyen, Thong and MacAvaney, Sean and Yates, Andrew},\n  booktitle={Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2--6, 2023, Proceedings, Part III},\n  pages={101--116},\n  year={2023},\n  organization={Springer}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthongnt99%2Flearned-sparse-retrieval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthongnt99%2Flearned-sparse-retrieval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthongnt99%2Flearned-sparse-retrieval/lists"}