{"id":13677995,"url":"https://github.com/naver/splade","last_synced_at":"2025-05-16T15:05:49.004Z","repository":{"id":37979337,"uuid":"385813532","full_name":"naver/splade","owner":"naver","description":"SPLADE: sparse neural search (SIGIR21, SIGIR22)","archived":false,"fork":false,"pushed_at":"2024-05-03T14:52:32.000Z","size":3254,"stargazers_count":834,"open_issues_count":15,"forks_count":90,"subscribers_count":21,"default_branch":"main","last_synced_at":"2025-04-03T11:11:13.544Z","etag":null,"topics":["bert","information-retrieval","nlp","passage-retrieval","sparse","splade"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/naver.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-14T04:30:58.000Z","updated_at":"2025-04-02T14:45:13.000Z","dependencies_parsed_at":"2023-02-05T21:01:39.292Z","dependency_job_id":"d1396772-a181-4388-a7bb-eb735a6d472f","html_url":"https://github.com/naver/splade","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fsplade","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fsplade/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fsplade/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naver%2Fsplade/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/naver","download_url":"https://code
load.github.com/naver/splade/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248564977,"owners_count":21125412,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","information-retrieval","nlp","passage-retrieval","sparse","splade"],"created_at":"2024-08-02T13:00:49.220Z","updated_at":"2025-04-12T11:50:42.271Z","avatar_url":"https://github.com/naver.png","language":"Python","readme":"# SPLADE\n[![paper](https://img.shields.io/badge/arxiv-arXiv%3A2107.05720-brightgreen)](https://arxiv.org/abs/2107.05720)\n[![blog](https://img.shields.io/badge/blog-splade-orange)](https://europe.naverlabs.com/blog/splade-a-sparse-bi-encoder-bert-based-model-achieves-effective-and-efficient-first-stage-ranking/)\n[![huggingface weights](https://img.shields.io/badge/huggingface-splade-9cf)](https://huggingface.co/naver)\n[![weights](https://img.shields.io/badge/weights-splade-blue)](https://europe.naverlabs.com/research/machine-learning-and-optimization/splade-models/)\n\n## What's New:\n* November 2023: Better training code for SPLADE and for training rerankers (e.g., cross-encoders, RankT5) is available; new models coming soon on GitHub!\n* July 2023: We add the code for static pruning of SPLADE indexes, in order to reproduce [A Static Pruning Study on Sparse Neural Retrievers](https://arxiv.org/abs/2304.12702)\n* May 2023: We add a new branch (based on the HF Trainer) allowing training with several negatives: https://github.com/naver/splade/tree/hf\n* April 2023: We have removed the weights and pushed them to Hugging Face 
(https://huggingface.co/naver/splade_v2_max and https://huggingface.co/naver/splade_v2_distil) \n\n\n\n\u003cimg src=\"./images/splade_figure.png\" width=\"650\"\u003e\n\nThis repository contains the code to perform **training**, **indexing** and **retrieval** for SPLADE models. It also\nincludes everything needed to launch evaluation on the [BEIR](https://github.com/beir-cellar/beir) benchmark.\n\n**TL;DR**\nSPLADE is a neural retrieval model which learns query/document **sparse** expansion via the BERT MLM head and sparse\nregularization. Sparse representations offer several advantages over dense approaches: efficient use of the\ninverted index, explicit lexical matching, interpretability... They also seem to generalize better on out-of-domain\ndata (BEIR benchmark).\n\n* (v1, SPLADE) [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://arxiv.org/abs/2107.05720), *Thibault Formal*, *Benjamin Piwowarski*\n  and *Stéphane Clinchant*. SIGIR21 short paper.\n\nBuilding on recent advances in training neural retrievers, our **v2** models rely on hard-negative mining,\ndistillation and better Pre-trained Language Model initialization to further increase their **effectiveness**, on both\nin-domain (MS MARCO) and out-of-domain (BEIR benchmark) evaluation.\n\n* (v2, SPLADE v2) [SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval](https://arxiv.org/abs/2109.10086), *Thibault Formal*, *Benjamin\n  Piwowarski*, *Carlos Lassance*, and *Stéphane Clinchant*. arXiv preprint.\n* (v2bis, SPLADE++) [From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective](http://arxiv.org/abs/2205.04733), *Thibault\n  Formal*, *Carlos Lassance*, *Benjamin Piwowarski*, and *Stéphane Clinchant*. 
SIGIR22 short paper (*extension of SPLADE v2*).\n\nFinally, by introducing several modifications (query-specific regularization, disjoint encoders, etc.), we are able to improve **efficiency**, achieving latency on par with BM25 under the same computing constraints.\n\n* (efficient SPLADE) [An Efficiency Study for SPLADE Models](https://dl.acm.org/doi/10.1145/3477495.3531833), *Carlos Lassance* and *Stéphane Clinchant*. SIGIR22 short paper.\n\nWeights for models trained under various settings can be found\non the [Naver Labs Europe website](https://europe.naverlabs.com/research/machine-learning-and-optimization/splade-models/),\nas well as on [Hugging Face](https://huggingface.co/naver). Please bear in mind that SPLADE is more a class of models\nthan a single model: depending on the regularization magnitude, we can obtain different models (from very sparse\nto models doing intense query/doc expansion) with different properties and performance.\n\n*splade: a spork that is sharp along one edge or both edges, enabling it to be used as a knife, a fork and a spoon.*\n\n***\n\n# Getting started :rocket:\n\n## Requirements\n\nWe recommend starting from a fresh environment and installing the packages from `conda_splade_env.yml` (the YAML file defines the environment, so no separate `conda create` is needed):\n\n```\nconda env create -f conda_splade_env.yml\nconda activate splade_env\n```\n\n## Usage\n\n### Playing with the model\n\n`inference_splade.ipynb` allows you to load and perform inference with a trained model, in order to inspect the\npredicted \"bag-of-expanded-words\". 
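The predicted term weights come from a log-saturated ReLU over the MLM logits, max-pooled over the input tokens; below is a minimal sketch of that pooling (our illustration following the papers, not the notebook's code):\n\n```python\n# Sketch of SPLADE pooling: w_j = max_i log(1 + relu(logit_ij)), as in the papers.\n# Illustrative only -- the repo's inference_splade.ipynb may differ in details.\nimport torch\n\ndef splade_pool(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:\n    # logits: (batch, seq_len, vocab_size); attention_mask: (batch, seq_len)\n    saturated = torch.log1p(torch.relu(logits))        # log-saturation dampens spikes\n    masked = saturated * attention_mask.unsqueeze(-1)  # zero out padding positions\n    return masked.max(dim=1).values                    # (batch, vocab_size), sparse\n\n# With a real checkpoint (requires a download, hence commented out):\n# from transformers import AutoModelForMaskedLM, AutoTokenizer\n# tok = AutoTokenizer.from_pretrained('naver/splade-cocondenser-ensembledistil')\n# mlm = AutoModelForMaskedLM.from_pretrained('naver/splade-cocondenser-ensembledistil')\n# batch = tok('sparse neural retrieval', return_tensors='pt')\n# rep = splade_pool(mlm(**batch).logits, batch['attention_mask'])\n```\n\nMost vocabulary entries end up at exactly zero, which is what makes the representation usable with an inverted index.\n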
We provide weights for six main models:\n\n| model | MRR@10 (MS MARCO dev) | \n| --- | --- |\n| `naver/splade_v2_max` (**v2** [HF](https://huggingface.co/naver/splade_v2_max)) | 34.0 |\n| `naver/splade_v2_distil` (**v2** [HF](https://huggingface.co/naver/splade_v2_distil)) | 36.8 |\n| `naver/splade-cocondenser-selfdistil` (**SPLADE++**, [HF](https://huggingface.co/naver/splade-cocondenser-selfdistil)) | 37.6 | \n| `naver/splade-cocondenser-ensembledistil` (**SPLADE++**, [HF](https://huggingface.co/naver/splade-cocondenser-ensembledistil)) | 38.3 |\n| `naver/efficient-splade-V-large-doc` ([HF](https://huggingface.co/naver/efficient-splade-V-large-doc)) + `naver/efficient-splade-V-large-query` ([HF](https://huggingface.co/naver/efficient-splade-V-large-query)) (**efficient SPLADE**) | 38.8 |\n| `naver/efficient-splade-VI-BT-large-doc` ([HF](https://huggingface.co/naver/efficient-splade-VI-BT-large-doc)) + `naver/efficient-splade-VI-BT-large-query` ([HF](https://huggingface.co/naver/efficient-splade-VI-BT-large-query)) (**efficient SPLADE**) | 38.0 |\n\n\nWe also uploaded various\nmodels [here](https://europe.naverlabs.com/research/machine-learning-and-optimization/splade-models/). Feel free to try\nthem out!\n\n### High level overview of the code structure\n\n* This repository lets you train (`train.py`), index (`index.py`) and retrieve (`retrieve.py`) with SPLADE models, or perform all of these\n  steps with `all.py`.\n* To manage experiments, we rely on [hydra](https://github.com/facebookresearch/hydra). 
Please refer\n  to [conf/README.md](conf/README.md) for a complete guide on how we configured experiments.\n\n### Data\n\n* To train models, we rely on [MS MARCO](https://github.com/microsoft/MSMARCO-Passage-Ranking) data.\n* We further rely on distillation and hard-negative mining, using available\n  datasets ([Margin MSE Distillation](https://github.com/sebastian-hofstaetter/neural-ranking-kd)\n  , [Sentence Transformers Hard Negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives))\n  or datasets we built ourselves (e.g. negatives mined from SPLADE).\n* Most of the data formats are fairly standard; for validation, we rely on an approximate validation set, following a\n  setting similar to [TAS-B](https://arxiv.org/abs/2104.06967).\n\nTo simplify setup, we have made all our data folders available; they can\nbe [downloaded here](https://download.europe.naverlabs.com/splade/sigir22/data.tar.gz). This link includes queries,\ndocuments and hard-negative data, allowing for training under the `EnsembleDistil` setting (see the v2bis paper). For the\nother settings (`Simple`, `DistilMSE`, `SelfDistil`), you also have to download:\n\n* [(`Simple`) standard BM25 Triplets](https://download.europe.naverlabs.com/splade/sigir22/triplets.tar.gz)\n* [(`DistilMSE`) \"Vienna\" triplets for MarginMSE distillation](https://www.dropbox.com/s/sl07yvse3rlowxg/vienna_triplets.tar.gz?dl=0)\n* [(`SelfDistil`) triplets mined from SPLADE](https://download.europe.naverlabs.com/splade/sigir22/splade_triplets.tar.gz)\n\nAfter downloading, simply untar each archive in the root directory; the files will be placed in the right folders.\n\n```\ntar -xzvf file.tar.gz\n```\n\n### Quick start\n\nIn order to perform all steps (here on toy data, i.e. 
`config_default.yaml`), go to the root directory and run:\n\n```bash\nconda activate splade_env\nexport PYTHONPATH=$PYTHONPATH:$(pwd)\nexport SPLADE_CONFIG_NAME=\"config_default.yaml\"\npython3 -m splade.all \\\n  config.checkpoint_dir=experiments/debug/checkpoint \\\n  config.index_dir=experiments/debug/index \\\n  config.out_dir=experiments/debug/out\n```\n\n### Additional examples\n\nWe provide additional examples that can be plugged into the above command. See [conf/README.md](conf/README.md) for details\non how to change experiment settings.\n\n* you can similarly run training with `python3 -m splade.train` (same for indexing or retrieval)\n* to create Anserini-readable files (after training),\n  run `SPLADE_CONFIG_FULLPATH=/path/to/checkpoint/dir/config.yaml python3 -m splade.create_anserini +quantization_factor_document=100 +quantization_factor_query=100`\n* config files for various settings (distillation etc.) are available in `/conf`. For instance, to run the `SelfDistil`\n  setting:\n    * change to `SPLADE_CONFIG_NAME=config_splade++_selfdistil.yaml`\n    * to further change parameters (e.g. lambdas) *outside* the config,\n      run: `python3 -m splade.all config.regularizer.FLOPS.lambda_q=0.06 config.regularizer.FLOPS.lambda_d=0.02`\n\nWe provide several base configurations which correspond to the experiments in the v2bis and \"efficiency\" papers. Please note that these are\nsuited to our hardware setting, i.e. 4 Tesla V100 GPUs with 32GB memory each. In order to train models with e.g. one GPU,\nyou need to decrease the batch size for training and evaluation. Also note that, as the range of the loss might change\nwith a different batch size, the corresponding regularization lambdas might need to be adapted. 
However, we provide a mono-GPU configuration,\n`config_splade++_cocondenser_ensembledistil_monogpu.yaml`, for which we obtain 37.2 MRR@10, trained on a single 16GB GPU.\n\n### Evaluating a pre-trained model\n\nIndexing (and retrieval) can be done either with our (numba-based) implementation of an inverted index,\nor with [Anserini](https://github.com/castorini/anserini). Let's perform these steps using an available model (`naver/splade-cocondenser-ensembledistil`).\n\n```bash\nconda activate splade_env\nexport PYTHONPATH=$PYTHONPATH:$(pwd)\nexport SPLADE_CONFIG_NAME=\"config_splade++_cocondenser_ensembledistil\"\npython3 -m splade.index \\\n  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \\\n  config.pretrained_no_yamlconfig=true \\\n  config.index_dir=experiments/pre-trained/index\npython3 -m splade.retrieve \\\n  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \\\n  config.pretrained_no_yamlconfig=true \\\n  config.index_dir=experiments/pre-trained/index \\\n  config.out_dir=experiments/pre-trained/out\n# pretrained_no_yamlconfig indicates that we solely rely on an HF-valid model path\n```\n\n* To change the data, simply override the hydra `retrieve_evaluate` package, e.g. add `retrieve_evaluate=msmarco` as an argument to `splade.retrieve`.\n\nYou can similarly build the files that will be ingested by Anserini:\n\n```bash\npython3 -m splade.create_anserini \\\n  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \\\n  config.pretrained_no_yamlconfig=true \\\n  config.index_dir=experiments/pre-trained/index \\\n  +quantization_factor_document=100 \\\n  +quantization_factor_query=100\n```\n\nThis creates the JSON collection (`docs_anserini.jsonl`) as well as the queries (`queries_anserini.tsv`) that are\nneeded for Anserini. 
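The quantization factors map the real-valued term weights to the integer impacts that Anserini stores; conceptually, the transformation looks like this (a sketch of the idea only, not the repo's `create_anserini` code):\n\n```python\n# Conceptual impact quantization for Anserini: scale, round, drop zeroed terms.\n# Hypothetical helper -- illustrates what quantization_factor=100 means above.\ndef quantize(weights, factor=100):\n    impacts = {term: int(round(w * factor)) for term, w in weights.items()}\n    return {term: i for term, i in impacts.items() if i > 0}\n\nprint(quantize({'neural': 1.234, 'search': 0.002}))  # {'neural': 123}\n```\n\nLarger factors preserve more precision at the cost of larger postings; 100 is the value used in the commands above.\n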
You then just need to follow the regression for\nSPLADE [here](https://github.com/castorini/anserini/blob/master/docs/regressions-msmarco-passage-distill-splade-max.md) in order to index and retrieve.\n\n### BEIR eval\n\nYou can also run evaluation on BEIR, for instance:\n\n```bash\nconda activate splade_env\nexport PYTHONPATH=$PYTHONPATH:$(pwd)\nexport SPLADE_CONFIG_FULLPATH=\"/path/to/checkpoint/dir/config.yaml\"\nfor dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq\ndo\n    python3 -m splade.beir_eval \\\n      +beir.dataset=$dataset \\\n      +beir.dataset_path=data/beir \\\n      config.index_retrieve_batch_size=100\ndone\n```\n\n### PISA evaluation\n\nWe provide in `efficient_splade_pisa/README.md` the steps to evaluate efficient SPLADE models with PISA.\n\n***\n\n# Cite :scroll:\n\nPlease cite our work as:\n\n* (v1) SIGIR21 short paper\n\n```\n@inbook{10.1145/3404835.3463098,\nauthor = {Formal, Thibault and Piwowarski, Benjamin and Clinchant, St\\'{e}phane},\ntitle = {SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking},\nyear = {2021},\nisbn = {9781450380379},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nurl = {https://doi.org/10.1145/3404835.3463098},\nbooktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},\npages = {2288–2292},\nnumpages = {5}\n}\n```\n\n* (v2) arxiv\n\n```\n@misc{https://doi.org/10.48550/arxiv.2109.10086,\n  doi = {10.48550/ARXIV.2109.10086},\n  url = {https://arxiv.org/abs/2109.10086},\n  author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, Stéphane},\n  keywords = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},\n  title = {SPLADE v2: Sparse Lexical and Expansion Model 
for Information Retrieval},\n  publisher = {arXiv},\n  year = {2021},\n  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}\n}\n```\n\n* (v2bis) SPLADE++, SIGIR22 short paper\n\n```\n@inproceedings{10.1145/3477495.3531857,\nauthor = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, St\\'{e}phane},\ntitle = {From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},\nyear = {2022},\nisbn = {9781450387323},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nurl = {https://doi.org/10.1145/3477495.3531857},\ndoi = {10.1145/3477495.3531857},\nabstract = {Neural retrievers based on dense representations combined with Approximate Nearest Neighbors search have recently received a lot of attention, owing their success to distillation and/or better sampling of examples for training -- while still relying on the same backbone architecture. In the meantime, sparse representation learning fueled by traditional inverted indexing techniques has seen a growing interest, inheriting from desirable IR priors such as explicit lexical matching. While some architectural variants have been proposed, a lesser effort has been put in the training of such models. In this work, we build on SPLADE -- a sparse expansion-based retriever -- and show to which extent it is able to benefit from the same training improvements as dense models, by studying the effect of distillation, hard-negative mining as well as the Pre-trained Language Model initialization. 
We furthermore study the link between effectiveness and efficiency, on in-domain and zero-shot settings, leading to state-of-the-art results in both scenarios for sufficiently expressive models.},\nbooktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},\npages = {2353–2359},\nnumpages = {7},\nkeywords = {neural networks, indexing, sparse representations, regularization},\nlocation = {Madrid, Spain},\nseries = {SIGIR '22}\n}\n```\n\n* efficient SPLADE, SIGIR22 short paper\n\n```\n@inproceedings{10.1145/3477495.3531833,\nauthor = {Lassance, Carlos and Clinchant, St\\'{e}phane},\ntitle = {An Efficiency Study for SPLADE Models},\nyear = {2022},\nisbn = {9781450387323},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nurl = {https://doi.org/10.1145/3477495.3531833},\ndoi = {10.1145/3477495.3531833},\nabstract = {Latency and efficiency issues are often overlooked when evaluating IR models based on Pretrained Language Models (PLMs) in reason of multiple hardware and software testing scenarios. Nevertheless, efficiency is an important part of such systems and should not be overlooked. In this paper, we focus on improving the efficiency of the SPLADE model since it has achieved state-of-the-art zero-shot performance and competitive results on TREC collections. SPLADE efficiency can be controlled via a regularization factor, but solely controlling this regularization has been shown to not be efficient enough. In order to reduce the latency gap between SPLADE and traditional retrieval systems, we propose several techniques including L1 regularization for queries, a separation of document/query encoders, a FLOPS-regularized middle-training, and the use of faster query encoders. Our benchmark demonstrates that we can drastically improve the efficiency of these models while increasing the performance metrics on in-domain data. 
To our knowledge, we propose the first neural models that, under the same computing constraints, achieve similar latency (less than 4ms difference) as traditional BM25, while having similar performance (less than 10% MRR@10 reduction) as the state-of-the-art single-stage neural rankers on in-domain data.},\nbooktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},\npages = {2220–2226},\nnumpages = {7},\nkeywords = {splade, sparse representations, latency, information retrieval},\nlocation = {Madrid, Spain},\nseries = {SIGIR '22}\n}\n```\n\n***\n\n# Contact :mailbox_with_no_mail:\n\nFeel free to contact us via [Twitter](https://twitter.com/thibault_formal) or by mail @ thibault.formal@naverlabs.com !\n\n# License\n\nSPLADE Copyright (c) 2021-present NAVER Corp.\n\nSPLADE is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.\n(see [license](license.txt))\n\nYou should have received a copy of the license along with this work. If not,\nsee http://creativecommons.org/licenses/by-nc-sa/4.0/ .\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnaver%2Fsplade","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnaver%2Fsplade","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnaver%2Fsplade/lists"}