{"id":31296729,"url":"https://github.com/rasyosef/splade-index","last_synced_at":"2025-09-24T21:03:29.466Z","repository":{"id":306335635,"uuid":"1025769767","full_name":"rasyosef/splade-index","owner":"rasyosef","description":"Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba","archived":false,"fork":false,"pushed_at":"2025-09-14T01:55:17.000Z","size":2242,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-16T08:29:30.450Z","etag":null,"topics":["information-retrieval","numba","numpy","python","retrieval","search","search-index","sentence-transformers","sparse-embedding","sparse-retrieval","splade"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rasyosef.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-24T19:15:18.000Z","updated_at":"2025-09-14T01:55:20.000Z","dependencies_parsed_at":"2025-07-25T04:29:11.106Z","dependency_job_id":"99141910-2f76-4f22-8166-aed19a2193fc","html_url":"https://github.com/rasyosef/splade-index","commit_stats":null,"previous_names":["rasyosef/splade-index"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rasyosef/splade-index","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-index","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/rasyosef%2Fsplade-index/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-index/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-index/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rasyosef","download_url":"https://codeload.github.com/rasyosef/splade-index/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-index/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275432596,"owners_count":25463759,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-16T02:00:10.229Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-retrieval","numba","numpy","python","retrieval","search","search-index","sentence-transformers","sparse-embedding","sparse-retrieval","splade"],"created_at":"2025-09-24T21:03:02.765Z","updated_at":"2025-09-24T21:03:29.461Z","avatar_url":"https://github.com/rasyosef.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SPLADE-Index⚡\n\n\u003ci\u003e\nSPLADE-Index is an ultrafast search index for SPLADE sparse retrieval models implemented in pure Python. 
It is built on top of the BM25s library.\n\u003c/i\u003e\n\u003cbr/\u003e\u003cbr/\u003e\n\nSPLADE is a neural retrieval model which learns query/document sparse expansion. Sparse representations benefit from several advantages compared to dense approaches: efficient use of inverted index, explicit lexical match, interpretability... They also seem to be better at generalizing on out-of-domain data (BEIR benchmark).\n\nFor more information about SPLADE models, please refer to the following. \n - [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://arxiv.org/abs/2107.05720)\n - [List of Pretrained Sparse Encoder (Sparse Embeddings) Models](https://sbert.net/docs/sparse_encoder/pretrained_models.html)\n - [Training and Finetuning Sparse Embedding Models with Sentence Transformers v5](https://huggingface.co/blog/train-sparse-encoder).\n\n## Installation\n\nYou can install `splade-index` with pip:\n\n```bash\npip install splade-index\n```\n\n## Quickstart\n\nHere is a simple example of how to use `splade-index`:\n\n```python\nfrom sentence_transformers import SparseEncoder\nfrom splade_index import SPLADE\n\n# Download a SPLADE model from the 🤗 Hub\nmodel = SparseEncoder(\"rasyosef/splade-tiny\")\n\n# Create your corpus here\ncorpus = [\n    \"a cat is a feline and likes to purr\",\n    \"a dog is the human's best friend and loves to play\",\n    \"a bird is a beautiful animal that can fly\",\n    \"a fish is a creature that lives in water and swims\",\n]\n\n# Create the SPLADE retriever and index the corpus\nretriever = SPLADE()\nretriever.index(model=model, documents=corpus)\n\n# Query the corpus\nqueries = [\"does the fish purr like a cat?\"]\n\n# Get top-k results as a tuple of (doc ids, documents, scores). 
All three are arrays of shape (n_queries, k).\nresults = retriever.retrieve(queries, k=2)\ndoc_ids, result_docs, scores = results.doc_ids, results.documents, results.scores\n\nfor i in range(doc_ids.shape[1]):\n    doc_id, doc, score = doc_ids[0, i], result_docs[0, i], scores[0, i]\n    print(f\"Rank {i+1} (score: {score:.2f}) (doc_id: {doc_id}): {doc}\")\n\n# You can save the index to a directory\nretriever.save(\"animal_index_splade\")\n\n# ...and load it when you need it\nimport splade_index\n\nreloaded_retriever = splade_index.SPLADE.load(\"animal_index_splade\", model=model)\n```\n\n\n## Hugging Face Integration\n\n`splade-index` can naturally work with Hugging Face's `huggingface_hub`, allowing you to load and save your index to the model hub.\n\nFirst, make sure you have a valid [access token for the Hugging Face model hub](https://huggingface.co/settings/tokens). This is needed to save models to the hub, or to load private models. Once you created it, you can add it to your environment variables:\n\n```bash\nexport HF_TOKEN=\"hf_...\"\n```\n\nNow, let's install the `huggingface_hub` library:\n\n```bash\npip install huggingface_hub\n```\n\nLet's see how to use `SPLADE.save_to_hub` to save a SPLADE index to the Hugging Face model hub:\n\n```python\nimport os\nfrom sentence_transformers import SparseEncoder\nfrom splade_index import SPLADE\n\n# Download a SPLADE model from the 🤗 Hub\nmodel = SparseEncoder(\"rasyosef/splade-tiny\")\n\n# Create your corpus here\ncorpus = [\n    \"a cat is a feline and likes to purr\",\n    \"a dog is the human's best friend and loves to play\",\n    \"a bird is a beautiful animal that can fly\",\n    \"a fish is a creature that lives in water and swims\",\n]\n\n# Create the SPLADE retriever and index the corpus\nretriever = SPLADE()\nretriever.index(model=model, documents=corpus)\n\n# Set your username and token\nuser = \"your-username\"\ntoken = os.environ[\"HF_TOKEN\"]\nrepo_id = f\"{user}/splade-index-animals\"\n\n# Save the 
index on your huggingface account\nretriever.save_to_hub(repo_id, token=token)\n# You can also save it publicly with private=False\n```\n\nThen, you can use the following code to load a SPLADE index from the Hugging Face model hub:\n\n```python\nimport os\nfrom sentence_transformers import SparseEncoder\nfrom splade_index import SPLADE\n\n# Download a SPLADE model from the 🤗 Hub\nmodel = SparseEncoder(\"rasyosef/splade-tiny\")\n\n# Set your huggingface username and token\nuser = \"your-username\"\ntoken = os.environ[\"HF_TOKEN\"]\nrepo_id = f\"{user}/splade-index-animals\"\n\n# Load a SPLADE index from the Hugging Face model hub\nretriever = SPLADE.load_from_hub(repo_id, model=model, token=token)\n\n# Query the corpus\nqueries = [\"does the fish purr like a cat?\"]\n\n# Get top-k results as a tuple of (doc ids, documents, scores). All three are arrays of shape (n_queries, k).\nresults = retriever.retrieve(queries, k=2)\ndoc_ids, result_docs, scores = results.doc_ids, results.documents, results.scores\n\nfor i in range(doc_ids.shape[1]):\n    doc_id, doc, score = doc_ids[0, i], result_docs[0, i], scores[0, i]\n    print(f\"Rank {i+1} (score: {score:.2f}) (doc_id: {doc_id}): {doc}\")\n```\n\n## Performance\n\n`splade-index` with a `numba` backend gives `45%` faster query time on average than the [pyseismic-lsr](https://github.com/TusKANNy/seismic) library, which is \"an Efficient Inverted Index for Approximate Retrieval\", all while `splade-index` does exact retrieval with no approximations involved. \n\nThe query latency values shown include the query encoding times using the `naver/splade-v3-distilbert` SPLADE sparse encoder model.  
\n\n|Library|Latency per query (in milliseconds)|\n|:-|:-|\n|`splade-index` (with `numba` backend)|**1.77 ms**|\n|`splade-index` (with `numpy` backend)|2.44 ms|\n|`splade-index` (with `pytorch` backend)|2.61 ms|\n|`pyseismic-lsr`|3.24 ms|\n\nThe tests were conducted using **`100,231`** documents and **`5,000`** queries from the [sentence-transformers/natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions) dataset, and an NVIDIA Tesla T4 16GB GPU on Google Colab. \n\n## Examples\n\n- [`splade_index_usage_example.ipynb`](examples/splade_index_usage_example.ipynb) to index and query `1,000` documents on a CPU.\n\n- [`indexing_and_querying_100k_docs_with_gpu.ipynb`](examples/indexing_and_querying_100k_docs_with_gpu.ipynb) to index and query `100,000` documents on a GPU.\n\n### SPLADE Models\n\nYou can use SPLADE-Index with any SPLADE model from the Hugging Face Hub, such as the ones below.\n\n||Size (# Params)|MSMARCO MRR@10|BEIR-13 avg nDCG@10|\n|:---|:----|:-------------------|:------------------|\n|[naver/splade-v3](https://huggingface.co/naver/splade-v3)|110M|40.2|51.7|\n|[naver/splade-v3-distilbert](https://huggingface.co/naver/splade-v3-distilbert)|67.0M|38.7|50.0|\n|[rasyosef/splade-small](https://huggingface.co/rasyosef/splade-small)|28.8M|35.4|46.6|\n|[rasyosef/splade-mini](https://huggingface.co/rasyosef/splade-mini)|11.2M|34.1|44.5|\n|[rasyosef/splade-tiny](https://huggingface.co/rasyosef/splade-tiny)|4.4M|30.9|40.6|\n\n## Acknowledgement\n`splade-index` was built on top of the [bm25s](https://github.com/xhluca/bm25s) library, and makes use of its excellent inverted index implementation, originally used by `bm25s` for its many variants of the BM25 ranking algorithm. 
\n\n\u003c!-- ## Citation\n\nYou can refer to the library with this BibTeX:\n\n```bibtex\n@misc{SPLADE-Index,\n  title={SPLADE-Index: A Fast Inverted Search Index for SPLADE Sparse Retrieval Models},\n  author={Yosef Worku Alemneh},\n  url={https://github.com/rasyosef/splade-index},\n  year={2025}\n} \n``` --\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frasyosef%2Fsplade-index","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frasyosef%2Fsplade-index","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frasyosef%2Fsplade-index/lists"}