{"id":31581528,"url":"https://github.com/rasyosef/splade-tiny-msmarco","last_synced_at":"2026-05-17T17:01:52.580Z","repository":{"id":307313961,"uuid":"1022851414","full_name":"rasyosef/splade-tiny-msmarco","owner":"rasyosef","description":"Python code to train SPLADE sparse retrieval models based on BERT-Tiny (4M) and BERT-Mini (11M) by distilling a Cross-Encoder on the MSMARCO dataset","archived":false,"fork":false,"pushed_at":"2025-09-08T04:37:18.000Z","size":20,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-05T22:00:58.819Z","etag":null,"topics":["distillation","information-retrieval","msmarco","neural-information-retrieval","pytorch","sentence-transformers","sparse-retrieval","splade"],"latest_commit_sha":null,"homepage":"https://huggingface.co/collections/rasyosef/splade-tiny-msmarco-687c548c0691d95babf65b70","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rasyosef.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-20T01:20:15.000Z","updated_at":"2025-09-08T04:37:20.000Z","dependencies_parsed_at":"2025-07-30T17:02:59.001Z","dependency_job_id":"83ff24f1-7c91-4242-beaf-b4c25577e2fb","html_url":"https://github.com/rasyosef/splade-tiny-msmarco","commit_stats":null,"previous_names":["rasyosef/splade-tiny-msmarco"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rasyosef/splade-tiny-msmarco","r
epository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-tiny-msmarco","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-tiny-msmarco/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-tiny-msmarco/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-tiny-msmarco/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rasyosef","download_url":"https://codeload.github.com/rasyosef/splade-tiny-msmarco/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasyosef%2Fsplade-tiny-msmarco/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33147339,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T09:28:26.183Z","status":"ssl_error","status_checked_at":"2026-05-17T09:27:52.702Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distillation","information-retrieval","msmarco","neural-information-retrieval","pytorch","sentence-transformers","sparse-retrieval","splade"],"created_at":"2025-10-05T21:59:18.348Z","updated_at":"2026-05-17T17:01:52.537Z","avatar_url":"https://github.com/rasyosef.png","language":"Jupyter 
Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SPLADE Tiny MSMARCO\n\nThis repo contains Python code to train SPLADE sparse retrieval models based on BERT-Tiny (4M params), BERT-Mini (11M params), and BERT-Small (28.8M params) by distilling a Cross-Encoder on the MSMARCO dataset. The cross-encoder used was [ms-marco-MiniLM-L6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2). \n\nThe tiny SPLADE models beat `BM25` by `65.6 - 89.4%` on the MSMARCO benchmark. Even though `splade-mini` and `splade-tiny` are `6-15x` smaller than Naver's official [splade-v3-distilbert](https://huggingface.co/naver/splade-v3-distilbert), they retain `80-88%` of its performance on MSMARCO, all while producing sparser embedding vectors with up to `45%` fewer active dimensions. `splade-mini` even beats the `6x` larger `naver/splade_v2_max` on the MSMARCO benchmark.\n\nThe tiny SPLADE models are small enough to be used without a GPU on a dataset of a few thousand documents. 
\n\nYou can download the models from the following huggingface collection.\n\n- Models: https://huggingface.co/collections/rasyosef/splade-tiny-msmarco-687c548c0691d95babf65b70\n- Distillation Dataset: https://huggingface.co/datasets/yosefw/msmarco-train-distil-v2\n\n## Performance\n\nThe splade models were evaluated on 55 thousand queries and 8.84 million documents from the [MSMARCO](https://huggingface.co/datasets/microsoft/ms_marco) dataset.\n\n||Size (# Params)|Embedding Type|MSMARCO MRR@10|Recall@10|Corpus Active Dims|\n|:-|:------------|:-------------|:-------------|:--------|:-----------------|\n|**BM25**|-|-|18.0|37.8|-|\n|**[rasyosef/splade-tiny](https://huggingface.co/rasyosef/splade-tiny)**|4.4M|sparse|30.9|55.4|127.1|\n|**[rasyosef/splade-mini](https://huggingface.co/rasyosef/splade-mini)**|11.2M|sparse|34.1|60.3|186.6|\n|**[rasyosef/splade-small](https://huggingface.co/rasyosef/splade-small)**|28.8M|sparse|35.4|62.4|176.9|\n|**[naver/splade-v3-distilbert](https://huggingface.co/naver/splade-v3-distilbert)**|67.0M|sparse|38.7|66.8|192.3|\n\nHere are a few Dense Embedding models evaluated for comparison\n\n||Size (# Params)|Embedding Type|MSMARCO MRR@10|Recall@10|Embedding Dims|\n|:-|:------------|:-------------|:-------------|:--------|:-------------|\n|**Snowflake/snowflake-arctic-embed-s**|33.2M |dense|33.7|60.7|384|\n|**intfloat/e5-small-v2**|33.4M|dense|34.4|61.8|384|\n|**Snowflake/snowflake-arctic-embed-m-v1.5**|109.0M|dense|35.2|63.6|768|\n\n## Sample Inference Code\n\n### Direct Usage (Sentence Transformers)\n\nFirst install the Sentence Transformers library:\n\n```bash\npip install -U sentence-transformers\n```\n\nThen you can load this model and run inference.\n\n```python\nfrom sentence_transformers import SparseEncoder\n\n# Download from the 🤗 Hub\nmodel = SparseEncoder(\"rasyosef/splade-tiny\")\n\n# Run inference\nsentences = [\n    \"The weather is lovely today.\",\n    \"It's so sunny outside!\",\n    \"He drove to the 
stadium.\",\n]\nembeddings = model.encode(sentences)\nprint(embeddings.shape)\n# (3, 30522)\n\n# Get the similarity scores for the embeddings\nsimilarities = model.similarity(embeddings, embeddings)\nprint(similarities)\n# tensor([[39.7253,  7.1662,  0.0000],\n#         [ 7.1662, 27.0255,  0.1385],\n#         [ 0.0000,  0.1385, 26.3539]])\n\n# Let's decode our embeddings to be able to interpret them\ndecoded = model.decode(embeddings, top_k=10)\nfor decoded, sentence in zip(decoded, sentences):\n    print(f\"Sentence: {sentence}\")\n    print(f\"Decoded: {decoded}\")\n    print()\n```\n\n```\nSentence: The weather is lovely today.\nDecoded: [('today', 2.543731451034546), ('lovely', 2.1207380294799805), ('weather', 2.043243646621704), ('summers', 2.0363612174987793), ('cool', 1.8053990602493286), ('darling', 1.4539366960525513), ('now', 1.3975915908813477), ('beautiful', 1.3838205337524414), ('nice', 1.2771646976470947), ('worthy', 1.2120126485824585)]\n\nSentence: It's so sunny outside!\nDecoded: [('outside', 2.2667503356933594), ('sunny', 2.188624382019043), ('cool', 1.8421072959899902), ('so', 1.8326992988586426), ('ahead', 1.439140796661377), ('darling', 1.3871415853500366), ('it', 1.2396169900894165), ('across', 0.9793394804000854), ('sunshine', 0.9226517081260681), ('rocky', 0.8372038006782532)]\n\nSentence: He drove to the stadium.\nDecoded: [('drove', 2.0859971046447754), ('stadium', 2.0446298122406006), ('he', 1.7063332796096802), ('team', 1.4266990423202515), ('move', 1.3472365140914917), ('jumped', 1.1752349138259888), ('driving', 1.1558808088302612), ('ride', 1.1327213048934937), ('run', 1.0909342765808105), ('drive', 1.0640281438827515)]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frasyosef%2Fsplade-tiny-msmarco","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frasyosef%2Fsplade-tiny-msmarco","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frasyosef%2Fsplade-tiny-msmarco/lists"}