https://github.com/rasyosef/splade-tiny-msmarco

Python code to train SPLADE sparse retrieval models based on BERT-Tiny (4M) and BERT-Mini (11M) by distilling a Cross-Encoder on the MSMARCO dataset

distillation information-retrieval msmarco neural-information-retrieval pytorch sentence-transformers sparse-retrieval splade

Host: GitHub
URL: https://github.com/rasyosef/splade-tiny-msmarco
Owner: rasyosef
Created: 2025-07-20T01:20:15.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-09-08T04:37:18.000Z (9 months ago)
Last Synced: 2025-10-05T22:00:58.819Z (8 months ago)
Topics: distillation, information-retrieval, msmarco, neural-information-retrieval, pytorch, sentence-transformers, sparse-retrieval, splade
Language: Jupyter Notebook
Homepage: https://huggingface.co/collections/rasyosef/splade-tiny-msmarco-687c548c0691d95babf65b70
Size: 19.5 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# SPLADE Tiny MSMARCO

This repo contains Python code to train SPLADE sparse retrieval models based on BERT-Tiny (4M params), BERT-Mini (11M params), and BERT-Small (28.8M params) by distilling a Cross-Encoder on the MSMARCO dataset. The cross-encoder used was [ms-marco-MiniLM-L6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2). 

The tiny SPLADE models beat `BM25` by `65.6 - 89.4%` on the MSMARCO benchmark. Even though `splade-mini` and `splade-tiny` are `6-15x` smaller than Naver's official [splade-v3-distilbert](https://huggingface.co/naver/splade-v3-distilbert), they retain `80-88%` of its performance on MSMARCO while producing sparser embedding vectors with up to `45%` fewer active dimensions. `splade-mini` even beats the `6x` larger `naver/splade_v2_max` on the MSMARCO benchmark.
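To make "active dimensions" concrete, here is a minimal sketch that counts the non-zero entries of a sparse embedding and averages them over a set of vectors. It assumes that an active dimension is simply a non-zero weight; the repo's exact counting rule is not shown here, and the toy vectors and the helper names `active_dims` and `mean_active_dims` are made up for illustration.

```python
def active_dims(vector, eps=0.0):
    """Count entries whose weight is strictly greater than eps."""
    return sum(1 for w in vector if w > eps)


def mean_active_dims(vectors):
    """Average number of active dimensions over a collection of vectors."""
    return sum(active_dims(v) for v in vectors) / len(vectors)


# Toy 6-dimensional vectors (hypothetical). Real SPLADE vectors have one
# entry per vocabulary token, e.g. 30522 in the inference example below.
toy_vectors = [
    [0.0, 2.5, 0.0, 1.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
    [1.1, 0.0, 0.3, 0.0, 0.0, 2.0],
]

print(mean_active_dims(toy_vectors))  # 2.0
```

Fewer active dimensions generally means shorter posting lists in an inverted index, which is why sparsity matters for retrieval cost.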

The tiny SPLADE models are small enough to be used without a GPU on a dataset of a few thousand documents. 

The models and the distillation dataset are available on Hugging Face:

- Models: https://huggingface.co/collections/rasyosef/splade-tiny-msmarco-687c548c0691d95babf65b70

- Distillation Dataset: https://huggingface.co/datasets/yosefw/msmarco-train-distil-v2

## Performance

The SPLADE models were evaluated on 55 thousand queries and 8.84 million documents from the [MSMARCO](https://huggingface.co/datasets/microsoft/ms_marco) dataset.

|Model|Size (# Params)|Embedding Type|MSMARCO MRR@10|Recall@10|Corpus Active Dims|
|:-|:------------|:-------------|:-------------|:--------|:-----------------|
|**BM25**|-|-|18.0|37.8|-|
|**[rasyosef/splade-tiny](https://huggingface.co/rasyosef/splade-tiny)**|4.4M|sparse|30.9|55.4|127.1|
|**[rasyosef/splade-mini](https://huggingface.co/rasyosef/splade-mini)**|11.2M|sparse|34.1|60.3|186.6|
|**[rasyosef/splade-small](https://huggingface.co/rasyosef/splade-small)**|28.8M|sparse|35.4|62.4|176.9|
|**[naver/splade-v3-distilbert](https://huggingface.co/naver/splade-v3-distilbert)**|67.0M|sparse|38.7|66.8|192.3|
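As a quick sanity check, the size and retained-performance figures quoted in the introduction can be re-derived from the parameter counts and MRR@10 values in the table above. This is only arithmetic on the published numbers; the helper name `compare` is made up for illustration.

```python
# (parameters in millions, MSMARCO MRR@10), copied from the table above
models = {
    "rasyosef/splade-tiny": (4.4, 30.9),
    "rasyosef/splade-mini": (11.2, 34.1),
    "rasyosef/splade-small": (28.8, 35.4),
}

# Reference model: naver/splade-v3-distilbert
ref_params, ref_mrr = 67.0, 38.7


def compare(params, mrr):
    """Return (how many times smaller, % of reference MRR@10 retained)."""
    size_ratio = ref_params / params
    retained = 100.0 * mrr / ref_mrr
    return size_ratio, retained


for name, (params, mrr) in models.items():
    size_ratio, retained = compare(params, mrr)
    print(f"{name}: {size_ratio:.1f}x smaller, retains {retained:.1f}% of MRR@10")
```

For `splade-tiny` and `splade-mini` this gives roughly 15x and 6x size reductions at about 80% and 88% of the reference MRR@10, in line with the ranges quoted in the introduction.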

For comparison, here are a few dense embedding models evaluated on the same benchmark:

|Model|Size (# Params)|Embedding Type|MSMARCO MRR@10|Recall@10|Embedding Dims|
|:-|:------------|:-------------|:-------------|:--------|:-------------|
|**Snowflake/snowflake-arctic-embed-s**|33.2M|dense|33.7|60.7|384|
|**intfloat/e5-small-v2**|33.4M|dense|34.4|61.8|384|
|**Snowflake/snowflake-arctic-embed-m-v1.5**|109.0M|dense|35.2|63.6|768|
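For readers unfamiliar with the two metrics in these tables, the sketch below shows the standard definitions of MRR@10 and Recall@10 on toy data. It is not the evaluation code used by this repo: the function names, document IDs, and rankings are hypothetical, and the scaling by 100 assumes the tables report percentages.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical rankings: query id -> (ranked document ids, relevant document ids)
runs = {
    "q1": (["d3", "d7", "d1"], {"d7"}),        # first hit at rank 2
    "q2": (["d2", "d9", "d4"], {"d4", "d8"}),  # first hit at rank 3, 1 of 2 found
}

mean_mrr = sum(mrr_at_k(r, rel) for r, rel in runs.values()) / len(runs)
mean_recall = sum(recall_at_k(r, rel) for r, rel in runs.values()) / len(runs)
print(f"MRR@10 = {100 * mean_mrr:.1f}, Recall@10 = {100 * mean_recall:.1f}")
# MRR@10 = 41.7, Recall@10 = 75.0
```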

## Sample Inference Code

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("rasyosef/splade-tiny")

# Run inference
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[39.7253,  7.1662,  0.0000],
#         [ 7.1662, 27.0255,  0.1385],
#         [ 0.0000,  0.1385, 26.3539]])

# Let's decode our embeddings to be able to interpret them
decoded_batch = model.decode(embeddings, top_k=10)
for decoded, sentence in zip(decoded_batch, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded}")
    print()
```

```
Sentence: The weather is lovely today.
Decoded: [('today', 2.543731451034546), ('lovely', 2.1207380294799805), ('weather', 2.043243646621704), ('summers', 2.0363612174987793), ('cool', 1.8053990602493286), ('darling', 1.4539366960525513), ('now', 1.3975915908813477), ('beautiful', 1.3838205337524414), ('nice', 1.2771646976470947), ('worthy', 1.2120126485824585)]

Sentence: It's so sunny outside!
Decoded: [('outside', 2.2667503356933594), ('sunny', 2.188624382019043), ('cool', 1.8421072959899902), ('so', 1.8326992988586426), ('ahead', 1.439140796661377), ('darling', 1.3871415853500366), ('it', 1.2396169900894165), ('across', 0.9793394804000854), ('sunshine', 0.9226517081260681), ('rocky', 0.8372038006782532)]

Sentence: He drove to the stadium.
Decoded: [('drove', 2.0859971046447754), ('stadium', 2.0446298122406006), ('he', 1.7063332796096802), ('team', 1.4266990423202515), ('move', 1.3472365140914917), ('jumped', 1.1752349138259888), ('driving', 1.1558808088302612), ('ride', 1.1327213048934937), ('run', 1.0909342765808105), ('drive', 1.0640281438827515)]
```
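The decoded output also helps explain the similarity matrix. The sketch below recomputes a partial similarity between the first two sentences from their top-10 terms only, assuming the score is a plain dot product (the non-normalized diagonal of the matrix suggests this, but it is an assumption). The helper `sparse_dot` is made up for illustration and is not part of Sentence Transformers.

```python
# Top-10 (token, weight) pairs copied from the decoded output above
weather = dict([('today', 2.543731451034546), ('lovely', 2.1207380294799805),
                ('weather', 2.043243646621704), ('summers', 2.0363612174987793),
                ('cool', 1.8053990602493286), ('darling', 1.4539366960525513),
                ('now', 1.3975915908813477), ('beautiful', 1.3838205337524414),
                ('nice', 1.2771646976470947), ('worthy', 1.2120126485824585)])
sunny = dict([('outside', 2.2667503356933594), ('sunny', 2.188624382019043),
              ('cool', 1.8421072959899902), ('so', 1.8326992988586426),
              ('ahead', 1.439140796661377), ('darling', 1.3871415853500366),
              ('it', 1.2396169900894165), ('across', 0.9793394804000854),
              ('sunshine', 0.9226517081260681), ('rocky', 0.8372038006782532)])


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[t] for t, w in a.items() if t in b)


# Only tokens present in both vectors contribute to the score
print(sorted(set(weather) & set(sunny)))     # ['cool', 'darling']
print(round(sparse_dot(weather, sunny), 2))  # 5.34
```

The two shared terms account for about 5.34 of the 7.1662 reported by `model.similarity`; the remainder would come from terms beyond the top 10, since SPLADE weights are non-negative and truncation can only lower the score. The top-10 lists for the first and third sentences share no tokens, which is consistent with their reported similarity of 0.0000.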
