https://github.com/bizreach-inc/light-splade

Provides a minimal PyTorch implementation of SPLADE


Host: GitHub
URL: https://github.com/bizreach-inc/light-splade
Owner: bizreach-inc
License: apache-2.0
Created: 2025-10-07T03:49:50.000Z (8 months ago)
Default Branch: main
Last Pushed: 2026-01-05T02:34:08.000Z (5 months ago)
Last Synced: 2026-01-13T10:20:35.412Z (5 months ago)
Topics: information-retreival, neural-ir, nlp, sparse-retrieval, splade
Language: Python
Homepage: https://bizreach-inc.github.io/light-splade/
Size: 2.92 MB
Stars: 14
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS


# light-splade

`light-splade` provides a minimal yet extensible PyTorch implementation of `SPLADE`, a family of sparse neural retrievers that expand queries and documents into interpretable sparse representations.

Unlike dense retrievers, SPLADE produces `sparse vectors in the vocabulary space`, making it both `efficient to index` with standard IR engines (e.g., Lucene, Elasticsearch) and `interpretable`, while achieving strong retrieval effectiveness. It was first introduced in the paper “[SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://arxiv.org/abs/2107.05720)”.
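To illustrate why vocabulary-space vectors fit standard inverted-index engines, here is a minimal pure-Python sketch. It is not part of `light-splade`: the helper names `build_inverted_index` and `search` are hypothetical, and the token weights are made up for illustration. Each token maps to a postings list, and a query is scored by a dot product over the tokens it shares with each document.

```python
from collections import defaultdict

# Toy sparse vectors (token -> weight). The weights are invented for this sketch.
docs = {
    "d1": {"首都": 1.8, "日本": 1.8, "東京": 1.7},
    "d2": {"大阪": 1.3, "万博": 1.2, "開催": 1.5, "東京": 0.4},
}


def build_inverted_index(docs):
    """Map each token to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query_vec):
    """Score documents by the dot product over tokens shared with the query."""
    scores = defaultdict(float)
    for token, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = build_inverted_index(docs)
query = {"東京": 1.0, "首都": 0.5}
ranking = search(index, query)
print(ranking)
```

Only the postings of the query's non-zero tokens are visited, which is what keeps sparse retrieval cheap, and the per-token contributions to each score can be read off directly.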

This repository is designed for:

- Practitioners wanting to `train SPLADE models on custom corpora`.

- Developers experimenting with `sparse lexical expansion` at scale.

- Researchers looking for a `reference implementation`.

We currently support `SPLADE v2` and `SPLADE++`.

## Features

- Training pipeline for SPLADE using PyTorch + HuggingFace Transformers.

- Support for `distillation training` from dense retrievers (e.g., ColBERT, dense BERT).

- Export trained models into sparse representations compatible with IR systems.

- Simple, lightweight, and easy to extend for experiments.

## Installation

```
pip install light-splade
```

## Quickstart

The following code uses [bizreach-inc/light-splade-japanese-28M](https://huggingface.co/bizreach-inc/light-splade-japanese-28M), an open SPLADE model for Japanese.

- **Convert text to a sparse vector with a SPLADE model using this package**

```python
import torch
from light_splade import SpladeEncoder

# Initialize the encoder
encoder = SpladeEncoder(model_path="bizreach-inc/light-splade-japanese-28M")

# Input texts to encode
corpus = [
    "日本の首都は東京です。",  # "The capital of Japan is Tokyo."
    "大阪万博は2025年に開催されます。"  # "The Osaka Expo will be held in 2025."
]

# Generate sparse representation
with torch.inference_mode():
    embeddings = encoder.encode(corpus)
    sparse_vecs = encoder.to_sparse(embeddings)

print(sparse_vecs[0])
print(sparse_vecs[1])
```

- **Convert text to a sparse vector with a SPLADE model using the `transformers` package**

Install the required packages:

```
pip install fugashi torch transformers unidic-lite
```

Then execute the following Python code:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def dense_to_sparse(dense: torch.Tensor, idx2token: dict[int, str]) -> list[dict[str, float]]:
    """Convert a (batch, vocab) weight matrix into one {token: weight} dict per text."""
    rows, cols = dense.nonzero(as_tuple=True)
    weights = dense[rows, cols].tolist()
    rows = rows.tolist()
    cols = cols.tolist()
    sparse_vecs = [{} for _ in range(dense.size(0))]
    for row, col, weight in zip(rows, cols, weights):
        sparse_vecs[row][idx2token[col]] = round(weight, 2)
    # Sort each vector by descending weight
    for i in range(len(sparse_vecs)):
        sparse_vecs[i] = dict(sorted(sparse_vecs[i].items(), key=lambda x: x[1], reverse=True))
    return sparse_vecs


MODEL_PATH = "bizreach-inc/light-splade-japanese-28M"
device = "cuda" if torch.cuda.is_available() else "cpu"

transformer = AutoModelForMaskedLM.from_pretrained(MODEL_PATH).to(device)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}

corpus = [
    "日本の首都は東京です。",  # "The capital of Japan is Tokyo."
    "大阪万博は2025年に開催されます。"  # "The Osaka Expo will be held in 2025."
]

# Tokenize the input texts and move the tensors to the target device
token_outputs = tokenizer(corpus, padding=True, return_tensors="pt")
token_outputs = {key: value.to(device) for key, value in token_outputs.items()}
attention_mask = token_outputs["attention_mask"]

with torch.inference_mode():
    outputs = transformer(**token_outputs)
    # Max-pool log(1 + ReLU(logits)) over the sequence, ignoring padding positions
    dense, _ = torch.max(
        torch.log(1 + torch.relu(outputs.logits)) * attention_mask.unsqueeze(-1),
        dim=1,
    )

sparse_vecs = dense_to_sparse(dense, idx2token)

print(sparse_vecs[0])
print(sparse_vecs[1])
```

- **Output**

```python
{'首都': 1.83, '日本': 1.82, '東京': 1.78, '中立': 0.73, '都会': 0.69, '駒': 0.68, '州都': 0.67, '首相': 0.64, '足立': 0.62, 'です': 0.61, '都市': 0.54, 'ユニ': 0.54, '京都': 0.52, '国': 0.51, '発表': 0.49, '成田': 0.48, '太陽': 0.45, '藤原': 0.45, '私立': 0.42, '王国': 0.4...}
{'202': 1.61, '開催': 1.49, '大阪': 1.34, '万博': 1.19, '東京': 1.15, '年': 1.1, 'いつ': 1.05, '##5': 1.03, '203': 0.86, '月': 0.8, '期間': 0.79, '高槻': 0.79, '京都': 0.7, '神戸': 0.62, '2024': 0.54, '夢': 0.52, '206': 0.52, '姫路': 0.51, '行わ': 0.49, 'こう': 0.49, '芸術': 0.48...}
```
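The pooling step in the `transformers` example can be easier to follow on a tiny hand-made input. The sketch below is a pure-Python restatement of that step, assuming the same rule as the `torch.max` call: for each vocabulary entry, take the maximum of `log(1 + ReLU(logit))` over non-padding positions. The function name `splade_pool` and the logit values are hypothetical and chosen only for illustration.

```python
import math


def splade_pool(logits, attention_mask):
    """Max-pool log(1 + ReLU(logit)) over non-padding positions.

    logits: one list per token position, each holding one score per vocabulary entry.
    attention_mask: one 0/1 flag per position (0 marks padding).
    """
    vocab_size = len(logits[0])
    pooled = [0.0] * vocab_size  # weights are never negative, so 0.0 is a safe start
    for position_scores, keep in zip(logits, attention_mask):
        for j, score in enumerate(position_scores):
            weight = math.log1p(max(score, 0.0)) * keep
            pooled[j] = max(pooled[j], weight)
    return pooled


# Made-up logits: 3 positions (the last is padding) over a 4-entry vocabulary.
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, removed by the mask
]
attention_mask = [1, 1, 0]

weights = splade_pool(logits, attention_mask)
# Keep only non-zero entries, as dense_to_sparse does with token strings
sparse = {j: round(w, 2) for j, w in enumerate(weights) if w > 0}
print(sparse)
```

Vocabulary entries whose logits are never positive end up with weight zero and drop out of the result, which is where the sparsity of the representation comes from.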

## Setup for fine-tuning a SPLADE model

- Python 3.11+.

- Recommended: use the `uv` tool to manage the virtual environment (see the [Getting started](docs/getting_started.md) document).

Quick setup (recommended):

```bash
git clone https://github.com/bizreach-inc/light-splade.git
cd light-splade

# create and activate virtual env using uv
uv venv --seed .venv
source .venv/bin/activate
uv sync
```

For developer checks, run:

```bash
uv run pre-commit run --all-files
uv run pytest
```

## Train SPLADE with toy dataset (triplet-based)

- `uv run examples/run_train_splade_triplet.py --config-name toy_splade_ja`

- To run in an environment without a GPU, see the [troubleshooting guide](docs/trouble_shooting.md#running-the-training-script-on-cpu-only-machines).

For full run instructions using `uv` and `Docker` commands, see [Getting started](docs/getting_started.md).

## Input data format

Detailed data format docs:

- [Triplet format](docs/splade_triplet_data_format.md) (`SPLADE v2`)

- [Distillation format](docs/splade_triplet_distil_data_format.md) (`SPLADE++` or `SPLADE v2bis`)

## References

- [SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval](https://arxiv.org/abs/2109.10086). arXiv (SPLADE v2)

  - Thibault Formal, Benjamin Piwowarski, Carlos Lassance, Stéphane Clinchant.

- [From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective](http://arxiv.org/abs/2205.04733). SIGIR 2022 short paper (SPLADE++ or SPLADE v2bis)

  - Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant.

- For `transformers` docs:

  - [Trainer docs (transformers v4.56.1)](https://huggingface.co/docs/transformers/v4.56.1/en/main_classes/trainer)

  - [TrainingArguments docs (transformers v4.56.1)](https://huggingface.co/docs/transformers/v4.56.1/en/main_classes/trainer#transformers.TrainingArguments)

## License

This project is licensed under the Apache License, Version 2.0; see the `LICENSE` file for details.

Copyright 2025 BizReach, Inc.
