{"id":37074739,"url":"https://github.com/bizreach-inc/light-splade","last_synced_at":"2026-01-14T08:47:46.539Z","repository":{"id":318848362,"uuid":"1071224557","full_name":"bizreach-inc/light-splade","owner":"bizreach-inc","description":"Provides a minimal PyTorch implementation of SPLADE","archived":false,"fork":false,"pushed_at":"2026-01-05T02:34:08.000Z","size":3057,"stargazers_count":14,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-13T10:20:35.412Z","etag":null,"topics":["information-retreival","neural-ir","nlp","sparse-retrieval","splade"],"latest_commit_sha":null,"homepage":"https://bizreach-inc.github.io/light-splade/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bizreach-inc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-07T03:49:50.000Z","updated_at":"2026-01-04T13:04:40.000Z","dependencies_parsed_at":"2025-10-16T18:54:41.884Z","dependency_job_id":"ef095db7-d924-4497-a78c-0c5adf4a17b0","html_url":"https://github.com/bizreach-inc/light-splade","commit_stats":null,"previous_names":["bizreach-inc/light-splade"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bizreach-inc/light-splade","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bizreach-inc%2Flight-splade","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bizreach-inc%2Flight-sp
lade/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bizreach-inc%2Flight-splade/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bizreach-inc%2Flight-splade/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bizreach-inc","download_url":"https://codeload.github.com/bizreach-inc/light-splade/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bizreach-inc%2Flight-splade/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414693,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-retreival","neural-ir","nlp","sparse-retrieval","splade"],"created_at":"2026-01-14T08:47:45.851Z","updated_at":"2026-01-14T08:47:46.522Z","avatar_url":"https://github.com/bizreach-inc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!---\nCopyright 2025 BizReach, Inc. 
All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n--\u003e\n\n# light-splade\n\n`light-splade` provides a minimal yet extensible PyTorch implementation of `SPLADE`, a family of sparse neural retrievers that expand queries and documents into interpretable sparse representations.\n\nUnlike dense retrievers, SPLADE produces `sparse vectors in the vocabulary space`, making it both `efficient to index` with standard IR engines (e.g., Lucene, Elasticsearch) and `interpretable`, while achieving strong retrieval effectiveness. 
It was first introduced in the paper “[SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://arxiv.org/abs/2107.05720)”.\n\nThis repository is designed for:\n\n- Practitioners wanting to `train SPLADE models on custom corpora`.\n- Developers experimenting with `sparse lexical expansion` at scale.\n- Researchers looking for a `reference implementation`.\n\nWe currently support `SPLADE v2` and `SPLADE++`.\n\n## Features\n- Training pipeline for SPLADE using PyTorch + HuggingFace Transformers.\n- Support for `distillation training` from dense retrievers (e.g., ColBERT, dense BERT).\n- Export trained models into sparse representations compatible with IR systems.\n- Simple, lightweight, and easy to extend for experiments.\n\n## Installation\n\n```\npip install light-splade\n```\n\n\n\n\n## Quickstart\n\nThe following code uses [bizreach-inc/light-splade-japanese-28M](https://huggingface.co/bizreach-inc/light-splade-japanese-28M), an open SPLADE model for Japanese.\n\n- **Convert text to a sparse vector with a SPLADE model using this package**\n\n\n```python\nimport torch\nfrom light_splade import SpladeEncoder\n\n# Initialize the encoder\nencoder = SpladeEncoder(model_path=\"bizreach-inc/light-splade-japanese-28M\")\n\n# Define the input corpus\ncorpus = [\n    \"日本の首都は東京です。\",\n    \"大阪万博は2025年に開催されます。\"\n]\n\n# Generate sparse representation\nwith torch.inference_mode():\n    embeddings = encoder.encode(corpus)\n    sparse_vecs = encoder.to_sparse(embeddings)\n\nprint(sparse_vecs[0])\nprint(sparse_vecs[1])\n```\n\n- **Convert text to a sparse vector with a SPLADE model using the `transformers` package**\n\nInstall the required packages:\n\n```\npip install fugashi torch transformers unidic-lite\n```\n\nThen run the following Python code:\n\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForMaskedLM\n\n\ndef dense_to_sparse(dense: torch.Tensor, idx2token: dict[int, str]) -\u003e list[dict[str, float]]:\n    rows, cols = 
dense.nonzero(as_tuple=True)\n    rows = rows.tolist()\n    cols = cols.tolist()\n    weights = dense[rows, cols].tolist()\n\n    sparse_vecs = [{} for _ in range(dense.size(0))]\n    for row, col, weight in zip(rows, cols, weights):\n        sparse_vecs[row][idx2token[col]] = round(weight, 2)\n\n    for i in range(len(sparse_vecs)):\n        sparse_vecs[i] = dict(sorted(sparse_vecs[i].items(), key=lambda x: x[1], reverse=True))\n    return sparse_vecs\n\n\nMODEL_PATH = \"bizreach-inc/light-splade-japanese-28M\"\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ntransformer = AutoModelForMaskedLM.from_pretrained(MODEL_PATH).to(device)\ntokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)\nidx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}\n\ncorpus = [\n    \"日本の首都は東京です。\",\n    \"大阪万博は2025年に開催されます。\"\n]\ntoken_outputs = tokenizer(corpus, padding=True, return_tensors=\"pt\")\nattention_mask = token_outputs[\"attention_mask\"].to(device)\ntoken_outputs = {key: value.to(device) for key, value in token_outputs.items()}\n\nwith torch.inference_mode():\n    outputs = transformer(**token_outputs)\n    dense, _ = torch.max(\n        torch.log(1 + torch.relu(outputs.logits)) * attention_mask.unsqueeze(-1),\n        dim=1,\n    )\nsparse_vecs = dense_to_sparse(dense, idx2token)\n\nprint(sparse_vecs[0])\nprint(sparse_vecs[1])\n```\n\n- **Output**\n\n```python\n{'首都': 1.83, '日本': 1.82, '東京': 1.78, '中立': 0.73, '都会': 0.69, '駒': 0.68, '州都': 0.67, '首相': 0.64, '足立': 0.62, 'です': 0.61, '都市': 0.54, 'ユニ': 0.54, '京都': 0.52, '国': 0.51, '発表': 0.49, '成田': 0.48, '太陽': 0.45, '藤原': 0.45, '私立': 0.42, '王国': 0.4...}\n{'202': 1.61, '開催': 1.49, '大阪': 1.34, '万博': 1.19, '東京': 1.15, '年': 1.1, 'いつ': 1.05, '##5': 1.03, '203': 0.86, '月': 0.8, '期間': 0.79, '高槻': 0.79, '京都': 0.7, '神戸': 0.62, '2024': 0.54, '夢': 0.52, '206': 0.52, '姫路': 0.51, '行わ': 0.49, 'こう': 0.49, '芸術': 0.48...}\n```\n\n\n## Setup for fine-tuning a SPLADE model\n\n- Python 3.11+.\n- Recommended: use the 
`uv` tool to manage the virtual environment (see the [Getting started](docs/getting_started.md) document).\n\nQuick setup (recommended):\n\n```bash\ngit clone https://github.com/bizreach-inc/light-splade.git\ncd light-splade\n# create and activate virtual env using uv\nuv venv --seed .venv\nsource .venv/bin/activate\nuv sync\n```\n\nFor developer checks, run:\n\n```bash\nuv run pre-commit run --all-files\nuv run pytest\n```\n\n\n## Train SPLADE with toy dataset (triplet-based)\n- `uv run examples/run_train_splade_triplet.py --config-name toy_splade_ja`\n- To run in an environment without a GPU, see the [troubleshooting guide](docs/trouble_shooting.md#running-the-training-script-on-cpu-only-machines).\n\nFor full run instructions using `uv` and `Docker` commands, see [Getting started](docs/getting_started.md).\n\n## Input Data format\n\nDetailed data format docs:\n\n- [Triplet format](docs/splade_triplet_data_format.md) (`SPLADE v2`)\n- [Distillation format](docs/splade_triplet_distil_data_format.md) (`SPLADE++` or `SPLADE v2bis`)\n\n## References\n\n- [SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval](https://arxiv.org/abs/2109.10086). arXiv (SPLADE v2)\n  - Thibault Formal, Benjamin Piwowarski, Carlos Lassance, Stéphane Clinchant.\n\n- [From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective](http://arxiv.org/abs/2205.04733). 
SIGIR 2022 short paper (SPLADE++ or SPLADE v2bis)\n  - Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant.\n\n- For `transformers` docs:\n  - [Trainer docs (transformers v4.56.1)](https://huggingface.co/docs/transformers/v4.56.1/en/main_classes/trainer)\n  - [TrainingArguments docs (transformers v4.56.1)](https://huggingface.co/docs/transformers/v4.56.1/en/main_classes/trainer#transformers.TrainingArguments)\n\n\n## License\n\nThis project is licensed under the Apache License, Version 2.0; see the `LICENSE` file for details.\n\nCopyright 2025 BizReach, Inc.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbizreach-inc%2Flight-splade","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbizreach-inc%2Flight-splade","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbizreach-inc%2Flight-splade/lists"}