{"id":46481413,"url":"https://github.com/gedankrayze/splade-model-trainer","last_synced_at":"2026-03-06T08:16:23.043Z","repository":{"id":287441652,"uuid":"958547569","full_name":"gedankrayze/splade-model-trainer","owner":"gedankrayze","description":"A comprehensive toolkit for training, evaluating, and deploying SPLADE models","archived":false,"fork":false,"pushed_at":"2026-02-14T11:30:34.000Z","size":283,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-14T19:59:30.369Z","etag":null,"topics":["model-training","model-training-and-evaluation","splade","splade-model"],"latest_commit_sha":null,"homepage":"https://gedankrayze.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gedankrayze.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-01T11:28:58.000Z","updated_at":"2026-02-14T11:30:36.000Z","dependencies_parsed_at":"2025-04-11T19:36:35.078Z","dependency_job_id":null,"html_url":"https://github.com/gedankrayze/splade-model-trainer","commit_stats":null,"previous_names":["gedankrayze/splade-model-trainer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gedankrayze/splade-model-trainer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gedankrayze%2Fsplade-model-trainer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gedankrayze%2Fsplade-model-trainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gedankrayze%2Fsplade-model-trainer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gedankrayze%2Fsplade-model-trainer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gedankrayze","download_url":"https://codeload.github.com/gedankrayze/splade-model-trainer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gedankrayze%2Fsplade-model-trainer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30167208,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T07:56:45.623Z","status":"ssl_error","status_checked_at":"2026-03-06T07:55:55.621Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["model-training","model-training-and-evaluation","splade","splade-model"],"created_at":"2026-03-06T08:16:22.553Z","updated_at":"2026-03-06T08:16:23.036Z","avatar_url":"https://github.com/gedankrayze.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Gedank Rayze SPLADE Model Trainer\n\nA comprehensive toolkit for training, evaluating, and deploying SPLADE (SParse Lexical AnD Expansion) models for\nefficient information retrieval.\n\n## Overview\n\nSPLADE is a state-of-the-art approach to information retrieval that combines the efficiency of sparse retrievers with\nthe effectiveness of neural language models. The SPLADE model uses a sparse representation that captures lexical\nmatching while also handling term expansion, making it powerful for search applications.\n\nWhile our primary focus is on SPLADE models, we also provide complementary support for training dense embedding models and hybrid approaches that can be used alongside SPLADE for certain use cases.\n\n## Project Structure\n\n- `src/` - Source code for the SPLADE model trainer\n- `tests/code/` - Unit tests and integration tests\n- `docs/` - Documentation for the project\n- `articles/` - Articles and blog posts about SPLADE and usage of this toolkit\n- `fine_tuned_*/` - Output directories for trained models (not included in version control)\n\n## New Features\n\n### Domain Distiller\n\nThe new Domain Distiller tool allows you to generate domain-specific training data for SPLADE models from scratch using LLMs. Key features include:\n\n- **Zero-Shot Training Data Generation**: Create training data for any domain without pre-existing datasets\n- **Domain Bootstrapping**: Automatically generate domain knowledge, terminology, and concepts\n- **Contrastive Pair Generation**: Create high-quality negative examples using advanced contrastive strategies\n- **Multi-Language Support**: Generate data in English, German, Spanish, French, and more\n- **OpenAI-Compatible API Support**: Works with OpenAI, Anthropic, or any compatible API endpoint\n\nQuick start with Domain Distiller:\n\n```bash\n# Generate domain-specific training data\npython -m src.domain_distiller.cli pipeline --domain legal --language en --queries 100 --contrastive\n\n# Train a SPLADE model with the generated data\npython train_splade_unified.py --train-file ./distilled_data/legal_en_splade.json --output-dir ./fine_tuned_splade\n```\n\nSee [docs/domain_distiller.md](docs/domain_distiller.md) for detailed documentation and [docs/contrastive_generation.md](docs/contrastive_generation.md) for information about contrastive pair generation.\n\n### Custom Templates\n\nThe toolkit now supports custom templates for generating domain-specific training data:\n\n```bash\n# Using a built-in template\npython -m src.generate_training_data \\\n  --input-dir ./documents \\\n  --output-file training_data.json \\\n  --template legal\n\n# Using a custom template file\npython -m src.generate_training_data \\\n  --input-dir ./documents \\\n  --output-file training_data.json \\\n  --template ./templates/my_custom_template.json\n```\n\nSee [docs/custom_templates.md](docs/custom_templates.md) for detailed documentation on creating and using custom templates.\n\n## Quick Start\n\n### Installation\n\n```bash\npip install -r requirements.txt\n```\n\n### Training a Model with the Unified Trainer\n\n```bash\npython train_splade_unified.py --train-file training_data.json --output-dir ./fine_tuned_splade --mixed-precision\n```\n\nThe unified trainer provides a comprehensive solution that uses tools from the `src/unified` folder, combining all advanced features in a single, cohesive interface. It offers:\n\n- Mixed precision training for better performance\n- Early stopping to prevent overfitting\n- Checkpointing for saving/resuming training\n- Training recovery options\n- Comprehensive logging and metrics tracking\n- Support for multiple hardware platforms (CUDA, MPS, CPU)\n\nSee [docs/unified_trainer.md](docs/unified_trainer.md) for detailed documentation and advanced options.\n\n### Using Task Runner with Enhanced Documentation\n\nWe provide extensively documented Taskfiles that simplify common operations and automatically handle virtual environment activation:\n\n```bash\n# Install Task runner: https://taskfile.dev/installation/\n\n# Generate training data with a custom template\ntask generate input_dir=./documents output_file=training.json template=legal language=de\n\n# Train a model with the generated data\ntask train train_file=training.json output_dir=./fine_tuned_splade\n\n# Generate language-specific data using OpenAI\ntask train:prepare-with-openai-lang folder=./documents model=gpt-4o lang=es template=legal\n```\n\nEach task comes with detailed documentation and examples. Use `task -l` to list all available tasks.\n\n### Interactive Search\n\n```bash\npython -m tests.code.test_queries --model-dir ./fine_tuned_splade --docs-file documents.json\n```\n\n## Documentation\n\nFor detailed documentation, see the [docs/README.md](docs/README.md) file.\n\nFor best practices on training and using SPLADE models, see [docs/best_practices.md](docs/best_practices.md).\n\nFor the unified trainer documentation, see [docs/unified_trainer.md](docs/unified_trainer.md).\n\nFor information about our CI/CD setup and GitHub Actions workflows, see [docs/ci-cd/github-actions.md](docs/ci-cd/github-actions.md).\n\nFor details on the Domain Distiller tool, see [docs/domain_distiller.md](docs/domain_distiller.md).\n\n## License\n\nSee the [LICENSE](LICENSE) file for more details.\n\n## Contact\n\n- GitHub: [https://github.com/gedankrayze/splade-model-trainer](https://github.com/gedankrayze/splade-model-trainer)\n- Email: info@gedankrayze.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgedankrayze%2Fsplade-model-trainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgedankrayze%2Fsplade-model-trainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgedankrayze%2Fsplade-model-trainer/lists"}