{"id":31831253,"url":"https://github.com/raphaelsty/neural-cherche","last_synced_at":"2025-10-11T21:48:49.320Z","repository":{"id":186224381,"uuid":"674827652","full_name":"raphaelsty/neural-cherche","owner":"raphaelsty","description":"Neural Search","archived":false,"fork":false,"pushed_at":"2025-03-11T08:54:24.000Z","size":3252,"stargazers_count":363,"open_issues_count":9,"forks_count":18,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-10-08T19:23:54.778Z","etag":null,"topics":["colbert","google","language-model","neural-search","semantic-search","sparseembed","splade","stanford","transformers"],"latest_commit_sha":null,"homepage":"https://raphaelsty.github.io/neural-cherche/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raphaelsty.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-04T22:25:42.000Z","updated_at":"2025-09-11T16:04:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"4d1e20a4-8e2a-4b67-b038-e17046c20b7d","html_url":"https://github.com/raphaelsty/neural-cherche","commit_stats":null,"previous_names":["raphaelsty/sparsembed","raphaelsty/neural-cherche"],"tags_count":18,"template":false,"template_full_name":null,"purl":"pkg:github/raphaelsty/neural-cherche","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelsty%2Fneural-cherche","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelsty%2Fneural-cherche/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelsty%2Fneural-cherche/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelsty%2Fneural-cherche/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raphaelsty","download_url":"https://codeload.github.com/raphaelsty/neural-cherche/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelsty%2Fneural-cherche/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279008826,"owners_count":26084518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["colbert","google","language-model","neural-search","semantic-search","sparseembed","splade","stanford","transformers"],"created_at":"2025-10-11T21:48:48.440Z","updated_at":"2025-10-11T21:48:49.315Z","avatar_url":"https://github.com/raphaelsty.png","language":"Python","readme":"\u003cdiv 
align=\"center\"\u003e\n  \u003ch1\u003eNeural-Cherche\u003c/h1\u003e\n  \u003cp\u003eNeural Search\u003c/p\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\u003cimg width=500 src=\"docs/img/logo.png\"/\u003e\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003c!-- Documentation --\u003e\n  \u003ca href=\"https://raphaelsty.github.io/neural-cherche/\"\u003e\u003cimg src=\"https://img.shields.io/website?label=Documentation\u0026style=flat-square\u0026url=https%3A%2F%2Fraphaelsty.github.io/neural-cherche/%2F\" alt=\"documentation\"\u003e\u003c/a\u003e\n  \u003c!-- License --\u003e\n  \u003ca href=\"https://opensource.org/licenses/MIT\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-blue.svg?style=flat-square\" alt=\"license\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\nNeural-Cherche is a library designed to fine-tune neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. Neural-Cherche also provide classes to run efficient inference on a fine-tuned retriever or ranker. Neural-Cherche aims to offer a straightforward and effective method for fine-tuning and utilizing neural search models in both offline and online settings. It also enables users to save all computed embeddings to prevent redundant computations.\n\nNeural-Cherche is compatible with CPU, GPU and MPS devices. We can fine-tune ColBERT from any\nSentence Transformer pre-trained checkpoint. Splade and SparseEmbed are more tricky to fine-tune and need a MLM pre-trained model.\n\n## Installation\n\nWe can install neural-cherche using:\n\n```\npip install neural-cherche\n```\n\nIf we plan to evaluate our model while training install:\n\n```\npip install \"neural-cherche[eval]\"\n```\n\n## Documentation\n\nThe complete documentation is available [here](https://raphaelsty.github.io/neural-cherche/).\n\n## Quick Start\n\nYour training dataset must be made out of triples `(anchor, positive, negative)` where anchor is a query, positive is a document that is directly linked to the anchor and negative is a document that is not relevant for the anchor.\n\n```python\nX = [\n    (\"anchor 1\", \"positive 1\", \"negative 1\"),\n    (\"anchor 2\", \"positive 2\", \"negative 2\"),\n    (\"anchor 3\", \"positive 3\", \"negative 3\"),\n]\n```\n\nAnd here is how to fine-tune ColBERT from a Sentence Transformer pre-trained checkpoint using neural-cherche:\n\n```python\nimport torch\n\nfrom neural_cherche import models, utils, train\n\nmodel = models.ColBERT(\n    model_name_or_path=\"raphaelsty/neural-cherche-colbert\",\n    device=\"cuda\" if torch.cuda.is_available() else \"cpu\" # or mps\n)\n\noptimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)\n\nX = [\n    (\"query\", \"positive document\", \"negative document\"),\n    (\"query\", \"positive document\", \"negative document\"),\n    (\"query\", \"positive document\", \"negative document\"),\n]\n\nfor step, (anchor, positive, negative) in enumerate(utils.iter(\n        X,\n        epochs=1, # number of epochs\n        batch_size=8, # number of triples per batch\n        shuffle=True\n    )):\n\n    loss = train.train_colbert(\n        model=model,\n        optimizer=optimizer,\n        anchor=anchor,\n        positive=positive,\n        negative=negative,\n        step=step,\n        gradient_accumulation_steps=50,\n    )\n\n    \n    if (step + 1) % 1000 == 0:\n        # Save the model every 1000 steps\n        model.save_pretrained(\"checkpoint\")\n```\n\n## Retrieval\n\nHere is how to use the fine-tuned ColBERT model to re-rank 
## Retrieval

Here is how to use the fine-tuned ColBERT model to re-rank documents:

```python
import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux is in Southwestern France."},
]

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)
```

Now we can retrieve documents using the fine-tuned model:

```python
queries = ["Paris", "Montreal", "Bordeaux"]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

ranker_queries_embeddings = ranker.encode_queries(
    queries=queries,
)

candidates = retriever(
    queries_embeddings=queries_embeddings,
    batch_size=32,
    k=100,  # number of documents to retrieve
)

# Compute the embeddings of the candidates with the ranker model.
# Note that we could also pre-compute all the embeddings.
ranker_documents_embeddings = ranker.encode_candidates_documents(
    candidates=candidates,
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)

scores
```

```python
[[{'id': 0, 'similarity': 22.825355529785156},
  {'id': 1, 'similarity': 11.201947212219238},
  {'id': 2, 'similarity': 10.748161315917969}],
 [{'id': 1, 'similarity': 23.21628189086914},
  {'id': 0, 'similarity': 9.9658203125},
  {'id': 2, 'similarity': 7.308732509613037}],
 [{'id': 1, 'similarity': 6.4031805992126465},
  {'id': 0, 'similarity': 5.601611137390137},
  {'id': 2, 'similarity': 5.599479675292969}]]
```
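The ranker returns, for each query, a list of `{key: ..., 'similarity': ...}` entries sorted by decreasing similarity. To display readable results, the ids can be mapped back to the indexed documents. A minimal sketch, assuming the ids in `scores` match the values of the `"id"` key of the `documents` above:

```python
# Build a lookup from document id to document, using the same "id" key
# that was passed to the retriever and the ranker.
documents_index = {document["id"]: document for document in documents}

for query, query_scores in zip(queries, scores):
    print(query)
    for match in query_scores:
        document = documents_index.get(match["id"])
        label = document["title"] if document is not None else match["id"]
        print(f"  {label}: {match['similarity']:.2f}")
```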
Neural-Cherche provides `SparseEmbed`, `SPLADE`, `TFIDF`, and `BM25` retrievers, as well as a `ColBERT` ranker which can be used to re-order the output of a retriever. For more information, please refer to the [documentation](https://raphaelsty.github.io/neural-cherche/).

### Pre-trained Models

We provide pre-trained checkpoints specifically designed for neural-cherche: [raphaelsty/neural-cherche-sparse-embed](https://huggingface.co/raphaelsty/neural-cherche-sparse-embed) and [raphaelsty/neural-cherche-colbert](https://huggingface.co/raphaelsty/neural-cherche-colbert). These checkpoints are fine-tuned on a subset of the MS MARCO dataset and would benefit from being fine-tuned on your specific dataset. You can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint to fit your specific language. You should use an MLM-based checkpoint to fine-tune SparseEmbed.

Evaluation on the SciFact dataset:

| Model | HuggingFace Checkpoint | ndcg@10 | hits@10 | hits@1 |
|---|---|---|---|---|
| TfIdf | - | 0.62 | 0.86 | 0.50 |
| BM25 | - | 0.69 | 0.92 | 0.56 |
| SparseEmbed | raphaelsty/neural-cherche-sparse-embed | 0.62 | 0.87 | 0.48 |
| Sentence Transformer | sentence-transformers/all-mpnet-base-v2 | 0.66 | 0.89 | 0.53 |
| ColBERT | raphaelsty/neural-cherche-colbert | **0.70** | **0.92** | **0.58** |
| TfIdf Retriever + ColBERT Ranker | raphaelsty/neural-cherche-colbert | **0.71** | **0.94** | **0.59** |
| BM25 Retriever + ColBERT Ranker | raphaelsty/neural-cherche-colbert | **0.72** | **0.95** | **0.59** |
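Similar metrics can be computed on your own data; the `neural-cherche[eval]` extra installed above provides the library's own evaluation tooling. As a plain-Python illustration, here is a minimal hits@k sketch reusing `queries` and `scores` from the Retrieval section, where the `hits_at_k` helper and the `qrels` relevance judgments are hypothetical:

```python
def hits_at_k(scores: list, queries: list, qrels: dict, k: int) -> float:
    """Fraction of queries with at least one relevant document in the top-k.

    Assumes each per-query list of scores is sorted by decreasing similarity.
    """
    hits = 0
    for query, query_scores in zip(queries, scores):
        relevant = qrels.get(query, set())
        top_k = [match["id"] for match in query_scores[:k]]
        if any(document_id in relevant for document_id in top_k):
            hits += 1
    return hits / len(queries)


# Hypothetical relevance judgments for the toy corpus above.
qrels = {
    "Paris": {"doc1"},
    "Montreal": {"doc2"},
    "Bordeaux": {"doc3"},
}

print(hits_at_k(scores=scores, queries=queries, qrels=qrels, k=1))
```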
### Neural-Cherche Contributors

- [Benjamin Clavié](https://github.com/bclavie)
- [Arthur Satouf](https://github.com/arthur-75)

## References

- *[SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://arxiv.org/abs/2107.05720)* authored by Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2021.

- *[SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval](https://arxiv.org/abs/2109.10086)* authored by Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2022.

- *[SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval](https://research.google/pubs/pub52289/)* authored by Weize Kong, Jeffrey M. Dudek, Cheng Li, Mingyang Zhang, and Mike Bendersky, SIGIR 2023.

- *[ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)* authored by Omar Khattab, Matei Zaharia, SIGIR 2020.

## License

This Python library is licensed under the MIT open-source license. The SPLADE model is licensed by its authors for non-commercial use only, while SparseEmbed and ColBERT are fully open source, including for commercial use.