# Neural-Cherche

Neural Search





Neural-Cherche is a library designed to fine-tune neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. Neural-Cherche also provides classes to run efficient inference with a fine-tuned retriever or ranker. It aims to offer a straightforward and effective way to fine-tune and use neural search models in both offline and online settings, and it lets users save all computed embeddings to avoid redundant computations.

Neural-Cherche is compatible with CPU, GPU and MPS devices. We can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint. Splade and SparseEmbed are trickier to fine-tune and require an MLM pre-trained model.
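
For instance, a device string covering all three back-ends can be built with standard PyTorch checks (a minimal sketch using only `torch`, not any Neural-Cherche API):

```python
import torch

# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
```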

## Installation

We can install neural-cherche using:

```
pip install neural-cherche
```

If we plan to evaluate our model while training, install the evaluation extra:

```
pip install "neural-cherche[eval]"
```
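
Neural-Cherche's own evaluation helpers are covered in the documentation. As a standalone illustration of the metrics reported in the benchmark table below (ndcg@10, hits@10, hits@1), here is a sketch using the `ranx` library; whether `ranx` ships with the `eval` extra is an assumption on our part, so install it separately if needed:

```python
from ranx import Qrels, Run, evaluate

# Relevance judgments: query id -> {document id: relevance}.
qrels = Qrels({"query_1": {"doc1": 1}})

# Retrieval run: query id -> {document id: score}.
run = Run({"query_1": {"doc1": 0.9, "doc2": 0.3}})

# Compute the same metrics as in the benchmark table below.
evaluate(qrels, run, ["ndcg@10", "hits@10", "hits@1"])
```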

## Documentation

The complete documentation is available [here](https://raphaelsty.github.io/neural-cherche/).

## Quick Start

Your training dataset must be made of triples `(anchor, positive, negative)`, where the anchor is a query, the positive is a document relevant to the anchor, and the negative is a document that is not relevant to the anchor.

```python
X = [
("anchor 1", "positive 1", "negative 1"),
("anchor 2", "positive 2", "negative 2"),
("anchor 3", "positive 3", "negative 3"),
]
```

And here is how to fine-tune ColBERT from a Sentence Transformer pre-trained checkpoint using neural-cherche:

```python
import torch

from neural_cherche import models, utils, train

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
    X,
    epochs=1,  # number of epochs
    batch_size=8,  # number of triples per batch
    shuffle=True,
)):
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )

    if (step + 1) % 1000 == 0:
        # Save the model every 1000 steps.
        model.save_pretrained("checkpoint")
```
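
Once training is done, the weights saved in the `checkpoint` directory can be loaded back with the same constructor; this assumes `model_name_or_path` accepts a local directory as well as a Hugging Face Hub id, which is the usual convention:

```python
import torch

from neural_cherche import models

# Reload the fine-tuned weights written by model.save_pretrained("checkpoint").
model = models.ColBERT(
    model_name_or_path="checkpoint",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)
```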

## Retrieval

Here is how to retrieve documents with a BM25 retriever and re-rank them with the fine-tuned ColBERT model:

```python
import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
{"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
{"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
{"id": "doc3", "title": "Bordeaux", "text": "Bordeaux in Southwestern France."},
]

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)
```

Now we can retrieve candidate documents with BM25 and re-rank them with the fine-tuned ColBERT model:

```python
queries = ["Paris", "Montreal", "Bordeaux"]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

ranker_queries_embeddings = ranker.encode_queries(
    queries=queries,
)

candidates = retriever(
    queries_embeddings=queries_embeddings,
    batch_size=32,
    k=100,  # number of documents to retrieve
)

# Compute embeddings of the candidates with the ranker model.
# Note: we could also pre-compute all the embeddings.
ranker_documents_embeddings = ranker.encode_candidates_documents(
    candidates=candidates,
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)

scores
```

```python
[[{'id': 0, 'similarity': 22.825355529785156},
  {'id': 1, 'similarity': 11.201947212219238},
  {'id': 2, 'similarity': 10.748161315917969}],
 [{'id': 1, 'similarity': 23.21628189086914},
  {'id': 0, 'similarity': 9.9658203125},
  {'id': 2, 'similarity': 7.308732509613037}],
 [{'id': 1, 'similarity': 6.4031805992126465},
  {'id': 0, 'similarity': 5.601611137390137},
  {'id': 2, 'similarity': 5.599479675292969}]]
```
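
Since Neural-Cherche lets us reuse computed embeddings, a simple way to avoid re-encoding the corpus between runs is to persist them ourselves. This is a sketch using plain `pickle` rather than any Neural-Cherche API, and the file name is only illustrative:

```python
import pickle

# Save the pre-computed document embeddings to disk.
with open("ranker_documents_embeddings.pkl", "wb") as f:
    pickle.dump(ranker_documents_embeddings, f)

# Later, load them back instead of calling encode_candidates_documents again.
with open("ranker_documents_embeddings.pkl", "rb") as f:
    ranker_documents_embeddings = pickle.load(f)
```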

Neural-Cherche provides `SparseEmbed`, `SPLADE`, `TFIDF` and `BM25` retrievers, as well as a `ColBERT` ranker which can be used to re-order the output of a retriever. For more information, please refer to the [documentation](https://raphaelsty.github.io/neural-cherche/).
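
For example, the `BM25` retriever above could be swapped for the `TFIDF` retriever. This is a hypothetical sketch that assumes `retrieve.TfIdf` follows the same `key`/`on` pattern as `retrieve.BM25`; check the documentation for the exact class name and constructor arguments:

```python
from neural_cherche import retrieve

# Hypothetical: a TF-IDF retriever configured like the BM25 retriever above.
retriever = retrieve.TfIdf(
    key="id",
    on=["title", "text"],
)
```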

### Pre-trained Models

We provide pre-trained checkpoints specifically designed for neural-cherche: [raphaelsty/neural-cherche-sparse-embed](https://huggingface.co/raphaelsty/neural-cherche-sparse-embed) and [raphaelsty/neural-cherche-colbert](https://huggingface.co/raphaelsty/neural-cherche-colbert). These checkpoints are fine-tuned on a subset of the MS MARCO dataset and would benefit from further fine-tuning on your specific dataset. You can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint in order to fit your specific language. You should use an MLM-based checkpoint to fine-tune SparseEmbed.
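
For example, to adapt ColBERT to another language or domain, we can point the constructor shown in the Quick Start at a Sentence Transformer checkpoint. The checkpoint name below is taken from the benchmark table and is only an illustration:

```python
import torch

from neural_cherche import models

# Initialize ColBERT from a Sentence Transformer checkpoint before fine-tuning it.
model = models.ColBERT(
    model_name_or_path="sentence-transformers/all-mpnet-base-v2",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)
```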




Benchmark on the SciFact dataset:

| Model                            | HuggingFace Checkpoint                   | ndcg@10 | hits@10 | hits@1 |
|----------------------------------|------------------------------------------|---------|---------|--------|
| TfIdf                            | -                                        | 0.62    | 0.86    | 0.50   |
| BM25                             | -                                        | 0.69    | 0.92    | 0.56   |
| SparseEmbed                      | raphaelsty/neural-cherche-sparse-embed   | 0.62    | 0.87    | 0.48   |
| Sentence Transformer             | sentence-transformers/all-mpnet-base-v2  | 0.66    | 0.89    | 0.53   |
| ColBERT                          | raphaelsty/neural-cherche-colbert        | 0.70    | 0.92    | 0.58   |
| TfIdf retriever + ColBERT ranker | raphaelsty/neural-cherche-colbert        | 0.71    | 0.94    | 0.59   |
| BM25 retriever + ColBERT ranker  | raphaelsty/neural-cherche-colbert        | 0.72    | 0.95    | 0.59   |

### Neural-Cherche Contributors

- [Benjamin Clavié](https://github.com/bclavie)
- [Arthur Satouf](https://github.com/arthur-75)

## References

- *[SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://arxiv.org/abs/2107.05720)* authored by Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2021.

- *[SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval](https://arxiv.org/abs/2109.10086)* authored by Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2022.

- *[SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval](https://research.google/pubs/pub52289/)* authored by Weize Kong, Jeffrey M. Dudek, Cheng Li, Mingyang Zhang, and Mike Bendersky, SIGIR 2023.

- *[ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)* authored by Omar Khattab, Matei Zaharia, SIGIR 2020.

## License

This Python library is licensed under the MIT open-source license. The SPLADE model is licensed by its authors for non-commercial use only, while SparseEmbed and ColBERT are fully open-source, including for commercial usage.