https://github.com/jbellis/colbert-astra
- Host: GitHub
- URL: https://github.com/jbellis/colbert-astra
- Owner: jbellis
- Created: 2024-02-03T03:23:22.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-09-12T14:38:54.000Z (over 1 year ago)
- Last Synced: 2025-02-16T14:27:06.946Z (11 months ago)
- Language: Python
- Size: 47.9 KB
- Stars: 11
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# ColBERT on Astra #
A proof of concept (POC) of ColBERT search, compared with vanilla single-vector DPR.
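For context on the comparison: DPR-style retrieval collapses a passage into one dense vector and scores it with a single similarity, while ColBERT keeps one embedding per token and scores with late interaction (MaxSim). A minimal numpy sketch of the two scoring functions, for illustration only (not the repo's code):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT late interaction: for each query token embedding, take the
    max cosine similarity over the document's token embeddings, then sum."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                        # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # MaxSim per query token, summed

def dense_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Vanilla DPR-style scoring: one cosine similarity between a single
    query vector and a single document vector (e.g. ada002's 1536 dims)."""
    return float(query_vec @ doc_vec
                 / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))
```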
# Requirements #
* Assumes you have Cassandra (the vsearch branch) running locally. It should "just work" with Astra after minor changes to `db.py`.
* Requires a dataset with DPR ada002 embeddings already computed; this code does not compute them (though adding that would only take a few lines).
* Download the ColBERT model from https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz and extract it to the `checkpoints/` subdirectory (a loading sketch follows this list).
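Once extracted, the checkpoint can be loaded with the colbert-ai package. A minimal sketch, assuming colbertv2 defaults (the maxlen values below are assumptions, not this repo's configuration):

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Assumed token limits; colbert-astra may configure these differently.
config = ColBERTConfig(doc_maxlen=220, query_maxlen=32)
checkpoint = Checkpoint("checkpoints/colbertv2.0", colbert_config=config)

# Encode a query into a matrix of per-token embeddings.
q = checkpoint.queryFromText(["who built the pyramids?"])
print(q.shape)  # (1, query_maxlen, 128) for colbertv2.0
```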
# Usage #
1. `cqlsh < create.cqlsh`
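The actual schema lives in `create.cqlsh`. As a rough illustration of the two-table shape this POC implies (one dense ada002 vector per chunk for DPR, one 128-dim vector per token for ColBERT), here is a hypothetical equivalent via the Python driver; every name, keyspace, and dimension below is an assumption, so defer to the real file:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS colbert
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""")

# One row per chunk, holding the single dense DPR/ada002 vector.
session.execute("""
    CREATE TABLE IF NOT EXISTS colbert.chunks (
        title text, part int, content text,
        ada002_embedding vector<float, 1536>,
        PRIMARY KEY (title, part))""")

# Many rows per chunk: one 128-dim ColBERT vector per token.
session.execute("""
    CREATE TABLE IF NOT EXISTS colbert.colbert_embeddings (
        title text, part int, embedding_id int,
        bert_embedding vector<float, 128>,
        PRIMARY KEY (title, part, embedding_id))""")

# ANN index so query-token vectors can be searched efficiently.
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS colbert_ann
    ON colbert.colbert_embeddings (bert_embedding)
    USING 'StorageAttachedIndex'""")
```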
2a. Hack up `compute-and-load.py` to load your chunks. Currently it expects JSON files that look like this:
    {
        'title': $title,
        '0': { 'content': $raw_text, 'embedding': $ada002_embedding },
        '1': { 'content': $raw_text, 'embedding': $ada002_embedding },
        ...
    }
If you don't have pre-chunked documents, or you don't have (or don't want to store) a single dense embedding for comparison, adjust it accordingly.
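Under the assumed schema above, the loading half amounts to walking that JSON layout and inserting one row per chunk. A sketch (the function name and keyspace are hypothetical):

```python
import json

def load_document(session, path):
    """Insert one chunks row per numbered entry in the JSON file."""
    with open(path) as f:
        doc = json.load(f)
    title = doc.pop("title")
    for part, chunk in doc.items():  # remaining keys are chunk indices
        session.execute(
            "INSERT INTO colbert.chunks (title, part, content, ada002_embedding) "
            "VALUES (%s, %s, %s, %s)",
            (title, int(part), chunk["content"], chunk["embedding"]))
```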
2b. Alternatively, hack up `compute.py` and `load.py` instead: `compute` computes the ColBERT embeddings and augments the JSON file with them, and `load` sends those to Cassandra. I split these because I wanted to compute the embeddings on a fast GPU machine.
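A sketch of what the compute half might look like, using the checkpoint loaded earlier; the `colbert_embeddings` key is an assumed name for the augmented field:

```python
import json

def augment_with_colbert(checkpoint, path):
    """Add per-token ColBERT embeddings to each chunk in a JSON file."""
    with open(path) as f:
        doc = json.load(f)
    for key, chunk in doc.items():
        if key == "title":
            continue
        # docFromText returns (1, n_tokens, 128); take the single document.
        embeddings = checkpoint.docFromText([chunk["content"]])[0]
        chunk["colbert_embeddings"] = embeddings.tolist()
    with open(path, "w") as f:
        json.dump(doc, f)
```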
3. Run `python serve_httpy.py` and navigate to http://localhost:5000.
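Behind the web page, the query path follows the usual two-stage ColBERT retrieval: ANN-search the per-token table with each query token's vector to collect candidate chunks, then rank the candidates with MaxSim. A sketch against the assumed schema above, not the server's actual code:

```python
def search(session, checkpoint, query, per_token_limit=10):
    """Gather candidate chunks via per-token ANN search."""
    query_vecs = checkpoint.queryFromText([query])[0].tolist()
    candidates = set()
    for qv in query_vecs:
        rows = session.execute(
            "SELECT title, part FROM colbert.colbert_embeddings "
            "ORDER BY bert_embedding ANN OF %s LIMIT %s",
            (qv, per_token_limit))
        candidates.update((row.title, row.part) for row in rows)
    # Second stage: fetch each candidate's full token-embedding matrix and
    # rank with a MaxSim score like the numpy sketch near the top.
    return candidates
```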