https://github.com/jbellis/coherebench

Host: GitHub
URL: https://github.com/jbellis/coherebench
Owner: jbellis
Created: 2024-06-18T18:56:55.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2024-12-26T15:00:19.000Z (about 1 year ago)
Last Synced: 2025-06-21T06:02:31.091Z (7 months ago)
Language: Java
Size: 142 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

Benchmarking program that throws data from Cohere's public Wikipedia dataset at a local Cassandra node. This gives us a more realistic dataset than random vectors would.

# Installation

1. Edit `dataset_location` in `config.properties` to point to where you want the dataset stored (about 360 GB).
1. Edit `nodetool_path` in `config.properties`.
1. `pip install datasets`
1. `python download.py`
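
For checking the two edited settings before starting the long download, a minimal sketch of reading a Java-style properties file from Python is shown here. It is not taken from `download.py`, whose actual logic is not shown in this README; `load_properties` is a hypothetical helper, and it handles only plain `key=value` lines (no line continuations or escapes).

```python
from pathlib import Path


def load_properties(path):
    """Parse simple key=value lines from a Java-style .properties file.

    Hypothetical helper for illustration only: it skips blank lines and
    comment lines starting with '#' or '!', and ignores lines without '='.
    """
    props = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props
```

Calling `load_properties("config.properties")` and printing the `dataset_location` and `nodetool_path` entries is a quick way to confirm both keys are set as intended.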

# Running

## Insert Data
To insert data, use:

```bash
# Insert 1 million rows, skipping first 1000000
CB_CMD=insert CB_INSERT_ROWS=1000000 CB_SKIP=1000000 mvn compile exec:exec@run

# Default: CB_INSERT_ROWS=10000000, CB_SKIP=0
```
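
To make the interaction of `CB_SKIP` and `CB_INSERT_ROWS` concrete, here is a small Python sketch of the row window they describe. The benchmark itself is Java and its code is not shown here, so this illustrates the semantics stated in the comments above rather than the actual implementation; `insert_window` is a hypothetical name.

```python
import os
from itertools import islice


def insert_window(rows, env=os.environ):
    """Yield the rows selected by CB_SKIP and CB_INSERT_ROWS.

    Illustration only: skips the first CB_SKIP rows (default 0), then
    yields at most CB_INSERT_ROWS rows (default 10000000).
    """
    skip = int(env.get("CB_SKIP", "0"))
    count = int(env.get("CB_INSERT_ROWS", "10000000"))
    return islice(rows, skip, skip + count)
```

With `CB_INSERT_ROWS=1000000` and `CB_SKIP=1000000`, this selects the second million rows of the dataset, so successive runs can load adjacent slices without overlap.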

## Query Data
To run queries, use:

```bash
# Run simple ANN queries
CB_CMD=query CB_QUERY_TYPE=simple mvn compile exec:exec@run

# Run restrictive ANN queries (language='sq') corresponding to 1% of data
CB_CMD=query CB_QUERY_TYPE=restrictive mvn compile exec:exec@run

# Run unrestrictive ANN queries (language='en') corresponding to 99% of data
CB_CMD=query CB_QUERY_TYPE=unrestrictive mvn compile exec:exec@run

# Default: CB_QUERY_TYPE=simple, CB_QUERIES=10000 (number of queries to run)
```
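
The three query types differ only in the language filter applied alongside the ANN search. The Python sketch below summarizes that mapping and the documented defaults. It is an illustration under the assumption that the settings are read from environment variables as in the commands above; `query_settings` and `LANGUAGE_FILTER` are hypothetical names, not part of the Java program.

```python
import os

# Language filter per query type, as described in the comments above.
# None means no language restriction.
LANGUAGE_FILTER = {
    "simple": None,
    "restrictive": "sq",    # roughly 1% of the data
    "unrestrictive": "en",  # roughly 99% of the data
}


def query_settings(env=os.environ):
    """Resolve CB_QUERY_TYPE and CB_QUERIES with their documented defaults."""
    query_type = env.get("CB_QUERY_TYPE", "simple")
    if query_type not in LANGUAGE_FILTER:
        raise ValueError(f"unknown CB_QUERY_TYPE: {query_type!r}")
    return {
        "type": query_type,
        "language": LANGUAGE_FILTER[query_type],
        "queries": int(env.get("CB_QUERIES", "10000")),
    }
```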

# Comparing to PostgreSQL

There's been some bitrot here: the PostgreSQL flavor needs to be updated to match the Cassandra (C*) one.
