https://github.com/jbellis/coherebench
- Host: GitHub
- URL: https://github.com/jbellis/coherebench
- Owner: jbellis
- Created: 2024-06-18T18:56:55.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-12-26T15:00:19.000Z (about 1 year ago)
- Last Synced: 2025-06-21T06:02:31.091Z (7 months ago)
- Language: Java
- Size: 142 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Benchmarking program that throws data from Cohere's public Wikipedia dataset at a local Cassandra node. This gives us a more realistic dataset than random vectors would.
# Installation
1. Edit `dataset_location` in `config.properties` to point to where you want the (360GB) dataset stored.
1. Edit `nodetool_path` in `config.properties` to match your Cassandra installation.
1. `pip install datasets`
1. `python download.py`
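
As a rough illustration of how the two settings above might be read and validated from Java, here is a minimal sketch. Only the key names `dataset_location` and `nodetool_path` come from this README; the `ConfigSketch` class, its methods, and the example paths are hypothetical and are not taken from the project's source.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of coherebench.
class ConfigSketch {
    // Loads properties from any Reader so the sketch can be tried without a file on disk.
    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns a required setting, failing early if it is missing or blank.
    static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing required setting: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // Example values only (assumptions); substitute your own paths.
        String example = "dataset_location=/data/cohere-wikipedia\n"
                       + "nodetool_path=/opt/cassandra/bin/nodetool\n";
        Properties props = load(new StringReader(example));
        System.out.println(require(props, "dataset_location"));
        System.out.println(require(props, "nodetool_path"));
    }
}
```

Failing fast on a missing key is a design choice for this sketch: with a 360GB download at stake, a misconfigured path is cheaper to catch before anything starts.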
# Running
## Insert Data
To insert data, use:
```bash
# Insert 1 million rows, skipping the first 1 million
CB_CMD=insert CB_INSERT_ROWS=1000000 CB_SKIP=1000000 mvn compile exec:exec@run
# Default: CB_INSERT_ROWS=10000000, CB_SKIP=0
```
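
The insert options are passed as environment variables with defaults. A minimal sketch of that pattern in Java follows; the variable names and default values are the ones listed above, while the `InsertOptionsSketch` class and its validation rules are assumptions made for illustration and may differ from what the benchmark actually does.

```java
import java.util.Map;

// Hypothetical illustration of env-var parsing with defaults; not coherebench's code.
class InsertOptionsSketch {
    final long insertRows;
    final long skip;

    InsertOptionsSketch(long insertRows, long skip) {
        this.insertRows = insertRows;
        this.skip = skip;
    }

    // Takes the environment as a Map so it can be exercised without real env vars.
    static InsertOptionsSketch fromEnv(Map<String, String> env) {
        return new InsertOptionsSketch(
                parseLong(env, "CB_INSERT_ROWS", 10_000_000L), // default from the README
                parseLong(env, "CB_SKIP", 0L));                // default from the README
    }

    // Falls back to the default when the variable is unset or blank.
    static long parseLong(Map<String, String> env, String name, long defaultValue) {
        String raw = env.get(name);
        if (raw == null || raw.isBlank()) {
            return defaultValue;
        }
        long value = Long.parseLong(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(name + " must be non-negative, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        InsertOptionsSketch opts = fromEnv(System.getenv());
        System.out.println("insertRows=" + opts.insertRows + " skip=" + opts.skip);
    }
}
```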
## Query Data
To run queries, use:
```bash
# Run simple ANN queries
CB_CMD=query CB_QUERY_TYPE=simple mvn compile exec:exec@run
# Run restrictive ANN queries (language='sq') corresponding to 1% of data
CB_CMD=query CB_QUERY_TYPE=restrictive mvn compile exec:exec@run
# Run unrestrictive ANN queries (language='en') corresponding to 99% of data
CB_CMD=query CB_QUERY_TYPE=unrestrictive mvn compile exec:exec@run
# Default: CB_QUERY_TYPE=simple, CB_QUERIES=10000 (number of queries to run)
```
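
To make the three query types concrete, the sketch below maps each one to the shape of CQL statement it might produce, using Cassandra's `ORDER BY ... ANN OF` vector-search syntax. The language filters (`'sq'`, `'en'`) and the `simple` default come from the comments above; the `bench.wikipedia` table, the `id`, `language`, and `embedding` columns, and the `QueryTypeSketch` class are hypothetical, so treat the statement text as an assumption rather than the queries the benchmark really issues.

```java
import java.util.Locale;

// Hypothetical mapping from CB_QUERY_TYPE to a CQL string; not coherebench's code.
class QueryTypeSketch {
    enum QueryType {
        SIMPLE(null),        // pure ANN search, no filter
        RESTRICTIVE("sq"),   // filter matching roughly 1% of the data
        UNRESTRICTIVE("en"); // filter matching roughly 99% of the data

        final String language;

        QueryType(String language) {
            this.language = language;
        }
    }

    // Unset or blank input falls back to SIMPLE, mirroring the documented default.
    static QueryType parse(String raw) {
        if (raw == null || raw.isBlank()) {
            return QueryType.SIMPLE;
        }
        return QueryType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    // Builds the statement text; the query vector is left as a bind marker.
    static String toCql(QueryType type, int limit) {
        StringBuilder cql = new StringBuilder("SELECT id FROM bench.wikipedia");
        if (type.language != null) {
            cql.append(" WHERE language = '").append(type.language).append("'");
        }
        cql.append(" ORDER BY embedding ANN OF ? LIMIT ").append(limit);
        return cql.toString();
    }

    public static void main(String[] args) {
        for (QueryType type : QueryType.values()) {
            System.out.println(type + ": " + toCql(type, 10));
        }
    }
}
```

In a real Cassandra schema, combining a `WHERE` filter with `ANN OF` would also require a suitable index on the filtered column.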
# Comparing to PostgreSQL
There's been some bitrot here; the PostgreSQL flavor needs to be updated to match the Cassandra (C*) one.