# LanceDB benchmark: Full-text and vector search performance

Code for the benchmark study described in this [blog post](https://thedataquarry.com/posts/embedded-db-3/).

[LanceDB](https://github.com/lancedb/lancedb) is an open source, embedded, developer-friendly vector database. Some of the key features that make it so valuable are listed below; many more are described on its GitHub repo.

* Incredibly lightweight (no DB servers to manage), because it runs entirely in-process with the application
* Extremely scalable from development to production
* Ability to perform full-text search (FTS), SQL search (via [DataFusion](https://github.com/apache/arrow-datafusion)) *and* ANN vector search
* Multi-modal data support (images, text, video, audio, point-clouds, etc.)
* Zero-copy (via [Arrow](https://github.com/apache/arrow-rs)) with automatic versioning of data on its native [Lance](https://github.com/lancedb/lance) storage format

The aim of this repo is to demonstrate the full-text and vector search features of LanceDB via an end-to-end benchmark, in which we carefully study query results and throughput.
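To give a sense of how the embedded workflow looks in practice, here is a minimal sketch using LanceDB's Python client for both search modes. The table name, field names and sample data are placeholders for illustration, not the exact ones used in the benchmark code, and the API calls reflect the Python client at the time of writing.

```python
import lancedb

# Connect to (or create) a local Lance dataset -- no DB server process involved
db = lancedb.connect("./lancedb_data")

# Hypothetical records: each row carries the raw text and its embedding vector
table = db.create_table(
    "wines",
    data=[
        {"id": 1, "description": "Aromas of ripe cherry and plum", "vector": [0.1] * 384},
        {"id": 2, "description": "Crisp, citrusy white with mineral notes", "vector": [0.2] * 384},
    ],
)

# Full-text search requires a Tantivy-backed FTS index on the text column
table.create_fts_index("description")
fts_hits = table.search("cherry plum", query_type="fts").limit(5).to_pandas()

# Vector (ANN) search: pass a query embedding of the same dimensionality
vec_hits = table.search([0.1] * 384).limit(5).to_pandas()
```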

## Dataset

The dataset used for this demo is the [Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews) dataset from Kaggle, containing ~130k wine reviews along with other metadata. The dataset is converted to a ZIP archive, and both the conversion code and the zipped data are provided in this repo for reference.
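If you just want to peek at the data without running the full pipeline, something like the sketch below works. The archive and file names are assumptions based on the Kaggle dataset and may differ from the exact files shipped in this repo.

```python
import json
import zipfile

# Hypothetical archive/file names; adjust to the data shipped with this repo
with zipfile.ZipFile("data/winemag-data-130k-v2.zip") as zf:
    with zf.open("winemag-data-130k-v2.json") as f:
        reviews = json.load(f)

print(len(reviews))               # ~130k review records
print(reviews[0]["description"])  # the text field searched in this benchmark
```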

## Comparison

Studying the performance of any tool in isolation is a challenge, so for the sake of comparison, an Elasticsearch workflow is provided in this repo. [Elasticsearch](https://github.com/elastic/elasticsearch) is a popular Lucene-based search engine that is widely used for full-text search (and, increasingly, for vector search), which makes it a meaningful baseline to compare LanceDB against.
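For reference, the Elasticsearch side of the comparison issues the same two kinds of queries through the official Python client. The sketch below is illustrative; the index name, field names and parameters are assumptions rather than the exact values used in this repo.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a locally running cluster

# Full-text (BM25) search over the review text
fts_hits = es.search(
    index="wines",
    query={"match": {"description": "cherry plum"}},
    size=5,
)

# Approximate kNN search over a dense_vector field (Lucene HNSW under the hood)
knn_hits = es.search(
    index="wines",
    knn={
        "field": "vector",
        "query_vector": [0.1] * 384,
        "k": 5,
        "num_candidates": 100,
    },
)
```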

## Setup

Install the dependencies in a virtual environment via `requirements.txt`.

```sh
# Set up the environment for the first time
python -m venv .venv  # requires Python 3.11+

# Activate the environment (needed in every new shell session)
source .venv/bin/activate

# Install the dependencies
python -m pip install -r requirements.txt
```

## Benchmark results

> [!NOTE]
> * The numbers below are from a 2022 M2 MacBook Pro with 16 GB of RAM
> * The search space comprises 129,971 wine review descriptions in either LanceDB or Elasticsearch
> * The queries are randomly sampled from a list of 10 example queries for FTS and vector search, and run for 10, 100, 1000 and 10000 random queries
> * The vector dimensionality for the embeddings is 384 (`BAAI/bge-small-en-v1.5`)
> * Vector search in Elasticsearch is based on Lucene's HNSW index, and in LanceDB on IVF-PQ (a sketch of the embedding and indexing steps follows this note)
> * The distance metric for vector search is cosine similarity in both DBs
> * The run times reported (and QPS computed) are an average over 3 runs
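The embedding and index-building steps are not part of the timed queries, but a minimal sketch of them is shown below for orientation. It assumes the `sentence-transformers` package for the `BAAI/bge-small-en-v1.5` model; the IVF-PQ parameters are illustrative rather than the exact values used in this repo.

```python
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim embeddings

descriptions = ["Aromas of ripe cherry and plum", "Crisp, citrusy white"]
vectors = model.encode(descriptions, normalize_embeddings=True)

db = lancedb.connect("./lancedb_data")
table = db.create_table(
    "wines",
    data=[{"description": d, "vector": v.tolist()} for d, v in zip(descriptions, vectors)],
)

# Build an IVF-PQ index with cosine distance; in practice the index is trained
# on the full ~130k-row table, not a toy sample like the one above
table.create_index(
    metric="cosine",
    num_partitions=256,   # number of IVF cells
    num_sub_vectors=96,   # PQ sub-vectors; must evenly divide the vector dimension (384)
)
```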

### Summary of results for 10,000 random queries:

Case | Elasticsearch (QPS) | LanceDB (QPS)
:---|---:|---:
FTS: Serial | 399.8 | **468.9**
FTS: Concurrent | **1539.0** | 528.9
Vector search: Serial | 11.9 | **54.0**
Vector search: Concurrent | 50.7 | **71.6**

### Discussion

* Via their Python clients, LanceDB is clearly faster than Elasticsearch in terms of QPS (queries per second) for vector search, in both the serial and concurrent scenarios, and is also faster for full-text search when the queries are run serially.
* Elasticsearch is faster **only** for the FTS use case in the concurrent scenario, likely because it uses a non-blocking async client (unlike LanceDB, for now).
* If an async (non-blocking) Python client becomes available for LanceDB in the future, its FTS throughput is expected to be even higher.

### Serial Benchmark

The serial benchmark shown below involves running queries sequentially in a synchronous `for` loop in Python. This isn't representative of a realistic production use case, but it is useful for understanding the performance of the underlying search engine in each case (Lucene for Elasticsearch and Tantivy for LanceDB).

More details on this will be discussed in a blog post.
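A rough sketch of how such a serial run can be timed is shown below. `run_query` is a hypothetical stand-in for whichever client call (LanceDB or Elasticsearch, FTS or vector) is being benchmarked; it is not a helper from this repo.

```python
import random
import time

def benchmark_serial(run_query, example_queries, num_queries):
    """Run `num_queries` randomly sampled queries one after another and report QPS."""
    sampled = [random.choice(example_queries) for _ in range(num_queries)]
    start = time.perf_counter()
    for query in sampled:
        run_query(query)  # blocking call into the search client
    elapsed = time.perf_counter() - start
    return elapsed, num_queries / elapsed  # (total seconds, queries per second)
```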

#### Full-text search (FTS)

Queries | Elasticsearch (sec)| Elasticsearch (QPS) | LanceDB (sec) | LanceDB (QPS)
:---:|:---:|:---:|:---:|:---:
10 | 0.0516 | **193.8** | 0.0518 | 193.0
100 | 0.2589 | 386.3 | 0.2383 | **419.7**
1000 | 2.5748 | 388.6 | 2.1759 | **459.3**
10000 | 25.0318 | 399.8 | 21.3196 | **468.9**

#### Vector search

Queries | Elasticsearch (sec)| Elasticsearch (QPS) | LanceDB (sec) | LanceDB (QPS)
:---:|:---:|:---:|:---:|:---:
10 | 0.8087 | 12.4 | 0.2158 | **46.3**
100 | 7.6020 | 13.1 | 1.6803 | **59.5**
1000 | 84.0086 | 11.9 | 16.7948 | **59.5**
10000 | 842.9494 | 11.9 | 185.0582 | **54.0**

### Concurrent Benchmark

The concurrent benchmark is designed to replicate a realistic use case for LanceDB or Elasticsearch: multiple queries arrive at the same time, and the REST API on top of the DB has to handle the requests asynchronously.
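On the Elasticsearch side, this pattern can be modelled with the official non-blocking `AsyncElasticsearch` client and `asyncio`. The sketch below is illustrative only; the index name, field names and query shape are assumptions, not the exact code in this repo.

```python
import asyncio
import random
import time
from elasticsearch import AsyncElasticsearch

async def benchmark_es_concurrent(example_queries, num_queries):
    es = AsyncElasticsearch("http://localhost:9200")  # assumes a local cluster
    sampled = [random.choice(example_queries) for _ in range(num_queries)]
    start = time.perf_counter()
    # Fire the requests concurrently; the event loop overlaps the network waits
    await asyncio.gather(*(
        es.search(index="wines", query={"match": {"description": q}}, size=10)
        for q in sampled
    ))
    elapsed = time.perf_counter() - start
    await es.close()
    return elapsed, num_queries / elapsed

# asyncio.run(benchmark_es_concurrent(queries, 10_000))
```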

> [!NOTE]
> * The concurrency in Elasticsearch is achieved through its async client
> * The concurrency in LanceDB is achieved through a pool of 4 worker threads via Python's `multiprocessing` library (a higher number of threads resulted in slower performance); see the sketch after this note.
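A minimal sketch of that worker-pool approach is shown below, assuming `multiprocessing.pool.ThreadPool` for the thread-based workers and the same hypothetical `run_query` helper as in the serial sketch.

```python
import random
import time
from multiprocessing.pool import ThreadPool

def benchmark_concurrent(run_query, example_queries, num_queries, num_workers=4):
    """Dispatch the sampled queries across a small pool of worker threads."""
    sampled = [random.choice(example_queries) for _ in range(num_queries)]
    start = time.perf_counter()
    # Threads share the in-process LanceDB table; 4 workers gave the best throughput here
    with ThreadPool(num_workers) as pool:
        pool.map(run_query, sampled)
    elapsed = time.perf_counter() - start
    return elapsed, num_queries / elapsed  # (total seconds, queries per second)
```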

#### Full-text search (FTS)

Queries | Elasticsearch (sec)| Elasticsearch (QPS) | LanceDB (sec) | LanceDB (QPS)
:---:|:---:|:---:|:---:|:---:
10 | 0.0350 | 285.7 | 0.0284 | **351.4**
100 | 0.1243 | **804.1** | 0.2049 | 487.8
1000 | 0.6972 | **1434.5** | 1.8980 | 526.8
10000 | 6.4948 | **1539.0** | 18.9136 | 528.9

#### Vector search

Queries | Elasticsearch (sec)| Elasticsearch (QPS) | LanceDB, 4 threads (sec) | LanceDB, 4 threads (QPS)
:---:|:---:|:---:|:---:|:---:
10 | 0.2896 | 34.5 | 0.1409 | **71.0**
100 | 2.5275 | 39.6 | 1.3367 | **74.8**
1000 | 20.4268 | 48.9 | 13.3158 | **75.1**
10000 | 197.2314 | 50.7 | 139.6330 | **71.6**