https://github.com/lancedb/lancedb-duckdb-demo
Using the Lance extension in DuckDB for querying, joining and materializing on tables in either system
- Host: GitHub
- URL: https://github.com/lancedb/lancedb-duckdb-demo
- Owner: lancedb
- License: apache-2.0
- Created: 2026-04-02T11:55:11.000Z (25 days ago)
- Default Branch: main
- Last Pushed: 2026-04-02T17:13:03.000Z (24 days ago)
- Last Synced: 2026-04-04T07:55:21.634Z (23 days ago)
- Topics: agent, date-engineering, duckdb, etl, lance, lancedb, retrieval, search
- Language: Python
- Homepage:
- Size: 78.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# LanceDB x DuckDB Demo
A hands-on example showing how LanceDB and DuckDB work together on multimodal data. The repo ingests the [Amazon Berkeley Objects](https://amazon-berkeley-objects.s3.amazonaws.com/index.html) (ABO) dataset into a LanceDB catalog with CLIP image embeddings and text embeddings, then uses DuckDB with the [Lance extension](https://github.com/lance-format/lance-duckdb) to query, join, and materialize results via SQL.
The LanceDB `products` table stores product metadata, image paths, a CLIP multimodal vector (`multimodal_vec`), and a text-semantic vector (`text_vec`). DuckDB attaches the Lance directory as a namespace and runs SQL directly on top of it: vector search, joins to local DuckDB tables, and aggregations that answer questions about the data.
## Setup
Requires Python 3.12+.
```bash
# Sync dependencies from pyproject.toml
uv sync
# Add dependencies as needed
uv add ...
```
### Download the ABO dataset
Download the listings metadata and small images from the [ABO dataset page](https://amazon-berkeley-objects.s3.amazonaws.com/index.html) and extract them into `data/`.
After extracting, the layout should look like:
```text
data/
  abo-listings/listings/metadata/    # listings_*.json.gz files
  abo-images-small/images/
    metadata/images.csv.gz           # image metadata
    small/                           # downscaled image files
```
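Before running ingestion, it can help to confirm that the extraction produced the layout above. The following is a minimal sketch, not part of the repo: the helper name `missing_paths` is hypothetical, and the nesting of `metadata/` and `small/` under `abo-images-small/images/` is an assumption read from the listing.

```python
from pathlib import Path

# Entries from the layout above, relative to the data root.
# Assumption: metadata/ and small/ sit under abo-images-small/images/.
EXPECTED = [
    "abo-listings/listings/metadata",
    "abo-images-small/images/metadata/images.csv.gz",
    "abo-images-small/images/small",
]


def missing_paths(data_root="data"):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("Layout looks complete.")
```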
## Usage
### Ingest products into LanceDB
Embeddings are computed and written to LanceDB in batches, so progress is saved incrementally.
```bash
# Full ingestion (~145K products)
uv run python ingest.py
# Smoke test with a subset
uv run python ingest.py --limit 200
# Larger batches (default is 32)
uv run python ingest.py --batch-size 128
```
The output LanceDB directory is `./abo-products-lance` by default (override with `--output-root`).
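The batch-and-write pattern behind incremental progress can be sketched as follows. This is an illustration under assumptions, not the code in `ingest.py`: `embed` and `write` are hypothetical callables standing in for the embedding model and the LanceDB append, and embedding only the `title` field is a simplification.

```python
from itertools import islice


def batched(items, batch_size=32):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


def ingest(products, embed, write, batch_size=32, limit=None):
    """Embed and persist one batch at a time; return the number of rows written.

    embed and write are hypothetical stand-ins for the model call and the
    table append. Each batch is written before the next one is embedded,
    so an interrupted run keeps the batches already written.
    """
    if limit is not None:
        products = islice(products, limit)
    written = 0
    for batch in batched(products, batch_size):
        rows = [{**p, "text_vec": embed(p["title"])} for p in batch]
        write(rows)
        written += len(rows)
    return written
```

Larger batches mean fewer writes and better model throughput, at the cost of more memory per step and more lost work if a batch fails.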
### Text-semantic search
Search the catalog by text using the `text_vec` embedding column:
```bash
uv run python text_search.py
```
This encodes the text query with the `intfloat/multilingual-e5-base` embedding model and runs a top-k vector search over the Lance table via DuckDB.
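To make "top-k vector search" concrete, here is a brute-force sketch in plain Python. It is a stand-in for what the Lance extension computes, not how `text_search.py` or the extension implements it, and the choice of cosine distance is an assumption. The E5 model family expects search queries to carry a `query: ` prefix; whether the script applies it is not shown here.

```python
import heapq
import math


def e5_query(text):
    """Prefix a search query the way E5 models expect."""
    return f"query: {text}"


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query_vec, rows, k=5):
    """rows: iterable of (item_id, vector).

    Returns the k closest rows as (distance, item_id), nearest first.
    This scans every row; a vector index avoids that on large tables.
    """
    scored = ((cosine_distance(query_vec, vec), item_id) for item_id, vec in rows)
    return heapq.nsmallest(k, scored)
```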
### Create a DuckDB sales table
Before running the image search, generate a local `sales.duckdb` file with synthetic sales rows linked to shoe products in the Lance catalog:
```bash
uv run python create_duckdb.py
```
This reads shoe `item_id`s from the Lance table and creates a `sales` table with 100 randomly assigned purchases. The image search script joins against this table.
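A rough sketch of how such a table could be generated is shown below. It is not the code in `create_duckdb.py`: the column names (`sale_id`, `item_id`, `quantity`, `sale_date`), the quantity range, and the 2025 date window are all assumptions made for illustration.

```python
import random
from datetime import date, timedelta


def make_sales(item_ids, n=100, seed=0):
    """Return n synthetic (sale_id, item_id, quantity, sale_date) rows.

    The schema and value ranges are hypothetical; a fixed seed keeps the
    output reproducible.
    """
    rng = random.Random(seed)
    start = date(2025, 1, 1)
    return [
        (
            sale_id,
            rng.choice(item_ids),
            rng.randint(1, 3),
            start + timedelta(days=rng.randrange(365)),
        )
        for sale_id in range(1, n + 1)
    ]


def write_sales(rows, path="sales.duckdb"):
    """Persist the rows to a DuckDB file, replacing any existing sales table."""
    import duckdb  # imported lazily so make_sales works without the package

    con = duckdb.connect(path)
    try:
        con.execute(
            "CREATE OR REPLACE TABLE sales "
            "(sale_id INTEGER, item_id VARCHAR, quantity INTEGER, sale_date DATE)"
        )
        con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    finally:
        con.close()
```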
### Image-semantic search with sales join
Search by visual concept using the CLIP `multimodal_vec` column and join results to the `sales` table:
```bash
uv run python image_search.py
```
This encodes the query `"beige shoes"` with CLIP, retrieves the nearest products, and joins them to `sales.duckdb`.
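The join step can be pictured with a small pure-Python sketch. It assumes inner-join semantics and a per-product aggregation, neither of which is confirmed by `image_search.py`; the function name and the row shapes are hypothetical.

```python
from collections import defaultdict


def join_hits_to_sales(hits, sales):
    """Inner-join vector-search hits to sales rows on item_id.

    hits:  (item_id, distance) pairs, nearest first.
    sales: (item_id, quantity) pairs.
    Returns (item_id, distance, purchases, units) for hits with at least
    one sale, preserving the nearest-first order of hits.
    """
    totals = defaultdict(lambda: [0, 0])  # item_id -> [purchases, units]
    for item_id, quantity in sales:
        totals[item_id][0] += 1
        totals[item_id][1] += quantity
    return [(i, d, *totals[i]) for i, d in hits if i in totals]
```

In SQL terms this corresponds to joining the search result to `sales` on `item_id` and grouping by product.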
### Query with the DuckDB CLI
You can also query the Lance table directly from the DuckDB CLI:
```bash
duckdb
```
```sql
INSTALL lance;
LOAD lance;
ATTACH './abo-products-lance' AS abo (TYPE LANCE);
SELECT item_id, title, brand, product_type
FROM abo.main.products
LIMIT 10;
```
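The same session can be driven from Python with the `duckdb` package. This is a sketch that mirrors the statements above rather than code from the repo; `attach_sql` and `preview_products` are hypothetical helper names. The path is escaped and inlined instead of bound as a parameter, on the assumption that `ATTACH` needs a literal path.

```python
def attach_sql(lance_dir, alias="abo"):
    """Build the ATTACH statement, escaping single quotes in the path."""
    escaped = str(lance_dir).replace("'", "''")
    return f"ATTACH '{escaped}' AS {alias} (TYPE LANCE)"


def preview_products(lance_dir="./abo-products-lance", limit=10):
    """Run the CLI session's statements and return the rows as tuples."""
    import duckdb  # lazy import: only needed when the query actually runs

    con = duckdb.connect()  # in-memory database
    try:
        con.execute("INSTALL lance")
        con.execute("LOAD lance")
        con.execute(attach_sql(lance_dir))
        return con.execute(
            "SELECT item_id, title, brand, product_type "
            f"FROM abo.main.products LIMIT {int(limit)}"
        ).fetchall()
    finally:
        con.close()
```

Calling `preview_products()` requires the ingested `./abo-products-lance` directory and network access the first time the extension is installed.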