https://github.com/pisa-engine/ConstBERT
- Host: GitHub
- URL: https://github.com/pisa-engine/ConstBERT
- Owner: pisa-engine
- Created: 2025-01-17T10:19:35.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-05T19:12:28.000Z (about 1 year ago)
- Last Synced: 2026-02-15T22:57:53.716Z (4 months ago)
- Size: 410 KB
- Stars: 12
- Watchers: 5
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-vector-databases - ConstBERT - Novel approach to reduce storage footprint of multi-vector retrieval by encoding each document with a fixed, smaller set of learned embeddings. Reduces index sizes by over 50% compared to ColBERT while retaining most effectiveness. ([Read more](/details/constbert.md)) `Multi Vector` `Compression` `Colbert` (Research Papers & Surveys)
README
# ConstBERT
## Efficient Constant-Space Multi-Vector Retrieval
**Code coming soon!**
This repository contains the source code for the paper:
**Efficient Constant-Space Multi-Vector Retrieval**
by [Sean MacAvaney](https://macavaney.us/), [Antonio Mallia](https://antoniomallia.it), and [Nicola Tonellotto](https://tonellotto.github.io/), published at **ECIR 2025**.
📄 [Read the paper (PDF)](./ConstBERT.pdf)
🏆 ConstBERT received the **Best Short Paper Honourable Mention** at **ECIR 2025**.
---
## 🔍 Overview
**ConstBERT** (Constant-Space BERT) is a multi-vector retrieval model designed for efficient and effective passage retrieval. It modifies the ColBERT architecture by encoding documents into a *fixed number of learned embeddings*, significantly reducing index size and improving storage and OS paging efficiency — all while retaining high retrieval effectiveness.
### Key Features:
- Fixed-size document representation (e.g., 32 vectors per document)
- Late interaction (MaxSim) for scoring
- End-to-end training of a pooling mechanism
- Comparable performance to ColBERT on MSMARCO and BEIR
- Efficient indexing and storage
---
## 🔗 Model Access
The pretrained model is available on Hugging Face:
👉 [https://huggingface.co/pinecone/ConstBERT](https://huggingface.co/pinecone/ConstBERT)
```python
from transformers import AutoModel
import numpy as np


def max_sim(q: np.ndarray, d: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document."""
    assert q.ndim == 2 and d.ndim == 2
    # scores[i, j] = dot product of document vector i and query token j
    scores = np.dot(d, q.T)
    # Best-matching document vector per query token, summed over query tokens
    return float(np.sum(np.max(scores, axis=0)))


# trust_remote_code=True runs the model code shipped in the Hugging Face repository
model = AutoModel.from_pretrained("pinecone/ConstBERT", trust_remote_code=True)

queries = ["What is the capital of France?", "latest advancements in AI"]
documents = [
    "Paris is the capital and most populous city of France.",
    "Artificial intelligence is rapidly evolving with new breakthroughs.",
    "The Eiffel Tower is a famous landmark in Paris.",
]

query_embeddings = model.encode_queries(queries).numpy()
document_embeddings = model.encode_documents(documents).numpy()

# The France query should score higher against the Paris passage than the AI passage
print(max_sim(query_embeddings[0], document_embeddings[0]) > max_sim(query_embeddings[0], document_embeddings[1]))
# Output: True
```
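Because every document is encoded into the same number of vectors, a whole collection can be held in one dense array and scored against a query in a single vectorized step. The snippet below is a minimal sketch of that idea, not part of the released code: `rank_documents` is a hypothetical helper, and random unit vectors stand in for real model output so the example runs without downloading the model.

```python
import numpy as np


def rank_documents(q: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Return document indices sorted by descending MaxSim score.

    q:    (num_query_tokens, dim) query token embeddings
    docs: (num_docs, C, dim) fixed-size document embeddings
    """
    # sim[n, c, t] = dot product of document n's vector c with query token t
    sim = np.einsum("ncd,td->nct", docs, q)
    # Best-matching document vector per query token, then sum over query tokens
    scores = sim.max(axis=1).sum(axis=1)
    return np.argsort(-scores)


# Stand-in data (assumption): random unit vectors instead of real embeddings
rng = np.random.default_rng(0)
docs = rng.normal(size=(3, 32, 128))
docs /= np.linalg.norm(docs, axis=-1, keepdims=True)

# A synthetic "query" made of five vectors copied from document 2
q = docs[2, :5]

order = rank_documents(q, docs)
print(order[0])  # index of the top-ranked document
```

With real embeddings, `docs` would be the array returned by `model.encode_documents(...)` and `q` a single row of the `model.encode_queries(...)` output.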
## 📦 Model Details
- **Architecture**: BERT-based encoder with a learned pooling layer
- **Embedding size**: 128
- **Document vectors per passage**: 32
- **Interaction**: MaxSim between document and query embeddings
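To make the storage implication of these numbers concrete, here is a back-of-the-envelope calculation. It is only a sketch: the collection size, the 2-byte (float16) storage per value, and the average of 80 token vectors per passage for a per-token index are all hypothetical values chosen for illustration, not figures from the paper.

```python
def index_bytes(num_docs: int, vectors_per_doc: float,
                dim: int = 128, bytes_per_value: int = 2) -> float:
    """Raw embedding storage, ignoring any compression or index overhead."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


NUM_DOCS = 1_000_000       # hypothetical collection size
AVG_TOKENS_PER_DOC = 80    # hypothetical average for a per-token index

fixed = index_bytes(NUM_DOCS, 32)                     # 32 vectors per passage
per_token = index_bytes(NUM_DOCS, AVG_TOKENS_PER_DOC)

print(f"fixed 32 vectors: {fixed / 1e9:.2f} GB")
print(f"per-token index:  {per_token / 1e9:.2f} GB")

# With a fixed vector count every document record has the same size,
# so the record for document i starts at byte offset i * record_size.
record_size = 32 * 128 * 2
print(record_size)
```

Under these assumptions each document occupies a constant-size record, which is what allows direct offset-based lookup instead of a variable-length layout.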
### How it works
ConstBERT compresses token-level BERT embeddings into a *fixed number (C)* of document-level vectors using a learned linear projection. These vectors capture diverse semantic aspects of the document. Relevance is computed via a MaxSim operation between the query token embeddings and the fixed document vectors.
This design offers a trade-off between **storage/computation efficiency** and **retrieval effectiveness**, configurable by choosing the number of vectors `C`.
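The pooling step can be illustrated with a small NumPy sketch. This is an assumption-laden illustration rather than the authors' implementation: it treats the learned projection as a single weight matrix that mixes `M` padded token embeddings into `C` vectors, uses random values in place of trained weights and BERT output, picks `M = 180` arbitrarily, and L2-normalizes the result as ColBERT-style models typically do.

```python
import numpy as np

M, DIM, C = 180, 128, 32   # M (padded token count) is a hypothetical value

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(M, DIM))   # stand-in for BERT token output
W = rng.normal(size=(M, C)) / np.sqrt(M)       # stand-in for learned weights


def pool(tokens: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Mix M token embeddings into C vectors and L2-normalize each one."""
    pooled = weights.T @ tokens                # shape (C, DIM)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)


doc_vectors = pool(token_embeddings, W)
print(doc_vectors.shape)  # (32, 128)
```

Changing `C` changes only the number of rows in the output, which is the knob the trade-off above refers to: fewer rows mean less storage and fewer dot products per query token.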
---
Please cite the following paper if you use this code, or a modified version of it:
```bibtex
@inproceedings{constbert,
  title={Efficient Constant-Space Multi-Vector Retrieval},
  author={MacAvaney, Sean and Mallia, Antonio and Tonellotto, Nicola},
  booktitle={The 47th European Conference on Information Retrieval ({ECIR})},
  year={2025}
}
```
## 📎 Related Resources
- 🔬 [ColBERT: Original Multi-vector Retrieval Framework](https://github.com/stanford-futuredata/ColBERT)
- 📝 [Pinecone Blog](https://www.pinecone.io/blog/cascading-retrieval-with-multi-vector-representations/)
- 🔗 [The Turing Post](https://www.turingpost.com/p/bert)
---