https://github.com/orionw/promptriever
The first dense retrieval model that can be prompted like an LM
- Host: GitHub
- URL: https://github.com/orionw/promptriever
- Owner: orionw
- Created: 2024-09-11T17:17:15.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-05-08T14:23:10.000Z (2 months ago)
- Last Synced: 2025-05-08T15:29:32.909Z (2 months ago)
- Language: Python
- Homepage: https://arxiv.org/abs/2409.11136
- Size: 61.5 KB
- Stars: 71
- Watchers: 2
- Forks: 3
- Open Issues: 2
- Metadata Files:
  - Readme: README.md
README
# Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
Official repository for the paper [Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models](https://arxiv.org/abs/2409.11136).
This repository contains the code and resources for Promptriever, which demonstrates that retrieval models can be controlled with prompts on a per-instance basis, similar to language models.
NOTICE: the **MTEB version of Promptriever is broken in v1; please use the v2 branch**, which will become the main branch soon.
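Concretely, per-instance prompting means appending a natural-language instruction to an individual query before encoding it. The sketch below illustrates the idea, assuming the checkpoint loads through the standard `transformers` `AutoModel` API and uses RepLLaMA-style last-token pooling with `query:`/`passage:` prefixes; the released checkpoints may instead require merging a PEFT/LoRA adapter onto the base model first, so treat this as illustrative rather than the exact pipeline:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed loading path; depending on how the checkpoint is packaged you
# may need to merge a LoRA adapter onto the base LLaMA model first.
model_name = "samaya-ai/promptriever-llama2-7b-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

def embed(text: str) -> torch.Tensor:
    # RepLLaMA-style pooling (assumed): append EOS and take the hidden
    # state of the final token as the dense vector.
    inputs = tokenizer(text + tokenizer.eos_token, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return torch.nn.functional.normalize(hidden[0, -1], dim=-1)

# The instruction rides along with this one query, steering retrieval
# per instance the way a prompt steers an LM.
q = embed("query: what causes rainbows? Only consider passages that explain the physics.")
p = embed("passage: Rainbows appear when sunlight is refracted and reflected by water droplets.")
print(float(q @ p))  # cosine similarity (vectors are L2-normalized)
```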
## Table of Contents
- [Links](#links)
- [Setup](#setup)
- [Experiments](#experiments)
  - [MSMARCO](#msmarco-experiments)
  - [BEIR](#beir-experiments)
- [Training](#training)
- [Utilities](#utilities)
- [Citation](#citation)

## Links
| Binary | Description |
|:-------|:------------|
| [samaya-ai/promptriever-llama2-7b-v1](https://huggingface.co/samaya-ai/promptriever-llama2-7b-v1) | A Promptriever bi-encoder model based on LLaMA 2 (7B parameters).|
| [samaya-ai/promptriever-llama3.1-8b-instruct-v1](https://huggingface.co/samaya-ai/promptriever-llama3.1-8b-instruct-v1) | A Promptriever bi-encoder model based on LLaMA 3.1 Instruct (8B parameters).|
| [samaya-ai/promptriever-llama3.1-8b-v1](https://huggingface.co/samaya-ai/promptriever-llama3.1-8b-v1) | A Promptriever bi-encoder model based on LLaMA 3.1 (8B parameters).|
| [samaya-ai/promptriever-mistral-v0.1-7b-v1](https://huggingface.co/samaya-ai/promptriever-mistral-v0.1-7b-v1) | A Promptriever bi-encoder model based on Mistral v0.1 (7B parameters). |
| [samaya-ai/RepLLaMA-reproduced](https://huggingface.co/samaya-ai/RepLLaMA-reproduced) | A reproduction of the RepLLaMA model (no instructions). A bi-encoder based on LLaMA 2, trained on the [tevatron/msmarco-passage-aug](https://huggingface.co/datasets/Tevatron/msmarco-passage-aug) dataset. |
| [samaya-ai/msmarco-w-instructions](https://huggingface.co/samaya-ai/msmarco-w-instructions) | A dataset of MS MARCO with added instructions and instruction-negatives, used for training the above models. |
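For a quick look at the training data, here is a minimal sketch using the Hugging Face `datasets` library; the split name and field layout are assumptions for illustration, so inspect the loaded dataset for the actual schema:

```python
from datasets import load_dataset

# The "train" split name is an assumption; check ds.column_names for
# the real schema after loading.
ds = load_dataset("samaya-ai/msmarco-w-instructions", split="train")
print(ds.column_names)  # actual fields (query, instruction, passages, ...)
print(ds[0])            # one training instance
```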
## Setup

To initialize your research environment:
```bash
bash setup/install_conda.sh # if you don't have conda already
bash setup/install_req.sh
pip install git+https://github.com/orionw/tevatron
```

## Experiments
### MSMARCO Experiments
Run an MSMARCO experiment (DL19, DL20, Dev) with:

```bash
bash msmarco/encode_corpus.sh
bash msmarco/encode_queries.sh
bash msmarco/search.sh
```

### BEIR Experiments
To reproduce the BEIR experiments, you can either use the batch method (running all models):

```bash
bash scripts/beir/matrix_of_corpus.sh
bash scripts/beir/matrix_of_prompts.sh
bash scripts/beir/search_all_prompts.sh
```

or run just one model with:
```bash
bash beir/run_all.sh
bash beir/run_all_prompts.sh
bash beir/search_all_prompts.sh
```

The `beir/bm25` subfolder contains scripts for BM25 baseline experiments, using [BM25S](https://github.com/xhluca/bm25s).
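For orientation, the core of a BM25S run looks roughly like the toy sketch below (an in-memory corpus rather than the exact invocation used in those scripts):

```python
import bm25s

# Tiny stand-in corpus; the beir/bm25 scripts do this at BEIR scale.
corpus = [
    "Rainbows form when sunlight refracts inside water droplets.",
    "BM25 is a classic lexical ranking function.",
]
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

# retrieve returns (document indices, scores), each shaped (n_queries, k)
indices, scores = retriever.retrieve(bm25s.tokenize("how do rainbows form"), k=2)
print(indices[0], scores[0])
```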
## Training
To train a Promptriever model, you can use the scripts in `scripts/training/*`:

```bash
bash scripts/training/train.sh
```

Available training scripts:
- `train_instruct.sh` (Llama 2)
- `train_instruct_llama3_instruct.sh`
- `train_instruct_llama3.sh`
- `train_instruct_mistral_v1.sh`
- `train_instruct_mistral.sh` (v0.3)

## Utilities
There are a variety of utilities: scripts to symlink corpus files (to avoid double storage when doing the dev set optimization), to upload models to Hugging Face, and to filter out bad instruction-negatives.

- `utils/symlink_dev.sh` and `utils/symlink_msmarco.sh`: Optimize storage usage by symlinking shared corpus files
- `utils/upload_to_hf_all.py` and `utils/upload_to_hf.py`: Upload models to Hugging Face Hub
- `utils/validate_all_present.py`: Validate dataset completeness
- `filtering/filter_query_doc_pairs_from_batch_gpt.py`: Implement advanced filtering using GPT model outputs
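As a point of reference, the upload utilities boil down to the standard `huggingface_hub` folder upload; in this sketch the repo id and local checkpoint path are placeholders:

```python
from huggingface_hub import HfApi

# Placeholder repo id and checkpoint directory; substitute your own.
api = HfApi()
api.create_repo("your-org/your-promptriever-checkpoint", exist_ok=True)
api.upload_folder(
    folder_path="outputs/promptriever-llama2-7b",
    repo_id="your-org/your-promptriever-checkpoint",
    repo_type="model",
)
```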
## Citation

If you find the code, data, or models useful, feel free to cite:
```bibtex
@article{weller2024promptriever,
  title={Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
  author={Orion Weller and Benjamin Van Durme and Dawn Lawrie and Ashwin Paranjape and Yuhao Zhang and Jack Hessel},
  year={2024},
  eprint={2409.11136},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2409.11136},
}
```