https://github.com/texttron/tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
https://github.com/texttron/tevatron

dense-retrieval dpr flax information-retrieval jax pytorch question-answering transformer

Last synced: 2 months ago
JSON representation

Tevatron - A flexible toolkit for neural retrieval research and development.

Host: GitHub
URL: https://github.com/texttron/tevatron
Owner: texttron
License: apache-2.0
Created: 2021-09-09T01:46:10.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2024-03-14T17:01:46.000Z (over 1 year ago)
Last Synced: 2024-03-15T18:18:26.689Z (over 1 year ago)
Topics: dense-retrieval, dpr, flax, information-retrieval, jax, pytorch, question-answering, transformer
Language: Python
Homepage: http://tevatron.ai
Size: 20.3 MB
Stars: 373
Watchers: 9
Forks: 74
Open Issues: 20
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Tevatron V2.0

Tevatron: Unified Document Retrieval Toolkit across Scale, Language, and Modality.

> Some of the features in Tevatron v1 is not yet migrated to Tevatron v2.0. We are working on it.
> If you are looking for the Tevatron v1 features, please pull the [v1 branch](https://github.com/texttron/tevatron/tree/tevatron-v1).

## Features
- Training billion-scale LLM neural retriever on GPUs and TPUs.
- Parameter efficient tuning with LoRA.
- Integration with vLLM, DeepSpeed, FlashAttention, gradient accumulation, and other efficient training and inference techniques.
- Self-contained [huggingface datasets](https://huggingface.co/Tevatron) for multi-modal and multilingual neural retrieval and open-domain QA tasks.
- Direct loading and finetuning SoTA pre-trained models (BGE-Embbedding, Instruct-E5) from HuggingFace.

## Installation

PyTorch (GPU)

0. Clone the repository.
1. Install PyTorch based on your CUDA version from [PyTorch](https://pytorch.org/get-started/locally/).
2. Install dependencies and Tevatron.
```bash
pip install transformers datasets peft
pip install deepspeed accelerate
pip install faiss-cpu
pip install -e .
```

JAX (TPU)

0. Clone the repository.
1. Install JAX by following the [official guide](https://jax.readthedocs.io/en/latest/installation.html#pip-installation-google-cloud-tpu)
2. Install dependencies
```bash
pip install transformers datasets
pip install flax optax
```
3. Install Magix and GradCache
```bash
git clone https://github.com/luyug/magix.git
cd magix && pip install -e . && cd ..
git clone https://github.com/luyug/GradCache.git
cd GradCache && pip install -e . && cd ..
```

4. Install Tevatron
```bash
pip install -e .
```

JAX (GPU)

To run the JAX implementation of Tevatron on GPU, we encourage using the jax-toolbox [jax container](https://github.com/NVIDIA/JAX-Toolbox/pkgs/container/jax) image from NVIDIA.

Below is a Dockerfile example to set up Tevatron on top of the jax container.
```Dockerfile
FROM ghcr.io/nvidia/jax:jax-2024-03-08

RUN apt-get update && \
apt-get install -y --no-install-recommends python3-pip && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
pip install --no-cache-dir transformers sentencepiece simple_parsing datasets orbax==0.4.8 && \
pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu

RUN git clone https://github.com/luyug/magix.git && \
cd magix && pip install -e . && cd .. && \
git clone https://github.com/luyug/GradCache.git \
cd GradCache && pip install -e . && cd .. \
git clone https://github.com/texttron/tevatron.git && \
cd tevatron && pip install -e .
```

## Tevatron 101
In this example, we will demonstrate how to use Tevatron to LoRA fine-tune a Mistral-7B model on the MSMARCO passage dataset. The obtained LLM Retriever is expected to have `MRR@10=42.3` on the MS MARCO dev set with straightforward training.

Data Preparation

Tevatron takes training or inference data in `jsonl` format with each line organized as a json object as follows:
### 1. Training Data
```json
{
"query_id": "",
"query_text": "",
"query_image": "",
"positive_document_ids": ["", ...],
"negative_document_ids": ["", ...],
}
```
where the passages in `positive_passages` are the annotated relevant passages of the `query`
and passages in `negative_passages` are usually non-relevant (hard negative) passages from top results of a retrieval system (e.g. BM25, DPR). Additional fields such as `answers` for QA datasets can be included as well.

#### 2. Corpus Data
```json
{
"docid": "",
"document_text": "",
"document_image": "",
}
```
where each line represents a document in the corpus.

Note that the image field for both training and corpus data are optional and can be omitted (i.e., pure textual modality retrieval).

### Self-Contained Dataset
Tevatron self-contained several commonlly used datasets for neural retrieval.
(via [HuggingFace](https://huggingface.co/Tevatron)).
These datasets can downloaded automatically during training and encoding
by setting `--dataset_name `.

In this example, we will use the self-contained dataset `Tevatron/msmarco-passage-aug` for training, whose hard negative passages are sampled from the mix of top200 BM25 and top200 CoCondenser results.

Run with PyTorch (GPU)

### Training

```bash
deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
--deepspeed deepspeed/ds_zero3_config.json \
--output_dir retriever-mistral \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--lora \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
--save_steps 50 \
--dataset_name Tevatron/msmarco-passage-aug \
--query_prefix "Query: " \
--passage_prefix "Passage: " \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--temperature 0.01 \
--per_device_train_batch_size 8 \
--gradient_checkpointing \
--train_group_size 16 \
--learning_rate 1e-4 \
--query_max_len 32 \
--passage_max_len 156 \
--num_train_epochs 1 \
--logging_steps 10 \
--overwrite_output_dir \
--gradient_accumulation_steps 4
```

In batch passages per query: 8x4x16 = 512

Number of queries per update: 8x4x4 = 128

The above training setting tooks about 70 hours on 4xA6000 GPU.

Equivalent training tooks about 110 hours on 1xA100 GPU.

### Encoding

#### Query Encoding
```bash
EMBEDDING_OUTPUT_DIR=
CUDA_VISIBLE_DEVICES=4 python -m tevatron.retriever.driver.encode \
--output_dir=temp \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--lora_name_or_path retriever-mistral \
--lora \
--query_prefix "Query: " \
--passage_prefix "Passage: " \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--encode_is_query \
--per_device_eval_batch_size 128 \
--query_max_len 32 \
--passage_max_len 156 \
--dataset_name Tevatron/msmarco-passage \
--dataset_split dev \
--encode_output_path $EMBEDDING_OUTPUT_DIR/query-dev.pkl
```

#### Corpus Encoding
```bash
EMBEDDING_OUTPUT_DIR=
for s in 0 1 2 3
do
gpuid=$s
CUDA_VISIBLE_DEVICES=$gpuid python -m tevatron.retriever.driver.encode \
--output_dir=temp \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--lora_name_or_path retriever-mistral \
--lora \
--query_prefix "Query: " \
--passage_prefix "Passage: " \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--per_device_eval_batch_size 128 \
--query_max_len 32 \
--passage_max_len 156 \
--dataset_name Tevatron/msmarco-passage-corpus \
--dataset_number_of_shards 4 \
--dataset_shard_index ${s} \
--encode_output_path $EMBEDDING_OUTPUT_DIR/corpus.${s}.pkl
done
```
> add & to the end of the command to run in the background in parallel.

### Retrieval
```bash
set -f && python -m tevatron.retriever.driver.search \
--query_reps $EMBEDDING_OUTPUT_DIR/query-dev.pkl \
--passage_reps $EMBEDDING_OUTPUT_DIR/corpus*.pkl \
--depth 1000 \
--batch_size 64 \
--save_text \
--save_ranking_to $EMBEDDING_OUTPUT_DIR/run.dev.txt
```

The output file is in the format of ` ` in each line.

Run with JAX (TPU/GPU)

### Training

> For GPU training, set `XLA_PYTHON_CLIENT_MEM_FRACTION=.95` and make sure the query and passage length are multiples of 64 if TransformersEngine is installed.

```bash
python -m tevatron.tevax.experimental.mp.train_lora \
--checkpoint_dir retriever-mistral-jax \
--train_file Tevatron/msmarco-passage-aug \
--model_name mistralai/Mistral-7B-v0.1 \
--model_type mistral \
--batch_size 128 \
--num_target_passages 16 \
--learning_rate 1e-4 \
--seed 12345 \
--mesh_shape 1 -1 \
--weight_decay 0.00001 \
--num_epochs 1 \
--max_query_length 64 \
--max_passage_length 128 \
--pooling eos \
--scale_by_dim True \
--grad_cache \
--passage_num_chunks 32 \
--query_num_chunks 4
```

In batch passages per query: 128x16 = 2048

Number of queries per update: 128

The above training setting tooks about 35 hours on a v4-8 TPU VM.

Equivalent training tooks about 80 hours on 1xA100 GPU.

### Encoding

#### Query Encoding
```bash
python -m tevatron.tevax.experimental.mp.encode \
--model_type mistral \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--model_config_name_or_path mistralai/Mistral-7B-v0.1 \
--tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \
--dataset_name_or_path Tevatron/msmarco-passage \
--split dev \
--output_dir $EMBEDDING_OUTPUT_DIR/query-embedding \
--batch_size 32 \
--input_type query \
--max_seq_length 64 \
--mesh_shape 1 -1 \
--lora retriever-mistral-jax/lora \
--scale_by_dim
```

#### Corpus Encoding
```bash
python -m tevatron.tevax.experimental.mp.encode \
--model_type mistral \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--model_config_name_or_path mistralai/Mistral-7B-v0.1 \
--tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \
--dataset_name_or_path Tevatron/msmarco-passage-corpus \
--output_dir $EMBEDDING_OUTPUT_DIR/corpus-embedding \
--batch_size 32 \
--input_type passage \
--max_seq_length 128 \
--mesh_shape 1 -1 \
--lora retriever-mistral-jax/lora \
--scale_by_dim
```

### Retrieval
```bash
set -f && python -m tevatron.retriever.driver.search \
--query_reps $EMBEDDING_OUTPUT_DIR/query-embedding/*.pkl \
--passage_reps $EMBEDDING_OUTPUT_DIR/corpus-embedding/*.pkl \
--depth 1000 \
--batch_size 64 \
--save_text \
--save_ranking_to $EMBEDDING_OUTPUT_DIR/run.dev.txt
```

The output file is in the format of ` ` in each line.

## Examples
+ [Unified multi-modal and multilingual retrieval](./examples/multimodal/README.md)
+ [vLLM encoding and retrieval](./examples/example_repllama_vllm.md)

## Citation
If you find Tevatron helpful, please consider citing our [paper](https://arxiv.org/abs/2203.05765).
```
@article{Gao2022TevatronAE,
title={Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval},
author={Luyu Gao and Xueguang Ma and Jimmy J. Lin and Jamie Callan},
journal={ArXiv},
year={2022},
volume={abs/2203.05765}
}
```

## Contacts
If you have a toolkit specific question, feel free to open an issue.

You can also reach out to us for general comments/suggestions/questions through email.
- Luyu Gao [email protected]
- Xueguang Ma [email protected]

## Acknowledgement

* We thank all the contributors of dependency libraries.
* We thank Google's [TPU research cloud](https://sites.research.google/trc/about/) for providing TPU resources.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/texttron/tevatron

Awesome Lists containing this project

README