{"id":13677868,"url":"https://github.com/texttron/tevatron","last_synced_at":"2025-04-29T12:32:11.014Z","repository":{"id":38328612,"uuid":"404550703","full_name":"texttron/tevatron","owner":"texttron","description":"Tevatron - A flexible toolkit for neural retrieval research and development.","archived":false,"fork":false,"pushed_at":"2024-03-14T17:01:46.000Z","size":21239,"stargazers_count":373,"open_issues_count":20,"forks_count":74,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-03-15T18:18:26.689Z","etag":null,"topics":["dense-retrieval","dpr","flax","information-retrieval","jax","pytorch","question-answering","transformer"],"latest_commit_sha":null,"homepage":"http://tevatron.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/texttron.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-09-09T01:46:10.000Z","updated_at":"2024-04-28T00:25:47.928Z","dependencies_parsed_at":"2023-02-16T00:15:53.156Z","dependency_job_id":"e7adccaf-c8aa-47e5-94be-258cfda84077","html_url":"https://github.com/texttron/tevatron","commit_stats":{"total_commits":160,"total_committers":10,"mean_commits":16.0,"dds":0.40625,"last_synced_commit":"559cf0fab1691ca0e814ddef42b3288cc1d09a4b"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/texttron%2Ftevatron","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/texttron%2Ftevatron/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/texttron%2Ftevatron/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/texttron%2Ftevatron/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/texttron","download_url":"https://codeload.github.com/texttron/tevatron/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224173242,"owners_count":17268074,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dense-retrieval","dpr","flax","information-retrieval","jax","pytorch","question-answering","transformer"],"created_at":"2024-08-02T13:00:48.127Z","updated_at":"2025-04-29T12:32:10.972Z","avatar_url":"https://github.com/texttron.png","language":"Python","funding_links":[],"categories":["Python","Embedding Fine-tuning"],"sub_categories":["Frameworks"],"readme":"# Tevatron V2.0\n\n\u003cdiv align=\"center\"\u003e\n\u003ca href=\"https://arxiv.org/abs/2203.05765\" target=\"_blank\"\u003e\u003cimg src=https://img.shields.io/badge/arXiv-b5212f.svg?logo=arxiv\u003e\u003c/a\u003e\n\u003ca href=\"https://huggingface.co/Tevatron\" target=\"_blank\"\u003e\u003cimg src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Datasets-27b3b4.svg\u003e\u003c/a\u003e\n\u003ca href=\"https://opensource.org/license/apache-2-0\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=License\u0026message=Apache-2.0\u0026color=red\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pepy.tech/projects/tevatron\"\u003e\u003cimg src=\"https://static.pepy.tech/badge/tevatron\" alt=\"PyPI 
Downloads\"\u003e\u003c/a\u003e\n\u003ca href=\"https://star-history.com/#texttron/tevatron\"\u003e \u003cimg src=\"https://img.shields.io/github/stars/texttron/tevatron?style=social\" alt=\"GitHub stars\"\u003e \u003c/a\u003e\n\u003c!--   --\u003e\n\u003c/div\u003e\n\nTevatron: Unified Document Retrieval Toolkit across Scale, Language, and Modality.\n\n\u003e Some of the features in Tevatron v1 have not yet been migrated to Tevatron v2.0. We are working on it.\n\u003e If you are looking for the Tevatron v1 features, please check out the [v1 branch](https://github.com/texttron/tevatron/tree/tevatron-v1).\n\n## Features\n- Training billion-scale LLM neural retrievers on GPUs and TPUs.\n- Parameter-efficient tuning with LoRA.\n- Integration with vLLM, DeepSpeed, FlashAttention, gradient accumulation, and other efficient training and inference techniques.\n- Self-contained [HuggingFace datasets](https://huggingface.co/Tevatron) for multi-modal and multilingual neural retrieval and open-domain QA tasks.\n- Direct loading and fine-tuning of SoTA pre-trained models (BGE-Embedding, Instruct-E5) from HuggingFace.\n\n## Installation\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003ePyTorch (GPU)\u003c/b\u003e\u003c/summary\u003e\n\n0. Clone the repository.\n1. Install PyTorch based on your CUDA version from [PyTorch](https://pytorch.org/get-started/locally/).\n2. Install dependencies and Tevatron.\n```bash\npip install transformers datasets peft\npip install deepspeed accelerate\npip install faiss-cpu\npip install -e .\n```\n\n\n\u003c/details\u003e\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eJAX (TPU)\u003c/b\u003e\u003c/summary\u003e\n\n0. Clone the repository.\n1. Install JAX by following the [official guide](https://jax.readthedocs.io/en/latest/installation.html#pip-installation-google-cloud-tpu).\n2. Install dependencies.\n```bash\npip install transformers datasets\npip install flax optax\n```\n3. 
Install Magix and GradCache.\n```bash\ngit clone https://github.com/luyug/magix.git\ncd magix \u0026\u0026 pip install -e . \u0026\u0026 cd ..\ngit clone https://github.com/luyug/GradCache.git\ncd GradCache \u0026\u0026 pip install -e . \u0026\u0026 cd ..\n```\n\n4. Install Tevatron.\n```bash\npip install -e .\n```\n\n\u003c/details\u003e\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eJAX (GPU)\u003c/b\u003e\u003c/summary\u003e\n\nTo run the JAX implementation of Tevatron on GPU, we recommend using the jax-toolbox [jax container](https://github.com/NVIDIA/JAX-Toolbox/pkgs/container/jax) image from NVIDIA.\n\nBelow is an example Dockerfile that sets up Tevatron on top of the jax container.\n```Dockerfile\nFROM ghcr.io/nvidia/jax:jax-2024-03-08\n\nRUN apt-get update \u0026\u0026 \\\n    apt-get install -y --no-install-recommends python3-pip \u0026\u0026 \\\n    apt-get clean \u0026\u0026 \\\n    rm -rf /var/lib/apt/lists/* \u0026\u0026 \\\n    pip install --no-cache-dir transformers sentencepiece simple_parsing datasets orbax==0.4.8 \u0026\u0026 \\\n    pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu\n\nRUN git clone https://github.com/luyug/magix.git \u0026\u0026 \\\n    cd magix \u0026\u0026 pip install -e . \u0026\u0026 cd .. \u0026\u0026 \\\n    git clone https://github.com/luyug/GradCache.git \u0026\u0026 \\\n    cd GradCache \u0026\u0026 pip install -e . \u0026\u0026 cd .. \u0026\u0026 \\\n    git clone https://github.com/texttron/tevatron.git \u0026\u0026 \\\n    cd tevatron \u0026\u0026 pip install -e .\n```\n\n\n\n\n\u003c/details\u003e\n\n\n\n## Tevatron 101\nIn this example, we will demonstrate how to use Tevatron to LoRA fine-tune a Mistral-7B model on the MS MARCO passage dataset. 
The resulting LLM retriever is expected to reach `MRR@10=42.3` on the MS MARCO dev set with straightforward training.\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eData Preparation\u003c/b\u003e\u003c/summary\u003e\n\nTevatron takes training or inference data in `jsonl` format, with each line a JSON object organized as follows:\n### 1. Training Data\n```json\n{\n   \"query_id\": \"\u003cquery id\u003e\",\n   \"query_text\": \"\u003cquery text\u003e\",\n   \"query_image\": \"\u003cquery image\u003e\",\n   \"positive_document_ids\": [\"\u003cpassage id\u003e\", ...],\n   \"negative_document_ids\": [\"\u003cpassage id\u003e\", ...]\n}\n```\nwhere `positive_document_ids` lists the IDs of the passages annotated as relevant to the query,\nand `negative_document_ids` lists the IDs of passages that are usually non-relevant (hard negatives), taken from the top results of a retrieval system (e.g. BM25, DPR). Additional fields such as `answers` for QA datasets can be included as well.\n\n### 2. Corpus Data\n```json\n{\n   \"docid\": \"\u003cdocument id\u003e\",\n   \"document_text\": \"\u003cdocument text\u003e\",\n   \"document_image\": \"\u003cdocument image\u003e\"\n}\n```\nwhere each line represents a document in the corpus. \n\nNote that the image fields in both training and corpus data are optional and can be omitted (i.e., for purely textual retrieval).\n\n### Self-Contained Dataset\nTevatron provides several commonly used self-contained datasets for neural retrieval. 
\n(via [HuggingFace](https://huggingface.co/Tevatron)).\nThese datasets can be downloaded automatically during training and encoding\nby setting `--dataset_name \u003chgf dataset name\u003e`.\n\nIn this example, we will use the self-contained dataset `Tevatron/msmarco-passage-aug` for training, whose hard negative passages are sampled from a mix of the top-200 BM25 and top-200 CoCondenser results.\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eRun with PyTorch (GPU)\u003c/b\u003e\u003c/summary\u003e\n\n### Training\n\n```bash\ndeepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \\\n  --deepspeed deepspeed/ds_zero3_config.json \\\n  --output_dir retriever-mistral \\\n  --model_name_or_path mistralai/Mistral-7B-v0.1 \\\n  --lora \\\n  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \\\n  --save_steps 50 \\\n  --dataset_name Tevatron/msmarco-passage-aug \\\n  --query_prefix \"Query: \" \\\n  --passage_prefix \"Passage: \" \\\n  --bf16 \\\n  --pooling eos \\\n  --append_eos_token \\\n  --normalize \\\n  --temperature 0.01 \\\n  --per_device_train_batch_size 8 \\\n  --gradient_checkpointing \\\n  --train_group_size 16 \\\n  --learning_rate 1e-4 \\\n  --query_max_len 32 \\\n  --passage_max_len 156 \\\n  --num_train_epochs 1 \\\n  --logging_steps 10 \\\n  --overwrite_output_dir \\\n  --gradient_accumulation_steps 4\n```\n\nIn-batch passages per query: 8x4x16 = 512\n\nNumber of queries per update: 8x4x4 = 128\n\nThe above training setting took about 70 hours on 4xA6000 GPUs.\n\nEquivalent training took about 110 hours on 1xA100 GPU.\n\n\n\n### Encoding\n\n#### Query Encoding\n```bash\nEMBEDDING_OUTPUT_DIR=\u003cfolder to save query embedding\u003e\nCUDA_VISIBLE_DEVICES=4 python -m tevatron.retriever.driver.encode \\\n  --output_dir=temp \\\n  --model_name_or_path mistralai/Mistral-7B-v0.1 \\\n  --lora_name_or_path retriever-mistral \\\n  --lora \\\n  --query_prefix 
\"Query: \" \\\n  --passage_prefix \"Passage: \" \\\n  --bf16 \\\n  --pooling eos \\\n  --append_eos_token \\\n  --normalize \\\n  --encode_is_query \\\n  --per_device_eval_batch_size 128 \\\n  --query_max_len 32 \\\n  --passage_max_len 156 \\\n  --dataset_name Tevatron/msmarco-passage \\\n  --dataset_split dev \\\n  --encode_output_path $EMBEDDING_OUTPUT_DIR/query-dev.pkl\n```\n\n#### Corpus Encoding\n```bash\nEMBEDDING_OUTPUT_DIR=\u003cfolder to save query embedding\u003e\nfor s in 0 1 2 3\ndo\ngpuid=$s\nCUDA_VISIBLE_DEVICES=$gpuid python -m tevatron.retriever.driver.encode \\\n  --output_dir=temp \\\n  --model_name_or_path mistralai/Mistral-7B-v0.1 \\\n  --lora_name_or_path retriever-mistral \\\n  --lora \\\n  --query_prefix \"Query: \" \\\n  --passage_prefix \"Passage: \" \\\n  --bf16 \\\n  --pooling eos \\\n  --append_eos_token \\\n  --normalize \\\n  --per_device_eval_batch_size 128 \\\n  --query_max_len 32 \\\n  --passage_max_len 156 \\\n  --dataset_name Tevatron/msmarco-passage-corpus \\\n  --dataset_number_of_shards 4 \\\n  --dataset_shard_index ${s} \\\n  --encode_output_path $EMBEDDING_OUTPUT_DIR/corpus.${s}.pkl\ndone\n```\n\u003e add \u0026 to the end of the command to run in the background in parallel.\n\n### Retrieval\n```bash\nset -f \u0026\u0026 python -m tevatron.retriever.driver.search \\\n    --query_reps $EMBEDDING_OUTPUT_DIR/query-dev.pkl \\\n    --passage_reps $EMBEDDING_OUTPUT_DIR/corpus*.pkl \\\n    --depth 1000 \\\n    --batch_size 64 \\\n    --save_text \\\n    --save_ranking_to $EMBEDDING_OUTPUT_DIR/run.dev.txt\n```\n\nThe output file is in the format of `\u003cquery_id\u003e \u003cpassage_id\u003e \u003cscore\u003e` in each line.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eRun with JAX (TPU/GPU)\u003c/b\u003e\u003c/summary\u003e\n\n### Training\n\n\u003e For GPU training, set `XLA_PYTHON_CLIENT_MEM_FRACTION=.95` and make sure the query and passage length are multiples of 64 if TransformersEngine is 
installed.\n\n```bash\npython -m tevatron.tevax.experimental.mp.train_lora  \\\n   --checkpoint_dir retriever-mistral-jax \\\n   --train_file Tevatron/msmarco-passage-aug \\\n   --model_name mistralai/Mistral-7B-v0.1 \\\n   --model_type mistral \\\n   --batch_size 128 \\\n   --num_target_passages 16 \\\n   --learning_rate 1e-4 \\\n   --seed 12345 \\\n   --mesh_shape 1 -1 \\\n   --weight_decay 0.00001 \\\n   --num_epochs 1 \\\n   --max_query_length 64 \\\n   --max_passage_length 128 \\\n   --pooling eos \\\n   --scale_by_dim True \\\n   --grad_cache \\\n   --passage_num_chunks 32 \\\n   --query_num_chunks 4\n```\n\nIn-batch passages per query: 128x16 = 2048\n\nNumber of queries per update: 128\n\nThe above training setting took about 35 hours on a v4-8 TPU VM.\n\nEquivalent training took about 80 hours on 1xA100 GPU.\n\n### Encoding\n\n#### Query Encoding\n```bash\npython -m tevatron.tevax.experimental.mp.encode  \\\n   --model_type mistral \\\n   --model_name_or_path mistralai/Mistral-7B-v0.1 \\\n   --model_config_name_or_path mistralai/Mistral-7B-v0.1 \\\n   --tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \\\n   --dataset_name_or_path Tevatron/msmarco-passage \\\n   --split dev \\\n   --output_dir $EMBEDDING_OUTPUT_DIR/query-embedding \\\n   --batch_size 32 \\\n   --input_type query \\\n   --max_seq_length 64 \\\n   --mesh_shape 1 -1 \\\n   --lora retriever-mistral-jax/lora \\\n   --scale_by_dim\n```\n\n#### Corpus Encoding\n```bash\npython -m tevatron.tevax.experimental.mp.encode  \\\n   --model_type mistral \\\n   --model_name_or_path mistralai/Mistral-7B-v0.1 \\\n   --model_config_name_or_path mistralai/Mistral-7B-v0.1 \\\n   --tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \\\n   --dataset_name_or_path Tevatron/msmarco-passage-corpus \\\n   --output_dir $EMBEDDING_OUTPUT_DIR/corpus-embedding \\\n   --batch_size 32 \\\n   --input_type passage \\\n   --max_seq_length 128 \\\n   --mesh_shape 1 -1 \\\n   --lora retriever-mistral-jax/lora \\\n   
--scale_by_dim\n```\n\n### Retrieval\n```bash\nset -f \u0026\u0026 python -m tevatron.retriever.driver.search \\\n    --query_reps $EMBEDDING_OUTPUT_DIR/query-embedding/*.pkl \\\n    --passage_reps $EMBEDDING_OUTPUT_DIR/corpus-embedding/*.pkl \\\n    --depth 1000 \\\n    --batch_size 64 \\\n    --save_text \\\n    --save_ranking_to $EMBEDDING_OUTPUT_DIR/run.dev.txt\n```\n\nThe output file is in the format of `\u003cquery_id\u003e \u003cpassage_id\u003e \u003cscore\u003e` in each line.\n\n\u003c/details\u003e\n\n## Examples\n+ [Unified multi-modal and multilingual retrieval](./examples/multimodal/README.md)\n+ [vLLM encoding and retrieval](./examples/example_repllama_vllm.md)\n\n## Citation\nIf you find Tevatron helpful, please consider citing our [paper](https://arxiv.org/abs/2203.05765).\n```\n@article{Gao2022TevatronAE,\n  title={Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval},\n  author={Luyu Gao and Xueguang Ma and Jimmy J. Lin and Jamie Callan},\n  journal={ArXiv},\n  year={2022},\n  volume={abs/2203.05765}\n}\n```\n\n\n## Contacts\nIf you have a toolkit specific question, feel free to open an issue. \n\nYou can also reach out to us for general comments/suggestions/questions through email.\n- Luyu Gao luyug@cs.cmu.edu\n- Xueguang Ma x93ma@uwaterloo.ca\n\n\n## Acknowledgement\n\n* We thank all the contributors of dependency libraries.\n* We thank Google's [TPU research cloud](https://sites.research.google/trc/about/) for providing TPU resources.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftexttron%2Ftevatron","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftexttron%2Ftevatron","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftexttron%2Ftevatron/lists"}