{"id":27891137,"url":"https://github.com/dmis-lab/retpo","last_synced_at":"2025-05-05T11:53:38.173Z","repository":{"id":285231023,"uuid":"925018366","full_name":"dmis-lab/RetPO","owner":"dmis-lab","description":"[NAACL 2025 Findings] Ask Optimal Questions: Aligning Large Language Models with Retriever's Preference in Conversational Search","archived":false,"fork":false,"pushed_at":"2025-04-30T08:03:14.000Z","size":639,"stargazers_count":3,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-30T09:22:58.835Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2402.11827","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmis-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-31T04:13:12.000Z","updated_at":"2025-04-30T08:03:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"648c81e7-f55a-4560-a230-0c66471aeb32","html_url":"https://github.com/dmis-lab/RetPO","commit_stats":null,"previous_names":["dmis-lab/retpo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FRetPO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FRetPO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FRetPO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FRetPO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/dmis-lab","download_url":"https://codeload.github.com/dmis-lab/RetPO/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252495077,"owners_count":21757224,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-05T11:53:37.241Z","updated_at":"2025-05-05T11:53:38.094Z","avatar_url":"https://github.com/dmis-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RetPO\n\nOfficial implementation of \"[Ask Optimal Questions: Aligning Large Language Models with Retriever’s Preference in Conversational Search](https://arxiv.org/abs/2402.11827)\". 
\u003cbr\u003e\n\n\u003e [Chanwoong Yoon\u003csup\u003e1*\u003c/sup\u003e](https://scholar.google.com/citations?user=-9GfY0AAAAAJ\u0026hl=en), [Gangwoo Kim\u003csup\u003e1*\u003c/sup\u003e](https://scholar.google.com/citations?user=TmWGEFgAAAAJ\u0026hl=en), [Byeongguk Jeon\u003csup\u003e1\u003c/sup\u003e](https://scholar.google.com/citations?user=_Kw32VoAAAAJ\u0026hl=en), [Sungdong Kim\u003csup\u003e2,3\u003c/sup\u003e](https://scholar.google.com/citations?user=xKrSnDoAAAAJ\u0026hl=en), [Yohan Jo\u003csup\u003e4\u003c/sup\u003e](https://scholar.google.com/citations?user=xp3LGRQAAAAJ\u0026hl=en), [Jaewoo Kang\u003csup\u003e1\u003c/sup\u003e](https://scholar.google.co.kr/citations?user=RaBZafQAAAAJ\u0026hl=en)\u003cbr\u003e\n\u003e Korea University\u003csup\u003e1\u003c/sup\u003e, NAVER Cloud\u003csup\u003e2\u003c/sup\u003e, KAIST AI\u003csup\u003e3\u003c/sup\u003e, Seoul National University\u003csup\u003e4\u003c/sup\u003e \u003cbr\u003e\n\u003e In NAACL 2025.\n\n\u003cp align=\"center\"\u003e\n    📃 \u003ca href=\"https://arxiv.org/abs/2402.11827\" target=\"_blank\"\u003ePaper\u003c/a\u003e | 🤗 \u003ca href=\"\" target=\"_blank\"\u003eModel\u003c/a\u003e | 🤗 \u003ca href=\"https://huggingface.co/datasets/dmis-lab/RF-Collection\" target=\"_blank\"\u003eRF-Collection\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"50%\" src='./assets/overview.png' alt='Overview Image'/\u003e\n\u003c/p\u003e\n\n\u003e **Abstract** Conversational search, unlike single-turn retrieval tasks, requires understanding the current question within a dialogue context. The common approach of rewrite-then-retrieve aims to decontextualize questions to be self-sufficient for off-the-shelf retrievers, but most existing methods produce sub-optimal query rewrites due to the limited ability to incorporate signals from the retrieval results. 
To overcome this limitation, we present a novel framework RetPO (Retriever's Preference Optimization), which is designed to optimize a language model (LM) for reformulating search queries in line with the preferences of the target retrieval systems. The process begins by prompting a large LM to produce various potential rewrites and then collects retrieval performance for these rewrites as the retrievers' preferences. Through the process, we construct a large-scale dataset called RF collection, containing Retrievers' Feedback on over 410K query rewrites across 12K conversations. Furthermore, we fine-tune a smaller LM using this dataset to align it with the retrievers' preferences as feedback. The resulting model demonstrates superiority on two benchmarks, surpassing the previous state-of-the-art performance of rewrite-then-retrieve approaches, including GPT-3.5.\n\n## Content\n1. Installation Instructions\n2. Evaluation\n3. RetPO (Retriever's Preference Optimization) \n\n## 1. Installation Instructions\nPlease be aware that we utilize two distinct environments.\n1. retpo_search (retriever indexing and search)\n2. retpo_qr (QR model training and inference)\n\u003e The base retrieval code uses faiss-gpu, which is tied to specific versions of CUDA and torch. If the versions do not match, errors may occur. 
Therefore, we use separate environments.\n\n### retpo_search\nBecause our pipeline runs a large number of dense-retriever searches, we recommend using faiss-gpu.\n```bash\n# create environment\nconda create -n retpo_search python=3.9 \u0026\u0026 conda activate retpo_search\n\n# install torch\npip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116\n\n# faiss-cpu or faiss-gpu\n# CPU\npip install faiss-cpu==1.7.3\n# GPU\npip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl\n\n# other requirements\npip install -r requirements.txt\n\n```\n\n### retpo_qr\n```bash\n# create environment\ncd retpo_qr/\nconda create -n retpo_qr python=3.10 \u0026\u0026 conda activate retpo_qr\n\n# install torch\npip install torch==2.1.0 # this specific version is crucial for reproducibility. you may need to install other variants based on your hardware.\n\n# install dependencies\npython -m pip install .\n\n# Flash Attention 2 (Optional, but Recommended for Faster Training)\n# If your machine has less than 96GB of RAM and many CPU cores, reduce MAX_JOBS, e.g.: MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation\npython -m pip install flash-attn --no-build-isolation\n```\n\n## 2. Evaluation\n\n### Preparation\nWe mainly evaluate our method using two types of retrievers: BM25 and \u003ca href=\"https://github.com/microsoft/ANCE\" target=\"_blank\"\u003eANCE\u003c/a\u003e on two conversational QA benchmarks: \u003ca href=\"https://github.com/McGill-NLP/topiocqa\" target=\"_blank\"\u003eTopiOCQA\u003c/a\u003e and \u003ca href=\"https://github.com/apple/ml-qrecc\" target=\"_blank\"\u003eQReCC\u003c/a\u003e.\n\nThere are well-organized repositories for preprocessing these datasets and indexing passages for retrieval. We recommend using them before running our code. 
We mainly follow the \u003ca href=\"https://github.com/fengranMark/ConvGQR\" target=\"_blank\"\u003e_ConvGQR_\u003c/a\u003e repository.\n\nSpecifically, to run our code, you need to prepare the following files and folders.\n\n\u003e You can find the code to prepare these folders here:  \n\u003e pyserini_index/ # https://github.com/fengranMark/ConvGQR/blob/main/bm25/create_index.sh  \n\u003e tokenized/ # https://github.com/fengranMark/ConvGQR/blob/main/gen_tokenized_doc.py  \n\u003e embeddings/ # https://github.com/fengranMark/ConvGQR/blob/main/gen_doc_embeddings.py  \n\n```bash\nROOT_DIR/\n└── datasets/\n    └── checkpoints # Retriever checkpoints\n        └── ad-hoc-ance-msmarco # https://huggingface.co/3ricL/ad-hoc-ance-msmarco\n    └── topiocqa/\n        ├── pyserini_index/\n        ├── full_wiki_segments.tsv\n        ├── tokenized/\n        ├── embeddings/\n    └── qrecc/\n        ├── pyserini_index/ \n        ├── full_wiki_segments.tsv\n        ├── tokenized/\n        ├── embeddings/\n\n```\n\n### Reproduce our performance\nTo reproduce our reported performance, download the queries generated by RetPO from [this Google Drive](https://drive.google.com/drive/folders/1YyXpzb8QXjaKajI1kNKywAn3ZZtSGCDe?usp=drive_link). (Place them in the ```ROOT_DIR/distill_outputs``` directory of the repository.)\n\nYou can reproduce our main results by running the following command.\n```bash\ncd eval\nbash ./scripts/bm25_topiocqa.sh\n```\n\n## 3. 
RetPO (Retriever's Preference Optimization) \n\n### Download RF-Collection\nWe construct a large-scale dataset called RF-COLLECTION, containing Retrievers’ Feedback on over 410K query rewrites across 12K conversations.\nYou can download it from [Huggingface](https://huggingface.co/datasets/dmis-lab/RF-Collection) using the following command.\n```python\nfrom datasets import load_dataset\n\nds = load_dataset(\"dmis-lab/RF-Collection\", cache_dir=\"{ROOT_DIR}/retpo_qr/\")\n```\n\u003c!-- \n### Running the Model\n\nSet corresponding environment variables first. \u003cbr\u003e\nFor $TRAIN_DATA, please set `train.json` for orconvqa or `train_filtered.json` for qrecc (only using examples including gt passages) \u003cbr\u003e\nOr you can define another dataset containing hard negatives after mining them. e.g., `train_negs.json` or `train_filtered_negs.json` \u003cbr\u003e\n\n```\nexport DATA_PATH=preprocessed\nexport OUTPUT_PATH=outputs\nexport N_GPU=$N_GPU\nexport TRAIN_DATA=$TRAIN_DATA\n```\n\n```\npython3 ddp_launcher.py \\\n  --data_path $DATA_PATH \\\n  --output_path $OUTPUT_PATH \\\n  --task $TASK \\\n  --model_name_or_path facebook/dpr-question_encoder-single-nq-base \\\n  --train_data $TRAIN_DATA \\\n  --dev_data dev.json \\\n  --test_data test.json \\\n  --train_batch_size 128 \\\n  --eval_batch_size 256 \\\n  --num_train_epochs 10 \\\n  --index_batch_size 512 \\\n  --learning_rate 3e-5 \\\n  --weight_decay 0.1 \\\n  --num_warmup_steps 0 \\\n  --n_hard_negative 0 \\\n  --top_k 100 \\\n  --max_buffer_size 1574824 \\\n  --do_predict\n```\n\nIf you include \"--do_predict\" argument, resulting outputs of the evaluation will be located in $OUTPUT_PATH. \u003cbr\u003e\n(WANRNING!) The indexing and retrieval for the inference take much time according to its whole passage collection size. \u003cbr\u003e\nIn other word, QReCC, which has about 50M passages, requires lots of time and memory consumptions. 
\u003cbr\u003e\nWe will add ANN search to increase the retrieval speed in the very soon.\n\n* index_(dev|test).faiss : FAISS index file. It requires 30-160GB of storage.\n* (dev|test)_eval_result.json: evaluation result including MRR and Recall@k\n* (dev|test)_eval_scores.json: top-k relevance scores for each query\n* (dev|test)_eval_incices.json: top-k indices for each query\n\nIn the case of QReCC, the overall result could be evaluated based on question types. \u003cbr\u003e\nFor this, please run the below script.\n\n`$ python3 eval_breakdown.py --result_data_file $OUTPUT_PATH/test_eval_scores.json --data_path $DATA_PATH`\n\n```\ntrec 371\n{'MRR': 0.32469379627601447, 'Recall@5': 0.4474393530997305, 'Recall@10': 0.555256064690027, 'Recall@20': 0.6531895777178797, 'Recall@100': 0.8045822102425876}\n\nquac 6396\n{'MRR': 0.5424993382858436, 'Recall@5': 0.6970137137211299, 'Recall@10': 0.7789605586634041, 'Recall@20': 0.8246712239580937, 'Recall@100': 0.8872222779071794}\n\nnq 1442\n{'MRR': 0.5268851536713797, 'Recall@5': 0.6412327583120463, 'Recall@10': 0.7337462549009012, 'Recall@20': 0.7976587435013233, 'Recall@100': 0.8827564813313773}\n\nno-switch 279\n{'MRR': 0.7206635174177008, 'Recall@5': 0.8315113500597372, 'Recall@10': 0.8806650736758264, 'Recall@20': 0.9111509358821187, 'Recall@100': 0.9414177618478694}\n\nswitch 573\n{'MRR': 0.3969942310926863, 'Recall@5': 0.52380653959188, 'Recall@10': 0.629972084343812, 'Recall@20': 0.7156843553963973, 'Recall@100': 0.8406293425403373}\n\nfirst 267\n{'MRR': 0.4818849831007729, 'Recall@5': 0.5770911360799, 'Recall@10': 0.7091136079900124, 'Recall@20': 0.7649812734082397, 'Recall@100': 0.8707865168539326}\n\nall 8209\n{'MRR': 0.5299129684113517, 'Recall@5': 0.6759358448588522, 'Recall@10': 0.7609080074038533, 'Recall@20': 0.8121761956265329, 'Recall@100': 0.8827029523174765}\n```\n\n\n### Mining Hard Negative\n\nIt outputs `$split_negs.json` and you should specify it for `$TRAIN_DATA` of retriever 
training.\n\n**Model-Negs**\n\nModel-based hard negative mining. It utilizes already finetuned vanilla DPR from the first stage. \u003cbr\u003e\n\n```\nexport DATA_PATH=preprocessed\nexport OUTPUT_PATH=$OUTPUT_PATH # checkpoint of already finetuned vanilla DPR\n\npython3 build_dense_negatives.py \\\n  --task $TASK\n  --data_path $DATA_PATH \\\n  --split train \\\n  --output_path $DATA_PATH \\\n  --model_name_or_path $OUTPUT_PATH \\\n  --index_batch_size 1024 \\\n  --top_k 100 \\\n  --iteration 1 \\\n```\n\nThe resulting file will be `$OUTPUT_PATH/train_negs.json`.\n\n**BM25-Negs**\n\n```\nexport DATA_PATH=dataset\nexport OUTPUT_PATH=preprocessed\n\npython3 build_bm25_negatives.py \\\n  --task $TASK \\\n  --split train \\\n  --read_by all \\\n  --raw_data_path $DATA_PATH \\\n  --preprocessed_data_path $OUTPUT_PATH \\\n  --pyserini_index_path $OUTPUT_PATH/$TASK/pyserini_index \\\n  --top_k 100\n```\n\nThe resulting file will be `$OUTPUT_PATH/$TASK/train_bm25_negs.json`.\n\n**CQR-Negs**\n\nFirst, preprocess the corresponding dataset by using `--use_rewrite_only` argument.\n\n```\nexport DATA_PATH=dataset\nexport OUTPUT_PATH=preprocessed\n\npython3 data_preprocessing.py \\\n  --task $TASK \\\n  --suffix rewrite \\\n  --data_path $DATA_PATH \\\n  --output_path $OUTPUT_PATH \\\n  --max_query_length 128 \\\n  --max_passage_length 384 \\\n  --pyserini_index_path $DATA_PATH/$TASK/pyserini_index \\\n  --use_rewrite_only\n```\n\nThen,\n\n```\nexport DATA_PATH=preprocessed\nexport OUTPUT_PATH=rewrite_negative\n\npython3 build_dense_negatives.py \\\n  --task $TASK\n  --data_path $DATA_PATH \\\n  --split train_rewrite \\\n  --output_path $DATA_PATH \\\n  --model_name_or_path $OUTPUT_PATH \\\n  --index_batch_size 1024 \\\n  --top_k 100 \\\n  --iteration 1 \\\n```\n\nThe resulting file will be `$OUTPUT_PATH/train_rewrite_negs.json`.\n\n## Playing with pretrained model from Huggingface Models\n\n```python\nimport json\nimport torch\nfrom transformers import DPRContextEncoder, 
DPRQuestionEncoder, AutoTokenizer\nfrom utils.conv_tokenizer import ConvTokenizer\n\nq_encoder = DPRQuestionEncoder.from_pretrained(\"dsksd/dpr-question_encoder-single-qrecc-model-base\")\nctx_encoder = DPRContextEncoder.from_pretrained(\"dsksd/dpr-ctx_encoder-single-qrecc-model-base\")\n\ntokenizer = AutoTokenizer.from_pretrained(\"dsksd/dpr-question_encoder-single-qrecc-model-base\")\nconv_tokenizer = ConvTokenizer(tokenizer)\n\nconversation = [\n    \"Who played the first game of the 2018 world cup?\",\n    \"Russia and Saudi played the opening match.\",\n    \"Which team won?\"\n]\n\npassages = json.load(open(\"assets/example_passages.json\", \"r\", encoding=\"utf-8\"))\n\nq_inputs = conv_tokenizer(\n      [conversation],\n      max_length=128,\n      padding=\"max_length\", # max_length or longest\n      truncation=True, # no other option here, (truncation from left-side)\n      retain_first_utter=True, # it retains first utterance when True\n      turn_delim_token=tokenizer.sep_token, # add delimiter token between utterance\n      return_tensors=\"pt\"\n)\n\nctx_inputs = tokenizer(passages, max_length=384, padding=\"max_length\", truncation=True, return_tensors=\"pt\")\n\nwith torch.no_grad():\n    q_vec = q_encoder(**q_inputs)\n    ctx_vec = ctx_encoder(**ctx_inputs)\n    \n    score = torch.matmul(ctx_vec[0], q_vec[0].transpose(0, 1)).squeeze()\n    _, idx = score.topk(3, 0)  # top-3\n\nfor i in idx:\n    print(passages[i])\n\n\u003e\u003e\u003e how russia beat saudi arabia in the world cup opener - ... 
**2018 russia comprehensively thrashed saudi arabia**, 5 - 0, ...\n``` --\u003e\n\n## Citation\n\n\n```bibtex\n@article{yoon2024ask,\n  title={Ask Optimal Questions: Aligning Large Language Models with Retriever's Preference in Conversational Search},\n  author={Yoon, Chanwoong and Kim, Gangwoo and Jeon, Byeongguk and Kim, Sungdong and Jo, Yohan and Kang, Jaewoo},\n  journal={arXiv preprint arXiv:2402.11827},\n  year={2024}\n}\n```\n\n## Contact\nFor more information or questions about our work, feel free to contact me (cwyoon99 (at) korea.ac.kr or gmail.com). ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Fretpo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmis-lab%2Fretpo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Fretpo/lists"}