{"id":31173275,"url":"https://github.com/kidist-amde/ddro","last_synced_at":"2025-09-19T12:47:43.199Z","repository":{"id":286691597,"uuid":"874188247","full_name":"kidist-amde/ddro","owner":"kidist-amde","description":"We introduce the direct document relevance optimization (DDRO) for training a pairwise ranker model. DDRO encourages the model to focus on document-level relevance during generation","archived":false,"fork":false,"pushed_at":"2025-08-25T15:31:43.000Z","size":2199,"stargazers_count":24,"open_issues_count":4,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-25T17:37:27.047Z","etag":null,"topics":["alignment","dense-retrieval","dpo","generative","generative-ai","generative-model","generative-retrieval","information-retrieval","ir","nlp","nlp-machine-learning","ranking-algorithm","ranking-system","rankings","retrieval","rlhf","semantic-id","semantic-ids"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kidist-amde.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-17T12:00:45.000Z","updated_at":"2025-08-25T15:31:46.000Z","dependencies_parsed_at":"2025-05-05T23:19:36.114Z","dependency_job_id":"3aa605a4-c509-49d3-a0b8-8a8433b5d2a2","html_url":"https://github.com/kidist-amde/ddro","commit_stats":null,"previous_names":["kidist-amde/ddro"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kidist-amde/ddro","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/kidist-amde%2Fddro","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kidist-amde%2Fddro/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kidist-amde%2Fddro/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kidist-amde%2Fddro/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kidist-amde","download_url":"https://codeload.github.com/kidist-amde/ddro/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kidist-amde%2Fddro/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275941510,"owners_count":25556975,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-19T02:00:09.700Z","response_time":108,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","dense-retrieval","dpo","generative","generative-ai","generative-model","generative-retrieval","information-retrieval","ir","nlp","nlp-machine-learning","ranking-algorithm","ranking-system","rankings","retrieval","rlhf","semantic-id","semantic-ids"],"created_at":"2025-09-19T12:47:38.569Z","updated_at":"2025-09-19T12:47:43.182Z","avatar_url":"https://github.com/kidist-amde.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DDRO: Direct Document Relevance Optimization for Generative Information 
Retrieval\n\n[![Paper](https://img.shields.io/badge/Paper-arXiv%3A2504.05181-b31b1b.svg)](https://arxiv.org/abs/2504.05181)\n[![License](https://img.shields.io/badge/license-Apache--2.0-green)](LICENSE)\n[![HuggingFace](https://img.shields.io/badge/HF-Datasets-blueviolet)](https://huggingface.co/kiyam)\n\nThis repository contains the official implementation of our SIGIR 2025 paper:  \n📄 **[Lightweight and Direct Document Relevance Optimization for Generative IR (DDRO)](https://arxiv.org/abs/2504.05181)**\n -  Optimizing Generative Retrieval with Ranking-Aligned Objectives \n---\n\n### 🚧 Repository Under Development\n\nThis repository is actively under development, so changes and improvements may land frequently. Thanks for your patience, and stay tuned for updates!\n\n---\n## 📑 Table of Contents\n\n- [Motivation](#motivation)\n- [What DDRO Does](#what-ddro-does)\n- [Learning Objectives](#learning-objectives-in-ddro)\n- [🛠️ Setup \u0026 Dependencies - Steps to Reproduce 🎯](#1-install-environment)\n- [Preprocessed Data \u0026 Model Checkpoints](#preprocessed-data--model-checkpoints)\n- [🔬 Evaluate Pre-trained Models from HuggingFace](#quick-evaluation)\n- [Citation](#citation)\n\n\n## Motivation\n\n**Misalignment in Learning Objectives:**  \nGen-IR models are typically trained via next-token prediction (cross-entropy loss) over docid tokens.  
\nWhile effective for language modeling, this objective:\n- 🎯 Optimizes **token-level generation**\n- ❌ Not designed for **document-level ranking**\n\nAs a result, Gen-IR models are not directly optimized for **learning-to-rank**, which is the core requirement in IR systems.\n\n\n\n## What DDRO Does\n\nIn this work, we ask:\n\n\u003e _How can Gen-IR models directly learn to rank documents, instead of just predicting the next token?_\n\nWe propose **DDRO**:  \n**Lightweight and Direct Document Relevance Optimization for Gen-IR**\n\n### ✅ Key Contributions:\n- Aligns training objective with ranking by using **pairwise preference learning**\n- Trains the model to **prefer relevant documents over non-relevant ones**\n- Bridges the gap between **autoregressive training** and **ranking-based optimization**\n- Requires **no reinforcement learning or reward modeling**\n\n---\n\u003cimg src=\"src/arc_images/DDRO.drawio.png\" alt=\"DDRO training pipeline overview\" width=\"800\"/\u003e\n\n### Learning Objectives in DDRO\n\nWe optimize DDRO in two phases:\n\n---\n\n#### 📘 Phase 1: Supervised Fine-Tuning (SFT)\n\nLearn to generate the correct **docid** sequence given a query by minimizing the autoregressive token-level cross-entropy loss:\n\u003c!-- \n$$\n\\mathcal{L}_{\\text{SFT}} = -\\sum \\log p_\\theta(\\text{docid}_i \\mid \\text{docid}_{\u003ci}, q)\n$$ --\u003e\n\n - \u003cimg src=\"src/arc_images/loss_ntp.png\" alt=\"DDRO Image\" width=\"300\"/\u003e\n\n\nMaximize the likelihood of generating the correct docid given a query:\n\n\u003c!-- $$\n\\boxed{\n\\max_{\\pi} \\,\\, \\mathbb{E}_{(q, \\text{docid}) \\sim \\mathcal{D}} \\left[\n\\log \\pi(\\text{docid} \\mid q)\n\\right]\n}\n$$ --\u003e\n\n - \u003cimg src=\"src/arc_images/objective_ntp.png\" alt=\"DDRO Image\" width=\"300\"/\u003e\n---\n\n#### 📗 Phase 2: Pairwise Ranking Optimization (DDRO Loss)\n\nThis phase improves the **ranking quality** of generated document identifiers by applying a **pairwise 
learning-to-rank objective** inspired by **Direct Preference Optimization (DPO)**.\n\n📄 *Rafailov et al., 2023 — [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)*\n\n\u003c!-- $$\n\\mathcal{L}_{\\text{DDRO}}(\\pi_\\theta; \\pi^{\\text{ref}}) = - \\mathbb{E}_{(q, \\text{docid}^+, \\text{docid}^-) \\sim \\mathcal{D}} \n\\left[\n\\log \\sigma \\left(\n\\beta \\log \\frac{\\pi_\\theta(\\text{docid}^+ \\mid q)}{\\pi^{\\text{ref}}(\\text{docid}^+ \\mid q)} -\n\\beta \\log \\frac{\\pi_\\theta(\\text{docid}^- \\mid q)}{\\pi^{\\text{ref}}(\\text{docid}^- \\mid q)}\n\\right)\n\\right]\n$$ --\u003e\n - \u003cimg src=\"src/arc_images/dpo_loss.png\" alt=\"DDRO Image\" width=\"600\"/\u003e\n\n### 📖 Description\n\nThis **Direct Document Relevance Optimization (DDRO)** loss guides the model to **prefer relevant documents (`docid⁺`) over non-relevant ones (`docid⁻`)** by comparing how both the current model and a frozen reference model score each document:\n\n* `docid⁺`: A relevant document for the query `q`\n* `docid⁻`: A non-relevant or less relevant document\n* $\\pi_\\theta$: The current model being optimized\n* $\\pi^{\\text{ref}}$: A frozen reference model (typically trained with SFT in Phase 1)\n* **β**: Temperature-like factor controlling sensitivity.\n* $\\sigma$: Sigmoid function, to map scores to \\[0,1] preference space\n\nEncourage the model to rank relevant docid⁺ higher than non-relevant docid⁻:\n\n\u003c!-- $$\n\\boxed{\n\\max_{\\pi} \\,\\, \\mathbb{E}_{(q, \\text{docid}^+, \\text{docid}^-) \\sim \\mathcal{D}} \\left[\n\\log \\sigma \\left(\n\\beta \\log \\frac{\\pi(\\text{docid}^+ \\mid q)}{\\pi_{\\text{ref}}(\\text{docid}^+ \\mid q)} -\n\\beta \\log \\frac{\\pi(\\text{docid}^- \\mid q)}{\\pi_{\\text{ref}}(\\text{docid}^- \\mid q)}\n\\right)\n\\right]\n}\n$$ --\u003e\n - \u003cimg src=\"src/arc_images/dpo_objective.png\" alt=\"DDRO Image\" width=\"500\"/\u003e\n\n### ✅ Usage\n\nThe DPO loss is used 
**after** the SFT phase to **fine-tune the ranking behavior** of the model. Instead of just generating `docid`, the model now **learns to rank `docid⁺` higher than `docid⁻`** in a relevance/preference-aligned manner.\n\n---\n\n### ✅ Why It Works\n\n- Directly **encourages higher generation scores for relevant documents**\n- Uses **contrastive ranking** rather than token-level generation\n- Avoids reward modeling or RL while remaining efficient and scalable\n\n\n---\n\n\n### 💡 Why DDRO is Different from Standard DPO\n\nWhile our optimization is inspired by the DPO framework [Rafailov et al., 2023](https://arxiv.org/abs/2305.18290), its adaptation to **Generative Document Retrieval** is **non-trivial**:\n\n- In contrast to open-ended preference alignment, our task involves **structured docid generation** under **beam decoding constraints**\n- Our model uses an **encoder-decoder** architecture rather than decoder-only\n- The objective is **document-level ranking**, not open-ended preference generation\n\nThis required **novel integration** of preference optimization into **retrieval-specific pipelines**, making DDRO uniquely suited for GenIR.\n\n## 📁 Project Structure\n\n```bash\nsrc/\n├── data/                # Data downloading, preprocessing, and docid instance generation\n├── pretrain/            # DDRO model training and evaluation logic (incl. 
ddro)\n├── scripts/             # Entry-point shell scripts for SFT, ddro, BM25, and preprocessing\n├── utils/               # Core utilities (tokenization, trie, metrics, trainers)\n├── ddro.yml             # Conda environment (for training DDRO)\n├── pyserini.yml         # Conda environment (for BM25 retrieval with Pyserini)\n├── README.md            # You're here!\n└── requirements.txt     # Additional Python dependencies\n```\n### 📌 Important\n  \u003c!-- - \u003ch5\u003e\u003cspan style=\"color:Yellow;\"\u003e➡️ Each subdirectory includes a detailed README.md with instructions.\u003c/span\u003e\u003c/h5\u003e --\u003e\n  \u003e 🔎 **Each subdirectory includes a detailed `README.md` with instructions.**\n\n---\n\n## 🛠️ Setup \u0026 Dependencies\n\n### 1. Install Environment\n\nClone the repository and create the conda environment:\n\n```bash\ngit clone https://github.com/kidist-amde/ddro.git\ncd ddro\nconda env create -f ddro_env.yml\nconda activate ddro_env\n```\n---\n\n### 2. Download Datasets and Pretrained Model\nWe use the MS MARCO document (top-300k) and Natural Questions (NQ-320k) datasets, along with a pretrained T5 model.\n\nTo download them, run the following commands from the project root (ddro/):\n\n   ```bash\n   bash   ./src/data/download/download_msmarco_datasets.sh\n   bash   ./src/data/download/download_nq_datasets.sh\n   python ./src/data/download/download_t5_model.py\n   ```\n📂 For details and download links, refer to: [src/data/download/README.md](https://github.com/kidist-amde/ddro/tree/main/src/data/download#readme)\n\n### 3. Data Preparation\nDDRO is evaluated on both the **Natural Questions (NQ)** and **MS MARCO** datasets. 
\n\n✅ Sample Top-300K MS MARCO Subset\nRun the following script to preprocess and extract the top-300K most relevant MS MARCO documents based on qrels:\n\n```bash\nbash scripts/preprocess/sample_top_docs.sh\n```\n- 📌 This will generate: resources/datasets/processed/msmarco-docs-sents.top.300k.json.gz\n(sentence-tokenized JSONL format, ranked by relevance frequency)\n---\n### Expected Directory Structure\nOnce everything is downloaded and processed, your resources/ directory should look like this:\n\n   ```\nresources/\n├── datasets/\n│   ├── raw/\n│   │   ├── msmarco-data/         # Raw MS MARCO dataset \n│   │   └── nq-data/              # Raw Natural Questions dataset\n│   └── processed/                # Preprocessed outputs\n└── transformer_models/\n         └── t5-base/                # Local copy of T5 model \u0026 tokenizer\n   ```\n---\n### 📌 Important\n  \u003c!-- - \u003ch5\u003e\n    \u003cspan style=\"color:pink;\"\u003e\n      ➡️ To process and sample both datasets, generate document IDs, and prepare training and evaluation instances, \n      please check the repo and the README below.\n    \u003c/span\u003e\n    - \u003c/h5\u003e\n      See: \u003ca href=\"https://github.com/kidist-amde/ddro/tree/main/src/data/dataprep#readme\"\u003e\n      \u003ccode\u003esrc/data/dataprep/README.md\u003c/code\u003e\n    \u003c/a\u003e --\u003e\n\n\u003e 🔎 To process and sample both datasets, generate document IDs, and prepare training/evaluation instances, please refer to the corresponding README:\n\n\u003e 🔗 [`src/data/dataprep/README.md`](https://github.com/kidist-amde/ddro/tree/main/src/data/data_prep#readme)\n---\n\n## Training Pipeline\n\n### 📘 Phase 1: Supervised Fine-Tuning (SFT)\n\nWe first train a **Supervised Fine-Tuning (SFT) model** using **next-token prediction** across three stages:\n\n1. **Pretraining** on document content (`doc → docid`)\n2. **Search Pretraining** on pseudo queries (`pseudoquery → docid`)\n3. 
**Finetuning** on real queries using supervised pairs from qrels (with gold docids) (`query → docid`)\n\nThis results in a **seed model** trained to autoregressively generate document identifiers.\n\nYou can run all stages with a single command:\n\n```bash\nbash ddro/src/scripts/sft/launch_SFT_training.sh\n```\n\n📍 The `--encoding` flag in the script supports docid formats such as `pq` and `url`.\n\n---\n\n### 🔧 Phase 2: DDRO Training (Pairwise Optimization)\n\nAfter training the SFT model (Phase 1), we apply **Phase 2: Direct Document Relevance Optimization**, which fine-tunes the model with a **pairwise ranking objective** that trains it to prefer relevant documents over non-relevant ones.\n\n\nThis bridges the gap between **autoregressive generation** and **ranking-based retrieval**.\n\nWe implement this using a custom version of Hugging Face's [`DPOTrainer`](https://github.com/huggingface/trl).\n\nRun DDRO training and evaluation:\n\n```bash\nbash scripts/ddro/slurm_submit_ddro_training.sh\nbash scripts/ddro/slurm_submit_ddro_eval.sh\n```\n\n---\n## Model Evaluation\n\n### 🔬 Evaluate Pre-trained Models from HuggingFace\n\nYou can directly evaluate our published models without training from scratch:\n\n#### Available Models:\n- `kiyam/ddro-msmarco-pq` - MS MARCO with PQ encoding\n- `kiyam/ddro-msmarco-tu` - MS MARCO with Title+URL encoding  \n- `kiyam/ddro-nq-pq` - Natural Questions with PQ encoding\n- `kiyam/ddro-nq-tu` - Natural Questions with Title+URL encoding\n\n#### Quick Evaluation:\n```bash\n# For SLURM clusters:\nsbatch src/pretrain/hf_eval/slurm_submit_hf_eval.sh\n\n# Or run directly:\nencoding=\"url_title\" # Choose from: \"url_title\", \"pq\"\n\npython src/pretrain/hf_eval/eval_hf_docid_ranking.py \\\n  --per_gpu_batch_size 4 \\\n  --log_path logs/msmarco/dpo_HF_url.log \\\n  --pretrain_model_path kiyam/ddro-msmarco-tu \\\n  --docid_path resources/datasets/processed/msmarco-data/encoded_docid/${encoding}_docid.txt \\\n  --test_file_path 
resources/datasets/processed/msmarco-data/eval_data/query_dev.${encoding}.jsonl \\\n  --dataset_script_dir src/data/data_scripts \\\n  --dataset_cache_dir ./cache \\\n  --num_beams 15 \\\n  --add_doc_num 6144 \\\n  --max_seq_length 64 \\\n  --max_docid_length 100 \\\n  --use_docid_rank True \\\n  --docid_format msmarco \\\n  --lookup_fallback True \\\n  --device cuda:0\n```\n\n#### Key Parameters:\n- `--encoding`: Use `\"url_title\"` or `\"pq\"` to match your model type\n- `--docid_format`: Use `\"msmarco\"` or `\"nq\"` depending on the dataset\n- `--pretrain_model_path`: Specify the HuggingFace model you want to evaluate\n\n#### Pre-generated Resources:\nYou can use our pre-generated encoded document IDs from [HuggingFace Datasets](https://huggingface.co/datasets/kiyam/ddro-docids) to skip the data preparation step.\n\n\n📂 Evaluation logs and metrics are saved to:\n```\nlogs/\noutputs/\n```\n---\n\n## 📚 Datasets Used\n\nWe evaluate DDRO on two standard retrieval benchmarks:\n\n- 📘 [MS MARCO Document Ranking](https://microsoft.github.io/msmarco/Datasets.html#document-ranking-dataset)\n- 📗 [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions)\n\n\n## Preprocessed Data \u0026 Model Checkpoints\n\nAll datasets, pseudo queries, docid encodings, and model checkpoints are available here:  \n🔗 [DDRO Generative IR Collection on Hugging Face 🤗](https://huggingface.co/collections/kiyam/ddro-generative-document-retrieval-680f63f2e9a72033598461c5)\n\n\n---\n\n## 🙏 Acknowledgments\n\nWe gratefully acknowledge the following open-source projects:\n\n- [ULTRON](https://github.com/smallporridge/WebUltron)\n- [HuggingFace TRL](https://github.com/huggingface/trl)\n- [NCI (Neural Corpus Indexer)](https://github.com/solidsea98/Neural-Corpus-Indexer-NCI)\n- [docTTTTTquery](https://github.com/castorini/docTTTTTquery)\n\n---\n\n## 📄 License\n\nThis project is licensed under the [Apache 2.0 License](LICENSE).\n\n---\n\n## 
Citation\n\n```bibtex\n@inproceedings{mekonnen2025lightweight,\n  title={Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval},\n  author={Mekonnen, Kidist Amde and Tang, Yubao and de Rijke, Maarten},\n  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},\n  pages={1327--1338},\n  year={2025}\n}\n```\n---\n\n## 📬 Contact\n\nFor questions, please open an [issue](https://github.com/kidist-amde/DDRO-Direct-Document-Relevance-Optimization/issues).\n\n\n\n© 2025 **Kidist Amde Mekonnen** · Made with ❤️ at [IRLab](https://irlab.science.uva.nl/), University of Amsterdam.\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkidist-amde%2Fddro","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkidist-amde%2Fddro","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkidist-amde%2Fddro/lists"}