{"id":27344453,"url":"https://github.com/yafuly/TPO","last_synced_at":"2025-04-12T17:06:25.455Z","repository":{"id":273462875,"uuid":"916402877","full_name":"yafuly/TPO","owner":"yafuly","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-26T13:12:46.000Z","size":4845,"stargazers_count":96,"open_issues_count":0,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-02-26T14:25:17.426Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yafuly.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-14T02:50:41.000Z","updated_at":"2025-02-26T13:12:49.000Z","dependencies_parsed_at":"2025-02-26T14:23:29.827Z","dependency_job_id":"15ca4613-bcd6-4648-886f-42ea71831dd3","html_url":"https://github.com/yafuly/TPO","commit_stats":null,"previous_names":["yafuly/tpo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yafuly%2FTPO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yafuly%2FTPO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yafuly%2FTPO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yafuly%2FTPO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yafuly","download_url":"https://codeload.github.com/yafuly/TPO/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github"
,"repositories_count":248602310,"owners_count":21131615,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-12T17:02:16.425Z","updated_at":"2025-04-12T17:06:25.448Z","avatar_url":"https://github.com/yafuly.png","language":"Jupyter Notebook","funding_links":[],"categories":["A01_文本生成_文本对话","Papers"],"sub_categories":["大语言对话模型及数据","2025"],"readme":"\n# Test-Time Preference Optimization (TPO)\n\n[![Llama-3.1-70B-SFT](https://img.shields.io/badge/Model-Llama--3.1--70B--SFT-green)](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-SFT) [![Llama-3.1-70B-Instruct](https://img.shields.io/badge/Model-Llama--3.1--70B--Instruct-green)](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) [![Mistral-Small-Instruct-2409](https://img.shields.io/badge/Model-Mistral--Small--Instruct--2409-green)](https://huggingface.co/mistralai/Mistral-Small-Instruct-2409) [![FsfairX-LLaMA3-RM-v0.1 ](https://img.shields.io/badge/Model-FsfairX--LLaMA3--RM--v0.1-green)](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) [![Llama-3.1-Tulu-3-8B-RM](https://img.shields.io/badge/Model-Llama--3.1--Tulu--3--8B--RM-green)](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM)\n[![AlpacaEval 2.0](https://img.shields.io/badge/Task-AlpacaEval_2.0-red)](https://huggingface.co/datasets/tatsu-lab/alpaca_eval) [![Arena-Hard](https://img.shields.io/badge/Task-ArenaHard-red)](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) [![HH-RLHF](https://img.shields.io/badge/Task-HH--RLHF-red)](https://huggingface.co/datasets/Anthropic/hh-rlhf) 
[![MATH-500](https://img.shields.io/badge/Task-MATH--500-red)](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) [![BeaverTails-Evaluation](https://img.shields.io/badge/Task-BeaverTails--Evaluation-red)](https://huggingface.co/datasets/PKU-Alignment/BeaverTails-Evaluation) [![XSTest](https://img.shields.io/badge/Task-XSTest-red)](https://huggingface.co/datasets/walledai/XSTest)\n\nThis repository contains the official code for the paper [Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback](https://arxiv.org/abs/2501.12895).\n\u003c!-- (https://arxiv.org/abs/XXX). --\u003e\n\n\n## 🔔 News\n\u003c!-- - **[20/01/2025]** Our paper is released on arXiv: https://arxiv.org/abs/XXX. --\u003e\n- **[23/01/2025]** Our paper is released at [https://arxiv.org/abs/2501.12895](https://arxiv.org/abs/2501.12895).\n- **[20/01/2025]** Our code is released! We are working on the paper and will release it very soon.\n\n\n## 👀 About TPO\n\nWe introduce Test-time Preference Optimization (TPO), a novel framework designed to align large language models (LLMs) with human preferences during inference without updating model parameters. TPO operates by translating reward signals into textual critiques and using these critiques as textual rewards to refine the model's responses iteratively, thereby enhancing alignment with human preferences.\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"images/method.png\" width=\"90%\"\u003e \u003cbr\u003e\u003c/p\u003e\n\n\n\u003c!-- For more details, you can check our paper [here](https://arxiv.org/abs/XXX). --\u003e\n\n## 📈 Performance\n\n**Benchmark Performance**\n\nOur evaluations demonstrate that TPO enhances alignment with human preferences across a range of tasks, including instruction following, preference alignment, safety, and mathematics.  Benchmark results reveal that both unaligned and aligned models experience significant improvements after just a few TPO iterations.  
Remarkably, after TPO, the unaligned `Llama-3.1-70B-SFT` model outperforms the well-aligned `Llama-3.1-70B-Instruct` model on nearly all benchmarks.\n\n| Model                                   | AlpacaEval 2 LC(%)| AlpacaEval 2 WR(%)| Arena-Hard | HH-RLHF | BeaverTails | XSTest  | MATH-500 |\n|-----------------------------------------|-------------------|-------------------|------------|---------|-------------|---------|----------|\n| LLaMA-3.1-70B-DPO                       | 32.3              | 23.1              | 50.4       | -2.8    | -6.7        | 89.8    | 63.4     |\n| LLaMA-3.1-70B-Instruct                  | 36.9              | 34.9              | 59.0       | -0.5    | -6.4        | 88.7    | 66.4     |\n| LLaMA-3.1-70B-SFT                       | 27.8              | 16.8              | 44.1       | -4.1    | -7.2        | 87.8    | 61.8     |\n| w/ TPO (D2-N5) †                        | 33.2              | 39.5              | 70.5       | 0.1     | **-4.1**    | 89.8    | 70.0     |\n| w/ TPO (D2-N5) *                        | 33.0              | 40.5              | 69.7       | -0.6    | -4.8        | **90.4**| 71.2     |\n| w/ TPO (D5-N20) *                       | **37.8**          | **55.7**          | **77.5**   | **0.4** | **-4.1**    | 89.6    | **71.8** |\n\n| Model                      | AlpacaEval 2 LC(%) | AlpacaEval 2 WR(%) | Arena-Hard | HH-RLHF | BeaverTails | XSTest  | MATH-500 |\n|----------------------------|--------------------|--------------------|------------|---------|-------------|---------|----------|\n| Llama-3.1-70B-Instruct     | 36.9               | 34.9               | 59.0       | -0.5    | -6.4        | 88.7    | 66.4     |\n| w/ TPO (D2-N5) *           | 39.1               | 48.5               | 69.5       | **1.3** | -3.6        | 89.6    | **71.6** |\n| Mistral-Small-Instruct-2409| 45.7               | 38.5               | 53.8       | -0.4    | -5.2        | 87.1    | 57.6     |\n| w/ TPO (D2-N5) *           | **53.4**           | 
**60.5**           | **72.2**   | 1.1     | **-3.4**    | **90.7**| 62.2     |\n\nThese tables highlight the performance gains of each model over its baseline after applying TPO. Here, `D` refers to the maximum number of iterations, and `N` refers to the number of samples. `*` denotes the models optimized with TPO using the reward model `FsfairX-LLaMA3-RM-v0.1`, while `†` denotes `Llama-3.1-Tulu-3-8B-RM`.\n\n**Test-time Training**\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"images/training.png\" width=\"100%\"\u003e \u003cbr\u003e\u003c/p\u003e\n\nThe figure shows that all models gradually align with the reward model during the TPO process. The colored lines represent models with test-time training, while the dashed lines represent those without. Additionally, we include a *revision* baseline, which iteratively refines the best cached response without considering rejected ones, thereby ignoring preference signals that indicate which responses are good or bad.\n\n## ⚙️ Environment Setup\nFollow the steps below to set up your environment:\n\n1. **Create a Virtual Environment:**\n\n   ```bash\n   conda create -n tpo python=3.10\n   conda activate tpo\n   ```\n\n2. **Download and Install Dependencies:**\n   ```bash\n   git clone https://github.com/yafuly/TPO.git\n   cd TPO\n   pip install -r requirements.txt\n   ```\n\n3. **Install TextGrad:**\n   ```bash\n   cd textgrad-main\n   pip install -e .\n   cd ..\n   ```\n\n## 💬 TPO Setup\n\nBy default, the TPO framework runs in a single-machine, single-GPU environment. The **vLLM server** is deployed using 4 GPUs in a tensor-parallel setup, and 1 GPU is utilized for generating the responses. The flow is as follows:\n\n1. **Set Up vLLM Server**\n\n   This server hosts the model that will be optimized with TPO. 
To deploy the vLLM server, use the following command:\n   ```bash\n   vllm serve allenai/Llama-3.1-Tulu-3-70B-SFT --dtype auto --api-key token-abc123 --tensor-parallel-size 4 --port 8000\n   ```\n\n   For more information or additional configurations on starting the vLLM server, please refer to the official vLLM [documentation](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server).\n\n\n2. **Start TPO**\n\n   Run the following command to execute the TPO script, which runs a reward model to interact with the policy model deployed as the vLLM server:\n\n   ```bash\n   python run.py \\\n      --data_path data/sample.json \\\n      --ip $IP \\\n      --port 8000 \\\n      --server_model server-allenai/Llama-3.1-Tulu-3-70B-SFT \\\n      --reward_model sfairXC/FsfairX-LLaMA3-RM-v0.1 \\\n      --tpo_mode tpo \\\n      --max_tokens_response 2048 \\\n      --max_tokens_all 8192 \\\n      --sample_size 5 \\\n      --seed 7 \\\n      --max_iterations 2 \\\n      --num_threads 4\n   ```\n\n   Main parameters:\n   - `data_path`: Path to the data file (JSON). Refer to `data/sample.json` for more details.\n   - `ip`: Server IP address of the vLLM server, e.g., `localhost` or `127.0.0.1`.\n   - `port`: Port number for the vLLM server, e.g., `8000`.\n   - `server_model`: Base model used for serving via an API, e.g., `server-allenai/Llama-3.1-Tulu-3-70B-SFT` or `server-/mnt/models/reward_model/Llama-3.1-Tulu-3-70B-SFT`.\n   - `reward_model`: Identifier or path for the reward model, e.g., `sfairXC/FsfairX-LLaMA3-RM-v0.1` or `/mnt/models/reward_model`.\n   - `sample_size`: Number of responses to sample for each step (default: 5).\n   - `max_iterations`: Max number of test-time optimization iterations (default: 5).\n   - `num_threads`: Number of threads to use for generation. Increasing the `num_threads` can lead to faster generation by utilizing multiple processing cores simultaneously, thus **improving efficiency**. 
Set to 1 for limited computational resources.\n\n   For more parameters, please refer to the `run.py` file.\n\n   Upon running the script, log files will be generated in the `logs/` directory, stored in JSON format for easy parsing and analysis. Each iteration of the TPO optimization process captures four key items, all directly related to the large model: \n    - Input and output during the textual loss calculation, comparing the chosen and rejected responses.\n    - Input used to generate gradients.\n    - Output as the textual gradient.\n    - Iterative Optimization input, used for the next round of response generation. This structure allows for detailed tracking of the optimization process at each iteration.\n\n**Multi-Machine, Multi-GPU Setup**: If deploying the vLLM server on multiple machines with multiple GPUs, ensure you obtain the IP address of the vLLM server and use it in the `--ip` parameter. This allows the script to generate responses by querying the vLLM server running on a different machine. Ensure that both machines are connected to the same network and the server is accessible via the specified IP.\n\n\n## 📝 Citation\n\n```\n@misc{li2025testtimepreferenceoptimizationonthefly,\n      title={Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback}, \n      author={Yafu Li and Xuyang Hu and Xiaoye Qu and Linjie Li and Yu Cheng},\n      year={2025},\n      eprint={2501.12895},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2501.12895}, \n}\n```\n\n## 🌹 Acknowledgements\n\nThis project draws inspiration and support from several existing works:\n\n1. [TextGrad](https://github.com/zou-group/textgrad): We develop TPO atop the TextGrad framework, leveraging its ability to implement textual feedback.\n\n2. [vLLM](https://github.com/vllm-project/vllm): Our generation pipeline is built on the vLLM infrastructure.\n\n3. 
[RLHFlow](https://github.com/RLHFlow/RLHF-Reward-Modeling): We incorporate an off-the-shelf reward model provided by RLHFlow.\n\n4. [open-instruct](https://github.com/allenai/open-instruct): We adopt the SFT baseline from open-instruct.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyafuly%2FTPO","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyafuly%2FTPO","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyafuly%2FTPO/lists"}