{"id":27984407,"url":"https://github.com/FacebookResearch/sweet_rl","last_synced_at":"2025-05-08T05:01:57.841Z","repository":{"id":283410457,"uuid":"950247407","full_name":"facebookresearch/sweet_rl","owner":"facebookresearch","description":"Benchmark and research code for the paper SWEET-RL Training Multi-Turn LLM Agents onCollaborative Reasoning Tasks","archived":false,"fork":false,"pushed_at":"2025-05-05T18:35:45.000Z","size":4927,"stargazers_count":187,"open_issues_count":4,"forks_count":9,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-05-05T19:27:06.901Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-17T21:34:38.000Z","updated_at":"2025-05-05T18:35:48.000Z","dependencies_parsed_at":"2025-04-13T01:22:43.980Z","dependency_job_id":"fa3df973-c768-45d6-a39e-6081c721589b","html_url":"https://github.com/facebookresearch/sweet_rl","commit_stats":null,"previous_names":["facebookresearch/sweet_rl"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fsweet_rl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fsweet_rl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fsweet_rl/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fsweet_rl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/sweet_rl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253002856,"owners_count":21838640,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-08T05:01:52.891Z","updated_at":"2025-05-08T05:01:57.815Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks\r\n\r\nOfficial implementation for Collaborative Agent Bench and SWEET-RL.\r\n\r\n\u003cp align=\"center\"\u003e\r\n| \u003ca href=\"https://arxiv.org/abs/2503.15478\"\u003e\u003cb\u003ePaper\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://huggingface.co/datasets/facebook/collaborative_agent_bench\"\u003e\u003cb\u003eData\u003c/b\u003e\u003c/a\u003e |\r\n\u003c/p\u003e\r\n\r\n---\r\n\r\n[Yifei Zhou](https://yifeizhou02.github.io/), [Song Jiang](https://songjiang0909.github.io/), [Yuandong Tian](https://yuandong-tian.com/), [Jason Weston](https://ai.meta.com/people/1163645124801199/jason-weston/), [Sergey Levine](https://people.eecs.berkeley.edu/~svlevine/), [Sainbayar Sukhbaatar*](https://tesatory.github.io/), [Xian Li*](https://ai.meta.com/people/1804676186610787/xian-li/)\r\n\u003cbr\u003e\r\nUC Berkeley, 
FAIR\r\n\u003cbr\u003e\r\n*Equal advising\r\n![paper_teaser](paper_teaser.png)\r\n\r\n## Abstract\r\nLarge language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.\r\n\r\n\r\n## Collaborative Agent Bench\r\n### Quick Start\r\nTo set up the environment for Collaborative Agent Bench, run:\r\n```bash\r\npip install -e .\r\ngit clone https://github.com/YifeiZhou02/collab_openrlhf\r\ncd collab_openrlhf\r\npip install -e .\r\n```\r\nThis sets up the environment for Backend Programming. It uses a custom fork of openrlhf to support multi-turn DPO and length normalization. \r\nOptionally, if you also wish to run Frontend Design, you need to install GeckoDriver and Firefox on your system (e.g. from https://www.mozilla.org/en-US/firefox/all/desktop-release/ and with the commands below). 
\r\n```bash\r\nwget https://github.com/mozilla/geckodriver/releases/download/v0.35.0/geckodriver-v0.35.0-linux64.tar.gz\r\ntar -xvzf geckodriver-v0.35.0-linux64.tar.gz\r\nsudo mv geckodriver /usr/local/bin/\r\n```\r\nTo verify the installation, run:\r\n```bash\r\ngeckodriver --version\r\n```\r\n\r\n\r\nNote that Firefox and GeckoDriver can also be installed without sudo access by adding the paths to the applications to your ```$PATH``` variable.\r\n\r\nTo download data, run:\r\n```bash\r\nhuggingface-cli download --repo-type dataset facebook/collaborative_agent_bench backend_tasks/train.jsonl backend_tasks/test.jsonl colbench_code_offline_15k_llama8b.jsonl\r\n```\r\n\r\n### Testing Your Model on CollaborativeAgentBench\r\n#### Backend Programming\r\n\r\nFor testing on Backend Programming, you first need to set up a vLLM server to simulate the human collaborator. To do that, run:\r\n```bash\r\npython -m vllm.entrypoints.openai.api_server --model /path/to/llama3.1-70b-instruct --max-model-len 16384 --tensor-parallel-size 8 --gpu-memory-utilization=0.85 --max-num-seqs 16 --port 8000 --enforce-eager --trust-remote-code \r\n```\r\nFeel free to use llama3.1-8b-instruct as the simulator for the human collaborator to reduce GPU memory usage, but the results may differ from those reported in the paper.\r\n\r\nAfter setting up the vLLM server for the human collaborator, you can test your model. For coding, run:\r\n```bash\r\npython scripts/simulate_interactions.py --agent_model /path/to/Llama-3.1-8B-Instruct \\\r\n    --hostname xxx or localhost \\\r\n    --task_type code \\\r\n    --num_tasks 1000 \\\r\n    --input_path /path/to/backend_tasks/test.jsonl \\\r\n    --output_path /path/for/output/temp_test.jsonl \\\r\n    --env_model /path/to/llama3.1-70b-instruct\r\npython scripts/evaluate_code.py /path/for/output/temp_test.jsonl\r\n```\r\nThe success rate and the percentage of tests passed will be printed at the end. 
Note that LLM-generated code sometimes contains print statements, so part of the output might be flooded with those messages.\r\n\u003cbr\u003e\r\nWe also offer a script to visualize the trajectories. Run:\r\n```bash\r\npython visualizers/visualize_dialogue_histories.py /path/for/output/temp_test.jsonl\r\n```\r\n#### Frontend Design\r\nYou can run the following script to download data from WebSight:\r\n```python\r\nimport json\r\n\r\nfrom datasets import load_dataset\r\nfrom tqdm import tqdm\r\n\r\nfrom sweet_rl.utils.webpage_utils import replace_urls\r\n\r\ntrain_tasks_path = \"/your/data/path/frontend_tasks/train.jsonl\"\r\ntest_tasks_path = \"/your/data/path/frontend_tasks/test.jsonl\"\r\n\r\nds = load_dataset(\"HuggingFaceM4/WebSight\", \"v0.2\")[\"train\"]\r\n\r\n# Take the first 20000 WebSight examples as tasks\r\nfiltered_data = []\r\nfor i in tqdm(range(20000)):\r\n    filtered_data.append({\r\n        \"problem_description\": ds[i][\"llm_generated_idea\"],\r\n        \"ground_truth\": replace_urls(ds[i][\"text\"]),\r\n    })\r\n\r\n# First 10000 tasks go to the train split, the remaining 10000 to the test split\r\nwith open(train_tasks_path, \"w\") as f:\r\n    for d in filtered_data[:10000]:\r\n        f.write(json.dumps(d) + \"\\n\")\r\n\r\nwith open(test_tasks_path, \"w\") as f:\r\n    for d in filtered_data[10000:]:\r\n        f.write(json.dumps(d) + \"\\n\")\r\n\r\n```\r\n\r\nFor testing on Frontend Design, you first need to set up a vLLM server to simulate the human collaborator. 
To do that, run:\r\n```bash\r\npython -m vllm.entrypoints.openai.api_server --model /path/to/Qwen2-VL-72B-Instruct --max-model-len 16384 --tensor-parallel-size 8 --gpu-memory-utilization=0.85 --max-num-seqs 16 --port 8000 --enforce-eager --limit-mm-per-prompt image=2 --trust-remote-code \r\n```\r\nFeel free to use Qwen2-VL-7B-Instruct as the simulator for the human collaborator to reduce GPU memory usage, but the results may differ from those reported in the paper.\r\n\r\n\r\nAfter setting up the vLLM server for the human collaborator, you can test your model for Frontend Design. Run:\r\n```bash\r\n# Set --num_tasks to 100 for fast tests or 500 for paper results\r\npython scripts/simulate_interactions.py --agent_model /path/to/Llama-3.1-8B-Instruct \\\r\n    --task_type html \\\r\n    --num_tasks 100 \\\r\n    --hostname xxx or localhost \\\r\n    --output_path /path/for/output/temp_test_html.jsonl \\\r\n    --input_path /path/to/webpage_tasks_all.jsonl \\\r\n    --env_model /path/to/Qwen2-VL-72B-Instruct\r\npython scripts/evaluate_html.py /path/for/output/temp_test_html.jsonl \r\n```\r\n\r\nThe average cosine similarity will be printed at the end. We also offer a script to visualize the trajectories. Run:\r\n```bash\r\npython visualizers/visualize_design_dialogue_histories.py /path/for/output/temp_test_html.jsonl\r\n```\r\n\r\n## SWEET-RL (**S**tep-**W**is**E** **E**valuation w/ Training-time information)\r\nWe now provide an example script for running SWEET-RL on Backend Programming. 
This part assumes that you have set up the environment for Backend Programming.\r\nFirst, set up the paths for loading data and saving intermediate results.\r\n```bash\r\nDATA_PATH=/xxx/colbench_code_offline_15k_llama8b.jsonl\r\n\r\nOUTPUT_DIR=/xxx/collab_llm/outputs\r\nCHECKPOINT_DIR=/xxx/collab_llm/checkpoints\r\n```\r\nThe intermediate data and checkpoints will be saved to:\r\n```bash\r\nGROUND_TRUTH_PREFERENCES_PATH=$OUTPUT_DIR/temp_ground_truth_preferences.jsonl\r\nREWARD_PATH=$CHECKPOINT_DIR/temp_rm\r\nSAMPLED_PATH=$OUTPUT_DIR/temp_sampled.jsonl\r\nRANKED_PATH=$OUTPUT_DIR/temp_ranked.jsonl\r\nRANDOM_PAIRS_PATH=$OUTPUT_DIR/temp_random_pairs.jsonl\r\nSAVE_PATH=$CHECKPOINT_DIR/temp_dpo\r\nEVALUATION_PATH=$OUTPUT_DIR/temp_evaluation.jsonl\r\n```\r\nWe will first train a step-level reward model:\r\n```bash\r\n# first train the step-level reward model with additional training-time information\r\npython scripts/evaluate_code.py $DATA_PATH --k 3 --ground_truth_preference_path $GROUND_TRUTH_PREFERENCES_PATH\r\n\r\ndeepspeed --module openrlhf.cli.train_dpo \\\r\n   --save_path $REWARD_PATH \\\r\n   --save_steps -1 \\\r\n   --logging_steps 1 \\\r\n   --eval_steps -1 \\\r\n   --train_batch_size 8 \\\r\n   --micro_train_batch_size 1 \\\r\n   --pretrain /PATH/TO/8BLLAMA \\\r\n   --bf16 \\\r\n   --max_epochs 4 \\\r\n   --max_len 8192 \\\r\n   --zero_stage 3 \\\r\n   --learning_rate 2e-7 \\\r\n   --beta 0.1 \\\r\n   --dataset $GROUND_TRUTH_PREFERENCES_PATH \\\r\n   --chosen_key chosen \\\r\n   --rejected_key rejected \\\r\n   --flash_attn \\\r\n   --gradient_checkpointing \\\r\n   --use_wandb WANDB_KEY \\\r\n   --response_template \"\u003c|start_header_id|\u003eassistant\u003c|end_header_id|\u003e\" \\\r\n   --wandb_run_name sweet_code_rm \\\r\n   --mean_log_prob\r\n```\r\nAfter that, we can use this step-level reward model to generate step-level preference pairs:\r\n```bash\r\n# These commands generate preference pairs using the step-level reward model\r\npython 
scripts/sample_best_of_n.py $DATA_PATH $SAMPLED_PATH --data_fraction 0.1\r\n\r\n\r\npython scripts/rank_best_of_n.py --model_id $REWARD_PATH \\\r\n    --input_path  $SAMPLED_PATH \\\r\n    --output_path $RANKED_PATH \r\n\r\n\r\npython scripts/generate_random_pairs_from_ranks.py $RANKED_PATH $RANDOM_PAIRS_PATH --no_prompt --num_pairs 4\r\n```\r\nFinally, we can train the model and perform evaluations:\r\n```bash\r\n# Train the model with step-level preference pairs\r\ndeepspeed --module openrlhf.cli.train_dpo \\\r\n   --save_path $SAVE_PATH \\\r\n   --save_steps -1 \\\r\n   --logging_steps 1 \\\r\n   --eval_steps  -1 \\\r\n   --train_batch_size 8 \\\r\n   --micro_train_batch_size 1 \\\r\n   --pretrain /PATH/TO/Meta-Llama-3.1-8B-Instruct \\\r\n   --bf16 \\\r\n   --max_epochs 1 \\\r\n   --max_len 16384 \\\r\n   --zero_stage 3 \\\r\n   --learning_rate 2e-7 \\\r\n   --beta 0.1 \\\r\n   --dataset $RANDOM_PAIRS_PATH \\\r\n   --chosen_key chosen \\\r\n   --rejected_key rejected \\\r\n   --flash_attn \\\r\n   --gradient_checkpointing \\\r\n   --nll_loss_coef 0.01 \\\r\n   --use_wandb WANDB_KEY \\\r\n   --wandb_run_name sweet_code_8b\r\n\r\n\r\n\r\n# carry out evaluations\r\npython scripts/simulate_interactions.py --agent_model $SAVE_PATH \\\r\n    --hostname host-of-human-simulator \\\r\n    --input_path /path/to/backend_tasks/test.jsonl \\\r\n    --task_type code \\\r\n    --num_tasks 1000  --output_path $EVALUATION_PATH\r\n\r\npython scripts/evaluate_code.py $EVALUATION_PATH\r\n```\r\nYou should see results similar to those reported in the paper, with a success rate of around 40\\%.\r\n\r\n### Data on Frontend Design\r\nThe same script can be used to generate the offline data for Frontend Design yourself:\r\n```bash\r\npython scripts/simulate_interactions.py --agent_model /path/to/Llama-3.1-8B-Instruct \\\r\n    --task_type html \\\r\n    --num_tasks 1000 \\\r\n    --best_of_n 6 \\\r\n    --train \\\r\n    --hostname xxx or localhost \\\r\n    
--output_path /path/for/output/temp_test_html.jsonl \\\r\n    --input_path /your/data/path/frontend_tasks/train.jsonl \\\r\n    --env_model /path/to/Qwen2-VL-72B-Instruct \\\r\n    --to_continue\r\n```\r\n\r\n## License\r\nSWEET-RL is CC-BY-NC licensed, as found in the LICENSE file.\r\n\r\n## Citation\r\nIf you find our benchmark or algorithm useful, please consider citing:\r\n```bibtex\r\n@misc{zhou2025sweetrltrainingmultiturnllm,\r\n      title={SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks}, \r\n      author={Yifei Zhou and Song Jiang and Yuandong Tian and Jason Weston and Sergey Levine and Sainbayar Sukhbaatar and Xian Li},\r\n      year={2025},\r\n      eprint={2503.15478},\r\n      archivePrefix={arXiv},\r\n      primaryClass={cs.LG},\r\n      url={https://arxiv.org/abs/2503.15478}, \r\n}\r\n```\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFacebookResearch%2Fsweet_rl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFacebookResearch%2Fsweet_rl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFacebookResearch%2Fsweet_rl/lists"}