{"id":13393603,"url":"https://github.com/CarperAI/trlx","last_synced_at":"2025-03-13T19:31:42.282Z","repository":{"id":60850823,"uuid":"545104023","full_name":"CarperAI/trlx","owner":"CarperAI","description":"A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)","archived":false,"fork":false,"pushed_at":"2024-01-08T20:07:19.000Z","size":47806,"stargazers_count":4592,"open_issues_count":98,"forks_count":480,"subscribers_count":50,"default_branch":"main","last_synced_at":"2025-03-07T20:18:22.684Z","etag":null,"topics":["machine-learning","pytorch","reinforcement-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CarperAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-03T19:42:40.000Z","updated_at":"2025-03-06T11:12:38.000Z","dependencies_parsed_at":"2023-09-23T03:52:25.237Z","dependency_job_id":"abf18ced-7a72-45e2-9b87-7c70c2e978ee","html_url":"https://github.com/CarperAI/trlx","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CarperAI%2Ftrlx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CarperAI%2Ftrlx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CarperAI%2Ftrlx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CarperAI%2Ftrlx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CarperAI","download_url":"https://codeload.github.com/CarperAI/trlx/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243469186,"owners_count":20295704,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","pytorch","reinforcement-learning"],"created_at":"2024-07-30T17:00:56.778Z","updated_at":"2025-03-13T19:31:41.600Z","avatar_url":"https://github.com/CarperAI.png","language":"Python","readme":"[![EMNLP 
[![EMNLP Paper](https://img.shields.io/badge/EMNLP_Paper-grey.svg?style=flat)](https://aclanthology.org/2023.emnlp-main.530/) [![DOI](https://zenodo.org/badge/545104023.svg)](https://zenodo.org/badge/latestdoi/545104023) [![License](https://img.shields.io/github/license/CarperAI/trlx)](LICENSE)

# Transformer Reinforcement Learning X

trlX is a distributed training framework designed from the ground up for fine-tuning large language models with reinforcement learning, using either a provided reward function or a reward-labeled dataset.

Training support for 🤗 Hugging Face models is provided by [Accelerate](https://huggingface.co/docs/accelerate/)-backed trainers, allowing users to fine-tune causal and T5-based language models of up to 20B parameters, such as `facebook/opt-6.7b`, `EleutherAI/gpt-neox-20b`, and `google/flan-t5-xxl`. For models beyond 20B parameters, trlX provides [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)-backed trainers that leverage efficient parallelism techniques to scale effectively.

The following RL algorithms are currently implemented:

| Algorithm                                                                     | Accelerate Trainer | NeMo Trainer |
|-------------------------------------------------------------------------------|:------------------:|:------------:|
| [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1909.08593.pdf)    | ✅                 | ✅           |
| [Implicit Language Q-Learning (ILQL)](https://sea-snell.github.io/ILQL_site/) | ✅                 | ✅           |

📖 **[Documentation](https://trlX.readthedocs.io)**

🧀 **[CHEESE](https://github.com/carperai/cheese)**: Collect human annotations for your RL application with our human-in-the-loop data collection library.

## Installation

```bash
git clone https://github.com/CarperAI/trlx.git
cd trlx
pip install torch --extra-index-url https://download.pytorch.org/whl/cu118
pip install -e .
```

## Examples

For more usage see [examples](./examples).
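For a quick end-to-end smoke test of your installation, a minimal script along the following lines should suffice. This is only a sketch: the length-based toy reward and repeated prompts are illustrative placeholders, not taken from the bundled examples.

```python
# Minimal smoke-test sketch: fine-tune gpt2 with a toy reward.
# The reward function and prompts are illustrative placeholders.
import trlx

trainer = trlx.train(
    "gpt2",
    # Reward longer samples; replace with a real reward for actual use.
    reward_fn=lambda samples, **kwargs: [float(len(s)) for s in samples],
    prompts=["Tell me a story:", "Describe your day:"] * 32,
)
```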
You can also try the Colab notebooks below:

| Description | Link |
| ----------- | ----------- |
| Simulacra (GPT2, ILQL) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CarperAI/trlx/blob/main/examples/notebooks/trlx_simulacra.ipynb) |
| Sentiment (GPT2, ILQL) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CarperAI/trlx/blob/main/examples/notebooks/trlx_sentiments.ipynb) |

The latest runs of the examples are tracked on our [Weights & Biases](https://wandb.ai/sorry/trlx-references/reportlist) reports page.

## How to Train

You can train a model using a reward function or a reward-labeled dataset.

#### Using a reward function

```python
import trlx

trainer = trlx.train('gpt2', reward_fn=lambda samples, **kwargs: [sample.count('cats') for sample in samples])
```

For **reward model** training, refer to our [autocrit](https://github.com/CarperAI/autocrit) library.

#### Using a reward-labeled dataset

```python
trainer = trlx.train('EleutherAI/gpt-j-6B', samples=['dolphins', 'geese'], rewards=[1.0, 100.0])
```

#### Using a prompt-completion dataset

```python
trainer = trlx.train('gpt2', samples=[['Question: 1 + 2 Answer:', '3'], ['Question: Solve this equation: ∀n>0, s=2, sum(n ** -s). Answer:', '(pi ** 2)/ 6']])
```

#### Trainers provide a wrapper over their underlying model

```python
trainer.generate(**tokenizer('Q: Who rules the world? A:', return_tensors='pt'), do_sample=True)
```

#### Configure Hyperparameters

```python
from trlx.data.default_configs import default_ppo_config

config = default_ppo_config()
config.model.model_path = 'EleutherAI/gpt-neox-20b'
config.tokenizer.tokenizer_path = 'EleutherAI/gpt-neox-20b'
config.train.seq_length = 2048

trainer = trlx.train(config=config, reward_fn=lambda samples, **kwargs: [len(sample) for sample in samples])
```

To reduce memory usage (if you're experiencing CUDA out-of-memory errors), first try the lowest setting for each of the following hyperparameters and gradually increase them:

```python
# micro batch size per GPU
config.train.batch_size = 1
# freeze all transformer layers
config.model.num_layers_unfrozen = 0
# maximum sample length; longer prompts or samples will be truncated
config.train.seq_length = 128

# micro batch size for sampling (specific to PPO)
config.method.chunk_size = 1
# use an additional Q-head (specific to ILQL)
config.method.two_qs = False
```
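Putting these knobs together, a memory-constrained run might look like the sketch below. The specific values are illustrative assumptions to tune for your hardware, not recommended settings.

```python
# Sketch: a memory-lean PPO run for a single small GPU.
# The values below are illustrative assumptions; tune them for your setup.
import trlx
from trlx.data.default_configs import default_ppo_config

config = default_ppo_config()
config.train.batch_size = 1           # micro batch size per GPU
config.model.num_layers_unfrozen = 0  # freeze all transformer layers
config.train.seq_length = 128         # truncate longer prompts/samples
config.method.chunk_size = 1          # micro batch size for PPO sampling

trainer = trlx.train(
    config=config,
    reward_fn=lambda samples, **kwargs: [len(sample) for sample in samples],
)
```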
#### Save the resulting model as a Hugging Face pretrained model (ready to upload to the Hub!)

```python
trainer.save_pretrained('/path/to/output/folder/')
```

#### Use 🤗 Accelerate to launch distributed training

```bash
accelerate config # choose the DeepSpeed option
accelerate launch examples/simulacra.py
```

#### Use NeMo-Megatron to launch distributed training

Follow the setup instructions in the [NeMo README](./trlx/models/).

```bash
python examples/nemo_ilql_sentiments.py
```

For more usage see the [NeMo README](./trlx/models).

#### Use Ray Tune to launch a hyperparameter sweep

```bash
ray start --head --port=6379
python -m trlx.sweep --config configs/sweeps/ppo_sweep.yml --accelerate_config configs/accelerate/ddp.yaml --num_gpus 4 examples/ppo_sentiments.py
```

#### Benchmark your trlX fork against trlX's `main` branch

```bash
python -m trlx.reference octocat/trlx-fork:fix-branch
```

## Logging

trlX uses the standard Python `logging` library to log training information to the console. The default logger is set to the `INFO` level, which means that `INFO`, `WARNING`, `ERROR`, and `CRITICAL` messages are printed to standard output.

To change the log level directly, use the verbosity setter. For example, to set the log level to `WARNING`:

```python
import trlx

trlx.logging.set_verbosity(trlx.logging.WARNING)
```

This suppresses `INFO` messages but still prints `WARNING`, `ERROR`, and `CRITICAL` messages.

You can also control logging verbosity by setting the `TRLX_VERBOSITY` environment variable to one of the standard logging [level names](https://docs.python.org/3/library/logging.html#logging-levels):

- `CRITICAL` (`trlx.logging.CRITICAL`)
- `ERROR` (`trlx.logging.ERROR`)
- `WARNING` (`trlx.logging.WARNING`)
- `INFO` (`trlx.logging.INFO`)
- `DEBUG` (`trlx.logging.DEBUG`)

```sh
export TRLX_VERBOSITY=WARNING
```

By default, [`tqdm`](https://tqdm.github.io/docs/tqdm/) progress bars are used to display training progress. You can disable them by calling `trlx.logging.disable_progress_bar()` and re-enable them with `trlx.logging.enable_progress_bar()`.

Messages can be formatted with greater detail by calling `trlx.logging.enable_explicit_format()`. This injects call-site information into each log record, which may be helpful for debugging:

```sh
[2023-01-01 05:00:00,000] [INFO] [ppo_orchestrator.py:63:make_experience] [RANK 0] Message...
```
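Taken together, a training script might configure these logging controls up front. The sketch below uses only the calls documented above:

```python
# Configure trlX logging once at the top of a training script.
import trlx

# Print only warnings and above.
trlx.logging.set_verbosity(trlx.logging.WARNING)

# Disable tqdm progress bars, e.g. when redirecting output to a file.
trlx.logging.disable_progress_bar()

# Include call-site information ([file.py:line:function]) in each record.
trlx.logging.enable_explicit_format()
```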
> 💡 Tip: To reduce the amount of logging output, you might find it helpful to lower the log levels of third-party libraries used by trlX. For example, try adding `transformers.logging.set_verbosity_error()` to the top of your trlX scripts to silence verbose messages from the `transformers` library (see their [logging docs](https://huggingface.co/docs/transformers/main_classes/logging#logging) for more details).

## Contributing

For development, check out these [guidelines](./CONTRIBUTING.md) and read our [docs](https://trlX.readthedocs.io).

## Citing trlX

```
@inproceedings{havrilla-etal-2023-trlx,
    title = "trl{X}: A Framework for Large Scale Reinforcement Learning from Human Feedback",
    author = "Havrilla, Alexander  and
      Zhuravinskyi, Maksym  and
      Phung, Duy  and
      Tiwari, Aman  and
      Tow, Jonathan  and
      Biderman, Stella  and
      Anthony, Quentin  and
      Castricato, Louis",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.530",
    doi = "10.18653/v1/2023.emnlp-main.530",
    pages = "8578--8595",
}
```

## Acknowledgements

Many thanks to Leandro von Werra for [trl](https://github.com/lvwerra/trl/), the library that initially inspired this repo.