{"id":13498769,"url":"https://github.com/xrsrke/instructGOOSE","last_synced_at":"2025-03-29T01:32:41.718Z","repository":{"id":65234380,"uuid":"582891530","full_name":"xrsrke/instructGOOSE","owner":"xrsrke","description":"Implementation of Reinforcement Learning from Human Feedback (RLHF)","archived":false,"fork":false,"pushed_at":"2023-04-07T08:45:59.000Z","size":3474,"stargazers_count":163,"open_issues_count":3,"forks_count":20,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-04-26T02:22:29.546Z","etag":null,"topics":["chatgpt","human-feedback","instructgpt","reinforcement-learning","rlhf"],"latest_commit_sha":null,"homepage":"https://xrsrke.github.io/instructGOOSE/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xrsrke.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-12-28T06:39:51.000Z","updated_at":"2024-04-16T07:17:11.000Z","dependencies_parsed_at":"2023-10-11T11:38:33.493Z","dependency_job_id":"2caf8f4b-cb8a-454b-87b7-75beb0a6cf2c","html_url":"https://github.com/xrsrke/instructGOOSE","commit_stats":{"total_commits":76,"total_committers":1,"mean_commits":76.0,"dds":0.0,"last_synced_commit":"5b9ac3f4a7a834b15a0771104f7d44cb07181c50"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xrsrke%2FinstructGOOSE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xrsrke%2FinstructGOOSE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xrsrke%2FinstructGOOSE/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/xrsrke%2FinstructGOOSE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xrsrke","download_url":"https://codeload.github.com/xrsrke/instructGOOSE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222445438,"owners_count":16985838,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatgpt","human-feedback","instructgpt","reinforcement-learning","rlhf"],"created_at":"2024-07-31T21:00:43.277Z","updated_at":"2024-10-31T16:30:53.035Z","avatar_url":"https://github.com/xrsrke.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook","Models","数据集","Awesome RHLF"],"sub_categories":["Tools and Resources"],"readme":"InstructGoose\n================\n\n\u003c!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! 
--\u003e\n\n[\u003cimg src=\"https://img.shields.io/badge/license-MIT-blue\"\u003e](https://github.com/xrsrke/instructGOOSE)\n[![tests](https://github.com/vwxyzjn/cleanrl/actions/workflows/tests.yaml/badge.svg)](https://github.com/xrsrke/instructGOOSE/actions/workflows/tests.yaml)\n[![docs](https://img.shields.io/github/deployments/vwxyzjn/cleanrl/Production?label=docs\u0026logo=vercel.png)](https://xrsrke.github.io/instructGOOSE/)\n[![Code style:\nblack](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Open In\nColab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1HN1t75Jem9jXPzaOQdQO06OXMxinimNx?usp=sharing)\n\u003c!-- [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat\u0026labelColor=ef8336)](https://pycqa.github.io/isort/) --\u003e\n\nPaper: InstructGPT - [Training language models to follow instructions\nwith human feedback](https://arxiv.org/abs/2203.02155)\n\n![image.png](index_files/figure-commonmark/531cd4d9-1-image.png)\n\n## Install\n\nInstall from PyPI\n\n``` sh\npip install instruct-goose\n```\n\nInstall directly from the source code\n\n``` sh\ngit clone https://github.com/xrsrke/instructGOOSE.git\ncd instructGOOSE\npip install -e .\n```\n\n### How to Train\n\n**For the reward model**\n\nUse 🤗 Accelerate to launch distributed training\n\n``` bash\naccelerate config\naccelerate launch scripts/train_reward.py\n```\n\n## Train the RL-based language model\n\n``` python\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nfrom datasets import load_dataset\n\nimport torch\nfrom torch.utils.data import DataLoader, random_split\nfrom torch import optim\n\nfrom instruct_goose import Agent, RewardModel, RLHFTrainer, RLHFConfig, create_reference_model\n```\n\n**Step 1**: Load the dataset\n\n``` python\ndataset = load_dataset(\"imdb\", split=\"train\")\ndataset, _ = random_split(dataset, lengths=[10, len(dataset) - 10]) # for 
demonstration purposes\ntrain_dataloader = DataLoader(dataset, batch_size=4, shuffle=True)\n```\n\n    Found cached dataset imdb (/Users/education/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)\n\n**Step 2**: Load the pre-trained model and tokenizer\n\n``` python\nmodel_base = AutoModelForCausalLM.from_pretrained(\"gpt2\") # for demonstration purposes\nreward_model = RewardModel(\"gpt2\")\n\ntokenizer = AutoTokenizer.from_pretrained(\"gpt2\", padding_side=\"left\")\neos_token_id = tokenizer.eos_token_id\ntokenizer.pad_token = tokenizer.eos_token\n```\n\n**Step 3**: Create the RL-based language model agent and the reference\nmodel\n\n``` python\nmodel = Agent(model_base)\nref_model = create_reference_model(model)\n```\n\n**Step 4**: Train it\n\n``` python\nmax_new_tokens = 20\ngeneration_kwargs = {\n    \"min_length\": -1,\n    \"top_k\": 0.0,\n    \"top_p\": 1.0,\n    \"do_sample\": True,\n    \"pad_token_id\": tokenizer.eos_token_id,\n    \"max_new_tokens\": max_new_tokens\n}\n\nconfig = RLHFConfig()\nN_EPOCH = 1 # for demonstration purposes\ntrainer = RLHFTrainer(model, ref_model, config)\noptimizer = optim.SGD(model.parameters(), lr=1e-3)\n```\n\n``` python\nfor epoch in range(N_EPOCH):\n    for batch in train_dataloader:\n        inputs = tokenizer(batch[\"text\"], padding=True, truncation=True, return_tensors=\"pt\")\n        response_ids = model.generate(\n            inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"],\n            **generation_kwargs\n        )\n\n        # keep only the newly generated tokens\n        response_ids = response_ids[:, -max_new_tokens:]\n        response_attention_mask = torch.ones_like(response_ids)\n\n        # score each query + response pair with the reward model\n        with torch.no_grad():\n            text_input_ids = torch.stack([torch.concat([q, r]) for q, r in zip(inputs[\"input_ids\"], response_ids)], dim=0)\n            rewards = reward_model(text_input_ids)\n\n      
  # calculate PPO loss\n        loss = trainer.compute_loss(\n            query_ids=inputs[\"input_ids\"],\n            query_attention_mask=inputs[\"attention_mask\"],\n            response_ids=response_ids,\n            response_attention_mask=response_attention_mask,\n            rewards=rewards\n        )\n        optimizer.zero_grad()\n        loss.backward()\n        optimizer.step()\n        print(f\"loss={loss}\")\n```\n\n    loss=-824.6560668945312\n    loss=0.030958056449890137\n    loss=4.284017562866211\n\n## TODO\n\n- Add support for custom reward functions\n- Add support for custom value functions\n- Add support for non-transformer models\n- Write a config class\n\n✅ Distributed training using 🤗 Accelerate\n\n## Resources\n\nI implemented this using the following resources:\n\n- Copied the\n  [`load_yaml`](https://xrsrke.github.io/instructGOOSE/utils.html#load_yaml)\n  function from https://github.com/Dahoas/reward-modeling\n- How to build a dataset to train a reward model:\n  https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2\n- How to add a value head to a PPO agent: https://github.com/lvwerra/trl\n- How to calculate the loss of a PPO agent:\n  https://github.com/lvwerra/trl/blob/main/trl/trainer/ppo_trainer.py\n- How to use PPO to train an RLHF agent: https://github.com/voidful/TextRL\n- How PPO works:\n  https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py\n- Copied the computation of `advantages` and `returns` from `trl`:\n  https://github.com/lvwerra/trl/blob/d2e8bcf8373726fb92d2110c500f7df6d0bd566d/trl/trainer/ppo_trainer.py#L686\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxrsrke%2FinstructGOOSE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxrsrke%2FinstructGOOSE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxrsrke%2FinstructGOOSE/lists"}