{"id":16269760,"url":"https://github.com/sea-snell/implicit-language-q-learning","last_synced_at":"2025-05-08T00:40:38.125Z","repository":{"id":37578081,"uuid":"500210223","full_name":"Sea-Snell/Implicit-Language-Q-Learning","owner":"Sea-Snell","description":"Official code from the paper \"Offline RL for Natural Language Generation with Implicit Language Q Learning\"","archived":false,"fork":false,"pushed_at":"2023-07-31T20:08:32.000Z","size":1194,"stargazers_count":205,"open_issues_count":1,"forks_count":18,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-31T15:50:01.367Z","etag":null,"topics":["implicit-q-learning","iql","language-model","nlp","offline-rl","python","pytorch","q-learning","reinforcement-learning"],"latest_commit_sha":null,"homepage":"https://sea-snell.github.io/ILQL_site/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sea-Snell.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-05T20:56:52.000Z","updated_at":"2025-03-04T16:52:25.000Z","dependencies_parsed_at":"2023-01-28T14:31:49.397Z","dependency_job_id":null,"html_url":"https://github.com/Sea-Snell/Implicit-Language-Q-Learning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sea-Snell%2FImplicit-Language-Q-Learning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sea-Snell%2FImplicit-Language-Q-Learning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sea-Snell%2FImplicit-Language-Q-Learning/releases","manifests_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sea-Snell%2FImplicit-Language-Q-Learning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sea-Snell","download_url":"https://codeload.github.com/Sea-Snell/Implicit-Language-Q-Learning/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252978668,"owners_count":21834910,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["implicit-q-learning","iql","language-model","nlp","offline-rl","python","pytorch","q-learning","reinforcement-learning"],"created_at":"2024-10-10T18:09:06.503Z","updated_at":"2025-05-08T00:40:38.102Z","avatar_url":"https://github.com/Sea-Snell.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Implicit Language Q Learning\n \nOfficial code from the paper \"Offline RL for Natural Language Generation with Implicit Language Q Learning\"\n \n[project site](https://sea-snell.github.io/ILQL_site/) | [arxiv](https://arxiv.org/abs/2206.11871)\n \n![A diagram of Implicit Language Q Learning](figures/ILQL_method_figure.png)\n \n# Setup\n \n### **Preprocessed Data and Reward Model**\n \nDownload `data.zip` and `outputs.zip` from the Google drive folder [here](https://drive.google.com/drive/folders/1ltO6e4sP3waGPJoGFGuiHt7mJt8_2eP3?usp=sharing). Place the downloaded and unzipped folders, `data/` and `outputs/`, at the root of the repo. 
`data/` contains the preprocessed data for all our tasks, and `outputs/` contains the checkpoint for our Reddit comments upvote reward.\n\n### **Dependencies and PYTHONPATH**\n \nThis repo was designed for python 3.9.7\n \n``` shell\npip install -r requirements.txt\nexport PYTHONPATH=\"$PWD/src/\"\n```\n \n### **Visual Dialogue Environment**\n \nTo run the Visual Dialogue experiments, you need to serve the Visual Dialogue environment on localhost by following the instructions [here](https://github.com/Sea-Snell/visdial-rl).\n \n### **Toxicity Filter Reward**\n \nTo run the Reddit comment experiments with the toxicity filter reward:\n \n1. create an account for the GPT-3 API [here](https://openai.com/api/)\n2. `export OPENAI_API_KEY=your_API_key`\n \n# Running Experiments\n \n`scripts/` contains all experiment scripts. To run any script in `scripts/`:\n1. Navigate to the script's directory.\n2. `python script_name.py`\n \nOptional:\n* Edit the config file corresponding to the script as you desire.\n* Provide commandline args [hydra](https://hydra.cc/docs/intro/) style like: `python script_name.py eval.bsize=5 train.lr=1e-6 wandb.use_wandb=false`\n* Run data parallel training or evaluation on multiple GPUs like: `python -m torch.distributed.launch --nproc_per_node [N_GPUs] --use_env script_name.py arg1=a arg2=b`\n \nBy default all training scripts log to wandb. To turn this off, set `wandb.use_wandb=false` in the training config.\n \n### **Recommended Experiment Workflow:**\n \nHere I outline a recommended workflow for training offline RL agents. 
Suppose that I want to train a bunch of different offline RL agents to generate Reddit comments with the toxicity reward.\n \nI would first train a BC model on the data:\n \n``` shell\ncd scripts/train/toxicity/\npython train_bc.py\n```\n \nThen convert this BC checkpoint into one compatible with the offline RL models:\n \n``` shell\ncd ../data/\npython convert_bc.py --load ../../outputs/toxicity/conditional_toxicity_official_bc_test1/model.pkl --save ../../outputs/toxicity/conditional_toxicity_official_bc_test1/model_converted.pkl\n```\n \nThen edit the checkpoint that offline RL is configured to train with:\n \n``` shell\ncd ../train/\npython train_iql.py model.load.checkpoint_path=outputs/toxicity/model_converted.pkl model.load.strict_load=false train.loss.awac_weight=0.0\n```\n \nThis is just one workflow though, you can also train the BC model at the same time as the offline RL agent by setting `train.loss.awac_weight=1.0` in the training config.\n \n# Repo Overview\n \n* All data is provided pre-processed in the `data/` folder.\n* `scripts/` contains all scripts for running training, evaluation, and data pre-processing steps in the paper. Scripts are organized into subfolders corresponding to the dataset used.\n* `config/` contains .yaml configs for each script. This repo uses [hydra](https://hydra.cc/docs/intro/) to manage configs. Configs are organized into subfolders corresponding to the dataset used. Most config files are named the same as their corresponding script, but if you are unsure which config corresponds to a script, check the line `@hydra.main(config_path=\"some_path\", config_name=\"some_name\")` to see which config file the script corresponds to.\n* `src/` contains all the core implementations. See `src/models/` for all model implementations. See `src/data/` for all base data processing and MDP abstraction code. See `src/utils/` for various utility functions. 
See `src/wordle/`, `src/visdial/`, and `src/toxicity/` for all Wordle, Visual Dialogue, and Reddit comment dataset-specific code, respectively.\n* `ILQL` is referred to as `iql` throughout the repo.\n \n## Config Framework Overview\n \nEach script is associated with a config file. The config file specifies which models, datasets, and evaluators are to be loaded by the script and their corresponding hyperparameters. See `configs/toxicity/train_iql.yaml` for an example.\n \nEach possible model, dataset, or evaluator object is given its own config file, which specifies default values for that object and a special `name` attribute, which tells the config manager what class to load. See `configs/toxicity/model/per_token_iql.yaml` for an example.\n \nThe files `src/load_objects.py`, `src/wordle/load_objects.py`, `src/visdial/load_objects.py`, and `src/toxicity/load_objects.py` define how each object is loaded from its corresponding config. The `@register('name')` tag above each load object function links to the `name` attribute in the config.\n \nYou may notice a special `cache_id` attribute associated with some objects in a config. For an example, see `train_dataset` in `configs/toxicity/train_iql.yaml`. This attribute tells the config manager to cache the first object that it loads that is associated with this id, and then to return this cached object for subsequent object configs with this `cache_id`.\n \nFor all configs, use paths relative to the repo root.\n \n## A Few Abstractions to be Aware of\n \nEach of the tasks in our repo – Wordle, Visual Dialogue, and Reddit comments – implements a few base classes. Once implemented, all the offline RL algorithms can be applied to the task in a plug-and-play manner. See the \"Creating Your Own Tasks\" section for an overview of what should be implemented in order to create your own tasks. 
Below, we outline the key abstractions that make this possible.\n \n* `data.language_environment.Language_Environment` – represents a task POMDP environment, which a policy can interact with. It has a gym-like interface.\n* `data.language_environment.Policy` – represents a policy which can interact with an environment. Each of the offline RL algorithms in `src/models/` has a corresponding policy.\n* `data.language_environment.Language_Observation` – represents a text observation that is returned by the environment and given as input to a policy.\n* `data.language_environment.interact_environment` – a function which takes in an environment, a policy, and optionally the current observation and runs an environment interaction loop. If the current observation is not provided, it automatically fetches an initial state by resetting the environment.\n* `data.rl_data.DataPoint` – defines a standardized data format that is fed as input to all offline RL agents on all tasks. These data structures are created automatically from a given `Language_Observation`.\n* `data.rl_data.TokenReward` – defines a reward function given at every single token, which can be used for learning more fine grained control. This is provided on top of the environment's reward, which comes not at every token but instead after each turn of interaction. In all our experiments we set this reward to a constant 0, such that it has no effect.\n* `data.tokenizer.Tokenizer` – specifies how to convert strings to and from sequences of tokens which can then be fed as input to language models.\n* `data.rl_data.RL_Dataset` – defines a dataset object which returns `DataPoint` objects and is used for training offline RL agents. There are two versions of `RL_Dataset`:\n   1. `List_RL_Dataset`\n   2. 
`Iterable_RL_Dataset`\n \n# Wordle Task\n \n![A gif of ILQL playing Wordle](figures/wordle_gif.gif)\n \nHere we outline and document all the components of our Wordle task.\n \nMuch of what is in the example scripts is done automatically by the config manager, and the corresponding parameters can be edited by changing the configs. But if you want to bypass using the configs and use the Wordle task with your own codebase, you can reference the scripts and documentation below for how to do this.\n \n### **Playing Wordle:**\n \nA simple example script for playing Wordle in the commandline.\n \n``` python\nfrom wordle.wordle_env import WordleEnvironment\nfrom wordle.wordle_game import Vocabulary\nfrom wordle.policy import UserPolicy\nfrom data.language_environment import interact_environment\nfrom utils.misc import convert_path\n \ngame_vocab = Vocabulary.from_file(convert_path('data/wordle/word_lists/wordle_official.txt'))\nenv = WordleEnvironment(game_vocab)\npolicy = UserPolicy()\n \ninteract_environment(env, policy)\n```\n \n## Code Overview:\n \n* Wordle game implementation: `src/wordle/wordle_game.py`\n* Wordle gym-like environment: `src/wordle/wordle_env.py`\n* A set of handcrafted Wordle policies: `src/wordle/policy.py`\n* Dataset classes that load Wordle games from a file, sample games from a given policy, or load games from Twitter data: `src/wordle/wordle_dataset.py`\n \nTo make the game a valid MDP, the environment represents the underlying state as a set of known letter constraints, and uses these to filter the vocabulary for words that meet all of these constraints at each turn. A random word is then selected from this filtered word list and used to determine the color transitions returned by the environment. These new color transitions then update the set of known letter constraints.\n \n## Word Lists:\n \nThe Wordle environment takes in a word list. 
A few word lists are given in `data/wordle/word_lists/`, but feel free to make your own.\n \nThe word lists included are:\n* tweet_words.txt: a set of daily words corresponding to the days that the Wordle Tweets were scraped.\n* wordle_official.txt: the official word list for the pre-NYT version of the game. Taken from [here](https://gist.github.com/cfreshman/a03ef2cba789d8cf00c08f767e0fad7b).\n* wordle_official_200.txt: a random subset of 200 words from wordle_official.txt unioned with the words in tweet_words.txt, for retrofitting onto Tweet data.\n* wordle_official_400.txt: the same as wordle_official_200.txt, but with a random subset of 400 words instead.\n* wordle_official_800.txt: the same as wordle_official_200.txt, but with a random subset of 800 words instead.\n* wordle_official_guess.txt: the official list of allowable guess words in the pre-NYT version of the game.\n* 10k_words.txt: from [MIT's 10000 words list](https://www.mit.edu/~ecprice/wordlist.10000).\n* large_words.txt: a massive list of words taken from [here](https://github.com/dwyl/english-words/blob/master/words_alpha.txt).\n \n## Vocabulary:\n \nThe word lists are loaded into the environment through a `Vocabulary` object as in the example above.\n \n``` python\nfrom wordle.wordle_game import Vocabulary\nfrom utils.misc import convert_path\n \nvocab = Vocabulary.from_file(convert_path('data/wordle/word_lists/wordle_official.txt'))\n```\n \nThe vocabulary stores not just the word list, but also keeps track of a filtered list of words that meet all the known letter constraints in a given state. This list is used to compute transitions in the environment and is used by some of the hand crafted policies.\n \nProducing these filtered lists in real time can slow the environment interaction process. This shouldn't normally be an issue, but if you want to quickly synthesize lots of data from a policy, then this may become a bottleneck. 
To overcome this, all `Vocabulary` objects hold a `cache`, which stores the filtered word lists associated with a given state. `vocab.cache.load(f_path)` and `vocab.cache.dump()` enable loading and saving this cache. For example, `data/wordle/vocab_cache_wordle_official.pkl` is a large cache for the wordle_official.txt word list.\n \nBeyond storing a cache, the `Vocabulary` object implements the following methods in `src/wordle/wordle_game.py`:\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, all_vocab: List[str],\n            wordle_state: Optional[WordleState],\n            cache: Optional[Cache]=None,\n            fill_cache: bool=True) -\u003e None\n```\n \n**Inputs:**\n* `all_vocab: List[str]` – a list of words.\n* `wordle_state: Optional[WordleState]` – a state from which to generate the filtered word list. If no state is provided, no words are filtered.\n* `cache: Optional[Cache]=None` – a cache for the filtered vocab, as described above.\n* `fill_cache: bool=True` – whether to add to the cache.\n \n**Returns:** `None`\n \n#\n \n#### **`from_file`**\n \n``` python\ndef from_file(cls, vocab_file: str, fill_cache: bool=True) -\u003e Vocabulary\n```\n \n**Inputs:**\n* `vocab_file: str` – a file from which to load the words. 
The method only selects the words that are 5 letters long.\n* `fill_cache: bool=True` – whether to add to the cache.\n \n**Returns:** `Vocabulary`\n \n#\n \n#### **`filtered_vocab_size`**\n \n``` python\ndef filtered_vocab_size(self) -\u003e int\n```\n \n**Returns:** The size of the filtered vocabulary\n \n#\n \n#### **`all_vocab_size`**\n \n``` python\ndef all_vocab_size(self) -\u003e int\n```\n \n**Returns:** The size of the full unfiltered vocabulary\n \n#\n \n#### **`get_random_word_filtered`**\n \n``` python\ndef get_random_word_filtered(self) -\u003e str\n```\n \n**Returns:** A random word from the filtered list.\n \n#\n \n#### **`get_random_word_all`**\n \n``` python\ndef get_random_word_all(self) -\u003e str\n```\n \n**Returns:** A random word from the full unfiltered list.\n \n#\n \n#### **`update_vocab`**\n \n``` python\ndef update_vocab(self, wordle_state: WordleState) -\u003e Vocabulary\n```\n \n**Inputs:**\n* `wordle_state: WordleState` – a Wordle state object, representing the set of known letter constraints.\n \n**Returns:** A new `Vocabulary` object, which is filtered according to `wordle_state`.\n \n#\n \n#### **`__str__`**\n \n``` python\ndef __str__(self) -\u003e str\n```\n \n**Returns:** A string representation of the filtered word list for printing to the terminal.\n \n---\n \n## Wordle Environment:\n \n`WordleEnvironment` takes a Vocabulary object as input, which defines the set of possible correct words in the environment.\n \n``` python\nfrom wordle.wordle_env import WordleEnvironment\nfrom wordle.wordle_game import Vocabulary\nfrom utils.misc import convert_path\n \nvocab = Vocabulary.from_file(convert_path('data/wordle/word_lists/wordle_official.txt'))\nenv = WordleEnvironment(vocab)\n \ninitial_obs = env.reset()\nnext_obs, reward, terminal = env.step(\"snake\")\n```\n \nAs shown above, the environment implements a gym-like interface in `src/wordle/wordle_env.py`:\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, vocab: 
Vocabulary) -\u003e None\n```\n \n**Inputs:**\n* `vocab: Vocabulary` – the environment's vocabulary.\n \n**Returns:** `None`\n \n#\n \n#### **`step`**\n \n``` python\ndef step(self, action: str) -\u003e Tuple[WordleObservation, float, bool]\n```\n \n**Inputs:**\n* `action: str` – a string of text representing an agent's action in the environment.\n \n**Returns:** an (observation, reward, terminal) tuple.\n \n#\n \n#### **`reset`**\n \n``` python\ndef reset(self) -\u003e WordleObservation\n```\n \n**Returns:** an observation.\n \n#\n \n#### **`is_terminal`**\n \n``` python\ndef is_terminal(self) -\u003e bool\n```\n \n**Returns:** a boolean indicating if the interaction has terminated.\n \n---\n \n## Hand Crafted Wordle Policies:\n \nWe implement a set of hand-crafted Wordle policies that cover a range of gameplay levels. All of these are implemented in `src/wordle/policy.py`. Here we describe each one:\n \n---\n \n#### **`UserPolicy`**\n \n``` python\nfrom wordle.policy import UserPolicy\n \npolicy = UserPolicy(hint_policy=None, vocab=None)\n```\n \n**Description:**\n \nLets you play in the terminal.\n \n**Inputs:**\n* `hint_policy: Optional[Policy]` – another policy to query if you want a hint on what word to use.\n* `vocab: Optional[Union[str, Vocabulary]]` – a `Vocabulary` of guessable words. If not specified, any 5-letter sequence of characters is a valid guess.\n \n#\n \n#### **`StartWordPolicy`**\n \n``` python\nfrom wordle.policy import StartWordPolicy\n \npolicy = StartWordPolicy()\n```\n \n**Description:**\n \nTo be applied only for the first word. 
Selects a word randomly from a list of curated, high-quality start words.\n \n**Inputs:**\n* `start_words: Optional[List[str]]=None` – override the curated list of start words.\n \n#\n \n#### **`OptimalPolicy`**\n \n``` python\nfrom wordle.policy import OptimalPolicy\n \npolicy = OptimalPolicy()\n```\n \n**Description:**\n \nMyopically plays the highest-information-gain word from the word list that meets all known letter constraints. This policy is not actually optimal, as [optimal play is NP-hard](https://arxiv.org/abs/2203.16713). But it plays at an extremely high level, and can be used as an approximate upper bound for performance. This policy is very slow to compute, with cost quadratic in the size of the word list; to save computation, `self.cache.load(f_path)` and `self.cache.dump()` allow you to load and save a cache. For example, `data/wordle/optimal_policy_cache_wordle_official.pkl` represents a cache for this policy on the `wordle_official.txt` word list.\n \n**Inputs:**\n* `start_word_policy: Optional[Policy]=None` – since the first word is generally the most expensive to compute information gain for, this allows you to specify a different policy to be called for just the first word.\n* `progress_bar: bool=False` – since it can take so long to compute, we leave you the option of displaying a progress bar for each call to `self.act`.\n \n#\n \n#### **`RepeatPolicy`**\n \n``` python\nfrom wordle.policy import RepeatPolicy\n \npolicy = RepeatPolicy(start_word_policy=None, first_n=2)\n```\n \n**Description:**\n \nRandomly repeats one of the `first_n` words already used. This is a maximally suboptimal policy, since it can never win unless it gets lucky on the first word.\n \n**Inputs:**\n* `start_word_policy: Optional[Policy]` – a policy to use for choosing the first word. 
If `None`, then randomly select a word from the environment's vocabulary.\n* `first_n: Optional[int]` – the policy randomly selects the next word from the `first_n` words in the history. If `None`, then it selects randomly from the full history.\n \n#\n \n#### **`RandomMixturePolicy`**\n \n``` python\nfrom wordle.policy import RandomMixturePolicy\n \npolicy = RandomMixturePolicy(prob_smart=0.5, vocab=None)\n```\n \n**Description:**\n \nChooses a word fully at random from a word list with probability `(1 - prob_smart)` and chooses a random word from the word list that meets all known letter constraints with probability `prob_smart`.\n \n**Inputs:**\n* `prob_smart: float` – the probability of selecting a word that meets all known letter constraints, rather than one fully at random.\n* `vocab: Optional[Union[str, Vocabulary]]` – a word list to select from. If `None`, then the policy defaults to the environment's word list.\n \n#\n \n#### **`WrongPolicy`**\n \n``` python\nfrom wordle.policy import WrongPolicy\nfrom wordle.wordle_game import Vocabulary\n \n \nvocab = Vocabulary.from_file('data/wordle/word_lists/wordle_official.txt')\npolicy = WrongPolicy(vocab)\n```\n \n**Description:**\n \nRandomly chooses a word from a word list that fails to meet all known letter constraints and thus cannot be the correct word. If all words in the word list meet the letter constraints, then it chooses a word at random from the list. This policy is highly suboptimal.\n \n**Inputs:**\n* `vocab: Union[str, Vocabulary]` – a word list to choose from.\n \n#\n \n#### **`MixturePolicy`**\n \n``` python\nfrom wordle.policy import MixturePolicy, OptimalPolicy, RandomMixturePolicy\n \npolicy1 = OptimalPolicy()\npolicy2 = RandomMixturePolicy(prob_smart=0.5, vocab=None)\npolicy = MixturePolicy(prob1=0.5, policy1=policy1, policy2=policy2)\n```\n \n**Description:**\n \nMixes two given policies. 
Selects from `policy1` with probability `prob1` and from `policy2` with probability `(1 - prob1)`.\n \n**Inputs:**\n* `prob1: float` – the probability of selecting an action from `policy1`.\n* `policy1: Policy` – the first policy to select actions from. Selected with probability `prob1`.\n* `policy2: Policy` – the second policy to select actions from. Selected with probability `(1 - prob1)`.\n \n#\n \n#### **`MonteCarloPolicy`**\n \n``` python\nfrom wordle.policy import MonteCarloPolicy, RandomMixturePolicy\n \nsample_policy = RandomMixturePolicy(prob_smart=0.5, vocab=None)\npolicy = MonteCarloPolicy(n_samples=5, sample_policy=sample_policy)\n```\n \n**Description:**\n \nTakes in a policy, runs `n_samples` Monte Carlo rollouts in the environment, and selects the next action that received the highest average reward during the rollout process.\n \n**Inputs:**\n* `n_samples: int` – the number of Monte Carlo rollouts to execute.\n* `sample_policy: Policy` – the policy to sample rollouts from.\n \n---\n \n## Synthetic Wordle Data\n \n![An example of a synthetic dataset](figures/SARSA_emperical.png)\n \nAny of the above policies can be used to generate datasets, which can be used to train offline RL agents. We implement, in `src/wordle/wordle_dataset.py`, two kinds of synthetic datasets:\n1. `wordle.wordle_dataset.WordleListDataset` – loads Wordle games from a file.\n2. 
`wordle.wordle_dataset.WordleIterableDataset` – samples Wordle games from a given policy.\n \n### **`WordleListDataset`:**\nLoad a Wordle dataset from a file like so:\n \n``` python\nfrom wordle.wordle_dataset import WordleListDataset\nfrom data.rl_data import ConstantTokenReward\n \ndata = WordleListDataset.from_file(\n   file_path='data/wordle/expert_wordle_100k.pkl',\n   max_len=None,\n   vocab=None,\n   token_reward=ConstantTokenReward(0.0),\n)\n \nfor i in range(data.size()):\n   item = data.get_item(i)\n```\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, items: List[Tuple[WordleObservation, Optional[Dict[str, Any]]]], max_len: Optional[int], token_reward: TokenReward) -\u003e None\n```\n \n**Inputs:**\n* `items: List[Tuple[WordleObservation, Optional[Dict[str, Any]]]]` – a list of data in the form of tuples of (WordleObservation, metadata_dict), where metadata_dict is any metadata you might want to store in the DataPoint.\n* `max_len: Optional[int]` – the maximum sequence length in the dataset, will truncate all token sequences to this length. If `None`, then sequences will not be truncated.\n* `token_reward: TokenReward` – the token-level reward to apply to the sequences. We use a constant reward of 0 per-token for all experiments.\n \n**Returns:** `None`\n \n#\n \n#### **`from_file`**\n \n``` python\ndef from_file(cls, file_path: str, max_len: Optional[int], vocab: Optional[Vocabulary], token_reward: TokenReward) -\u003e WordleListDataset\n```\n \n**Inputs:**\n* `file_path: str` – the path to the data pickle file.\n* `max_len: Optional[int]` – the maximum sequence length in the dataset, will truncate all token sequences to this length. If `None`, then sequences will not be truncated.\n* `vocab: Optional[Vocabulary]` – simulate the dataset under a different environment vocabulary. 
If `None`, defaults to using the same vocabulary that was used to create the dataset.\n* `token_reward: TokenReward` – the token-level reward to apply to the sequences. We use a constant reward of 0 per-token for all experiments.\n \n**Returns:** a `WordleListDataset` object.\n \n#\n \n#### **`get_item`**\n \n``` python\ndef get_item(self, idx: int) -\u003e DataPoint\n```\n \n**Inputs:**\n* `idx: int` – an index in the dataset.\n \n**Returns:** a `DataPoint` object.\n \n#\n \n#### **`size`**\n \n``` python\ndef size(self) -\u003e int\n```\n \n**Returns:** the size of the dataset.\n \n---\n \nThe following scripts in `scripts/data/wordle/` can be used to synthesize Wordle data.\n \n| script      | description |\n| ----------- | ----------- |\n| `generate_data.py` | Samples a number of games from a given policy specified in the config and saves them to a file. |\n| `generate_data_mp.py` | The same as `generate_data.py` except samples games in parallel on multiple processes. |\n| `generate_adversarial_data.py` | synthesizes the dataset described in Section 5 of our paper, which was designed to demonstrate the difference between single-step RL methods and multi-step ones. |\n| `generate_adversarial_data_mp.py` | The same as `generate_adversarial_data.py` except samples games in parallel on multiple processes. |\n| `generate_data_branch.py` | Samples games from a given \"expert\" policy and then from each action in the game, a \"suboptimal\" policy branches off sampling a number of new games. |\n| `generate_data_branch_mp.py` | The same as `generate_data_branch.py` except samples games in parallel on multiple processes. |\n \n#\n \nSome provided synthetic Wordle datasets are in `data/wordle/`.\n \n| file      | description |\n| ----------- | ----------- |\n| `expert_wordle_100k_1.pkl` | 100k games sampled from `OptimalPolicy`. |\n| `expert_wordle_100k_2.pkl` | Another 100k games sampled from the `OptimalPolicy`. 
|\n| `expert_wordle_adversarial_20k.pkl` | The dataset described in Section 5 of our paper, which was designed to demonstrate the difference between single-step RL methods and multi-step ones. |\n| `expert_wordle_branch_100k.pkl` | 100k games sampled using `generate_data_branch.py` from `OptimalPolicy` with the branches sampled from `WrongPolicy`. |\n| `expert_wordle_branch_150k.pkl` | Another 150k games sampled using `generate_data_branch.py` from `OptimalPolicy` with the branches sampled from `WrongPolicy`. |\n| `expert_wordle_branch_2k_10sub.pkl` | 2k games sampled using `generate_data_branch.py` from `OptimalPolicy` with 10 branches per action sampled from `WrongPolicy`, such that there is much more suboptimal data than in `expert_wordle_branch_100k.pkl`. |\n| `expert_wordle_branch_20k_10sub.pkl` | The same as `expert_wordle_branch_2k_10sub.pkl` except 20k games instead of 2k games. |\n \n### **`WordleIterableDataset`:**\n \nGenerate Wordle data by sampling from a policy like so:\n \n``` python\nfrom wordle.wordle_dataset import WordleIterableDataset\nfrom wordle.wordle_game import Vocabulary\nfrom wordle.policy import OptimalPolicy\nfrom data.rl_data import ConstantTokenReward\n \npolicy = OptimalPolicy()\nvocab = Vocabulary.from_file('data/wordle/word_lists/wordle_official.txt')\ndata = WordleIterableDataset(\n   policy=policy,\n   vocab=vocab,\n   max_len=None,\n   token_reward=ConstantTokenReward(0.0),\n)\n \nwhile True:\n   item = data.sample_item()\n```\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, policy: Policy, vocab: Vocabulary, max_len: Optional[int], token_reward: TokenReward) -\u003e None\n```\n \n**Inputs:**\n* `policy: Policy` – a policy to sample from.\n* `vocab: Vocabulary` – the environment's vocabulary.\n* `max_len: Optional[int]` – the maximum sequence length in the dataset, will truncate all token sequences to this length. If `None`, then sequences will not be truncated.\n* `token_reward: TokenReward` – the token-level reward to apply to the sequences. 
We use a constant reward of 0 per-token for all experiments.\n \n**Returns:** `None`\n \n#\n \n#### **`sample_item`**\n \n``` python\ndef sample_item(self) -\u003e DataPoint\n```\n \n**Returns:** a `DataPoint` object.\n \n---\n \n## Wordle Tweet Data:\n \nWe have a large dataset of over 200k Tweets of Wordle games like this:\n \n\u003cimg src=\"figures/wordle_tweet.png\" height=\"45%\" width=\"45%\" style=\"display: block; margin-left: auto; margin-right: auto\"\u003e\n\u003c/br\u003e\n \nWe can retrofit words onto these color transition squares to create a real dataset of Wordle games.\n \n### **Preprocessing the Tweet Data:**\n \nThe raw Tweet data is given in `data/wordle/tweets.csv`, but in order to be usable, actual words need to be retrofitted onto the color squares in the Tweets. Performing this retrofitting process requires executing a preprocessing script which caches all possible color transitions that could occur under the vocab lists: `guess_vocab` (a set of guessable words) and `correct_vocab` (a set of possible correct words in an environment). The result is a data structure that `wordle.wordle_dataset.WordleHumanDataset` uses to synthesize valid Wordle games from the Tweets. This script is `scripts/data/wordle/build_human_datastructure.py`. 
Call the script like:\n \n``` shell\ncd scripts/data/wordle/\npython build_human_datastructure.py --guess_vocab=../../../data/wordle/word_lists/wordle_official.txt --correct_vocab=../../../data/wordle/word_lists/wordle_official.txt --tweets_file=../../../data/wordle/tweets.csv --output_file=../../../data/wordle/random_human_tweet_data.json\n```\n \nThe script's args:\n* `--guess_vocab` specifies the set of guessable words.\n* `--correct_vocab` specifies the set of possible correct words in an environment.\n* `--tweets_file` specifies the raw csv file of Tweets\n* `--output_file` specifies where to dump the output.\n \n### **Loading the Tweet Data:**\n \nWe've run the preprocessing on some of the word lists, with the results saved in `data/wordle/`.\n \n| word list      | preprocessed Tweet data file |\n| ----------- | ----------- |\n| `wordle_official.txt` | `random_human_tweet_data.json` |\n| `wordle_official_800.txt` | `random_human_tweet_data_800.json` |\n| `wordle_official_400.txt` | `random_human_tweet_data_400.json` |\n| `wordle_official_200.txt` | `random_human_tweet_data_200.json` |\n| `tweet_words.txt` | `human_tweet_data_true_word.json` |\n \n \nGiven one of these files you can load the Wordle Tweet dataset like so:\n \n``` python\nfrom wordle.wordle_dataset import WordleHumanDataset\n \ndata = WordleHumanDataset.from_file('data/wordle/random_human_tweet_data_200.json')\n \nprint(data.sample_item())\n```\n \nWe used `'data/wordle/random_human_tweet_data_200.json'` in our experiments.\n \n### **`WordleHumanDataset`:**\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, games: List[Tuple[str, List[str]]], transitions: Dict[str, Dict[str, List[str]]], use_true_word: bool, max_len: Optional[int], token_reward: TokenReward, game_indexes: Optional[List[int]], top_p: Optional[float]) -\u003e None\n```\n \n**Inputs:**\n* `games: List[Tuple[str, List[str]]]` – a list of tuples of the form `(correct_wordle_word, wordle_transitions_list)`, where 
`wordle_transitions_list` is a list of transitions indicating the colors in the Tweet like: `[\"\u003cb\u003e\u003cb\u003e\u003cy\u003e\u003cy\u003e\u003cb\u003e\", \"\u003cg\u003e\u003cb\u003e\u003cb\u003e\u003cb\u003e\u003cb\u003e\", \"\u003cg\u003e\u003cg\u003e\u003cy\u003e\u003cb\u003e\u003cb\u003e\", \"\u003cg\u003e\u003cg\u003e\u003cg\u003e\u003cg\u003e\u003cg\u003e\"]`.\n* `transitions: Dict[str, Dict[str, List[str]]]` – a dict mapping the correct wordle word to another dict mapping possible color transitions that could have been induced by that word to a list of words that could have been played to cause that transition. This data structure is used to retrofit words onto the Tweets.\n* `use_true_word: bool` – if `True`, use the ground-truth correct word from the tweet, else retrofit any correct word in the word list that works.\n* `max_len: Optional[int]` – the maximum sequence length in the dataset, will truncate all token sequences to this length. If `None`, then sequences will not be truncated.\n* `token_reward: TokenReward` – the token-level reward to apply to the sequences. We use a constant reward of 0 per-token for all experiments.\n* `game_indexes: Optional[List[int]]` – a list of indexes to create a split of the Tweets. If `None`, all items in the data will be used. We have `data/wordle/human_eval_idxs.json` and `data/wordle/human_train_idxs.json` created as randomly selected train and eval splits.\n* `top_p: Optional[float]` – filter for the `top_p` performing percent of the data. If `None`, no data will be filtered. 
Used with %BC models.\n \n**Returns:** `None`\n \n#\n \n#### **`from_file`**\n \n``` python\ndef from_file(cls, file_path: str, use_true_word: bool=False, max_len: Optional[int]=None, token_reward: Optional[TokenReward]=None, top_p: Optional[float]=None) -\u003e WordleHumanDataset\n```\n \n**Inputs:**\n* `file_path: str` – the path to the json file to load the data from.\n* `use_true_word: bool` – if `True`, use the ground-truth correct word from the tweet, else retrofit any correct word in the word list that works.\n* `max_len: Optional[int]` – the maximum sequence length in the dataset, will truncate all token sequences to this length. If `None`, then sequences will not be truncated.\n* `token_reward: TokenReward` – the token-level reward to apply to the sequences. We use a constant reward of 0 per-token for all experiments.\n* `game_indexes: Optional[List[int]]` – a list of indexes to create a split of the Tweets. If `None`, all items in the data will be used. We have `data/wordle/human_eval_idxs.json` and `data/wordle/human_train_idxs.json` created as randomly selected train and eval splits.\n* `top_p: Optional[float]` – filter for the `top_p` performing percent of the data. If `None`, no data will be filtered. Used with %BC models.\n \n \n**Returns:** a `WordleHumanDataset` object.\n \n#\n \n#### **`sample_item`**\n \n``` python\ndef sample_item(self) -\u003e DataPoint\n```\n \n**Returns:** a `DataPoint` object.\n \n---\n \n## Wordle Training and Evaluation Scripts\n \nTraining scripts are in `scripts/train/wordle/`.\n \n| script      | description |\n| ----------- | ----------- |\n| `train_bc.py` | Train a BC agent. |\n| `train_iql.py` | Train an ILQL agent. |\n \nEvaluation scripts are in `scripts/eval/wordle/`.\n \n| script      | description |\n| ----------- | ----------- |\n| `eval_policy.py` | Evaluate a BC or ILQL agent in the Wordle environment. 
|\n| `eval_q_rank.py` | An evaluation script for comparing the relative rank of Q values for agents trained on the synthetic dataset described in Section 5 of our paper, which was designed to demonstrate a difference between single-step RL and multi-step RL. |\n| `distill_policy_eval.py` | Prints out the result of `eval_policy.py` with error bars. |\n \n# Visual Dialogue Question Asking Task\n \nHere we outline how to load the [Visual Dialogue](https://visualdialog.org) data in our codebase and how to execute the environment. See the setup section above for how to set up the remote components of the Visual Dialogue environment. The data and environment objects are loaded automatically by the config manager, but if you want to bypass the config system and use the environment with your own codebase, here's how you should load, execute, and configure these objects. The same settings described below can all be modified in the configs as well.\n \n### **Loading the Visual Dialogue environment:**\n \nAn example of how to load the Visual Dialogue environment:\n \n``` python\nfrom visdial.visdial_env import VDEnvironment\nfrom visdial.visdial_base import VisDialogueData\nfrom visdial.visdial_dataset import VisDialListDataset\nfrom data.rl_data import ConstantTokenReward\nfrom utils.misc import convert_path\n \ndata = VisDialogueData(\n   data_path=convert_path('data/vis_dialogue/raw/visdial_0.5/visdial_0.5_train.json'),\n   img_feat_path=convert_path('data/vis_dialogue/processed/visdial_0.5/data_img.h5'),\n   split='train',\n   reward_cache=convert_path('data/vis_dialogue/processed/visdial_0.5/train_rank_reward_cache1.json'),\n   yn_reward_kind='none'\n)\n \nlist_data = VisDialListDataset(\n   data=data,\n   max_len=None,\n   token_reward=ConstantTokenReward(0.0)\n)\n \nenv = VDEnvironment(\n   dataset=list_data,\n   url='http://localhost:5000/step_rank',\n   yn_reward=-2.0,\n   yn_reward_kind='none'\n)\n \nprint(env.reset())\n```\n \nThe above script corresponds to how we 
configured the dataset and environment for our 'standard' reward experiments, but if you want to configure the dataset differently, there are many arguments you can modify. Beyond just changing the dataset split, these arguments can also change the task or reward. Below we describe all the different configurable parameters that `VisDialogueData`, `VisDialListDataset`, and `VDEnvironment` take.\n \n## Documentation:\n \nWe document the parameters and methods for `VisDialogueData`, `VisDialListDataset`, and `VDEnvironment`, so you know how to configure the environment yourself.\n \n### **`VisDialogueData`:**\n \n`VisDialogueData`, implemented in `src/visdial/visdial_base.py`, stores the task's set of dialogues and rewards.\n\n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, data_path: str, img_feat_path: str, split: str, reward_cache: Optional[str]=None, norm_img_feats: bool=True, reward_shift: float=0.0, reward_scale: float=1.0, addition_scenes: Optional[List[Scene]]=None, mode: str='env_stops', cutoff_rule: Optional[CutoffRule]=None, yn_reward: float=-2.0, yn_reward_kind: str='none') -\u003e None\n```\n \n**Inputs:**\n* `data_path: str` – the path to the dialogue data. Should be one of:\n   1. `data/vis_dialogue/raw/visdial_0.5/visdial_0.5_train.json`\n   2. `data/vis_dialogue/raw/visdial_0.5/visdial_0.5_val.json`\n   3. `data/vis_dialogue/raw/visdial_0.5/visdial_0.5_test.json`\n* `img_feat_path: str` – the path to the image features used to compute the reward for each dialogue. Should always be `data/vis_dialogue/processed/visdial_0.5/data_img.h5`.\n* `split: str` – one of `train`, `val`, or `test`. Indicates which dataset split of the image features to use. Should be consistent with the `data_path` split.\n* `reward_cache: Optional[str]=None` – where the rewards for each dialogue are stored. If `None`, it will set all rewards to `None`. We provide caches for two reward functions:\n   1. 
The reward for the percentile-rank reward function we used in our paper is cached at: `data/vis_dialogue/processed/visdial_0.5/[split]_rank_reward_cache1.json`, where `[split]` is replaced by one of `train`, `val`, or `test`.\n   2. The Euclidean distance-based reward used by the paper [Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning](https://arxiv.org/abs/1703.06585) is cached at: `data/vis_dialogue/processed/visdial_0.5/[split]_reward_cache2.json`, where `[split]` is replaced by one of `train`, `val`, or `test`.\n* `norm_img_feats: bool=True` – whether to normalize the image features.\n* `reward_shift: float=0.0` – shift the reward by this amount.\n* `reward_scale: float=1.0` – scale the reward by this amount.\n* `addition_scenes: Optional[List[Scene]]=None` – inject additional data into the dataset.\n* `mode: str='env_stops'` – one of `['agent_stops', 'env_stops', '10_stop']`. Controls some properties of the task. We use `env_stops`.\n   * If `mode='env_stops'`, then stop environment interaction early according to `cutoff_rule`.\n   * If `mode='agent_stops'`, then the agent stops interaction by generating a special `\u003cstop\u003e` token during its action; augments the data by placing a `\u003cstop\u003e` after every possible action.\n   * If `mode='10_stop'`, the play always stops after 10 rounds of interaction, as is standard in the Visual Dialogue dataset.\n* `cutoff_rule: Optional[CutoffRule]=None` – only applies if `mode='env_stops'`. Implements a function which determines when the environment should stop interaction early. We use the default of `visdial.visdial_base.PercentileCutoffRule(1.0, 0.5)` in all our experiments.\n* `yn_reward: float=-2.0` – the reward penalty that should be added for asking yes/no questions.\n* `yn_reward_kind: str='none'` – specifies the string match heuristic to be used for determining if a yes/no question was asked. 
Should be one of `['none', 'soft', 'hard', 'conservative']`.\n   * `'none'`: don't penalize yes/no questions. This corresponds to the `standard` reward in our paper.\n   * `'soft'`: penalize a question if the response contains `\"yes\"` or `\"no\"` as a substring.\n   * `'hard'`: penalize a question if the response matches exactly with the string `\"yes\"` or `\"no\"`. This corresponds to the `\"y/n\"` reward in our paper.\n   * `'conservative'`: penalize a question if the response satisfies one of several string matching heuristics. This corresponds to the `\"conservative y/n\"` reward in our paper.\n \n**Returns:** `None`\n \n#\n \n#### **`__len__`**\n \n``` python\ndef __len__(self) -\u003e int\n```\n \n**Returns:** the size of the dataset.\n \n#\n \n#### **`__getitem__`**\n \n``` python\ndef __getitem__(self, i: int) -\u003e Scene\n```\n \n**Inputs:**\n* `i: int` – the dataset index.\n \n**Returns:** an item from the dataset.\n \n---\n \n### **`VisDialListDataset`:**\n \n`VisDialListDataset`, implemented in `src/visdial/visdial_dataset.py`, wraps around `VisDialogueData` and converts it into a `DataPoint` format that can be used to train offline RL agents.\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, data: VisDialogueData, max_len: Optional[int], token_reward: TokenReward, top_p: Optional[float]=None, bottom_p: Optional[float]=None) -\u003e None\n```\n \n**Inputs:**\n* `data: VisDialogueData` – a Visual Dialogue data object that stores all the raw data.\n* `max_len: Optional[int]` – the maximum sequence length in the dataset, will truncate all token sequences to this length. If `None`, then sequences will not be truncated.\n* `token_reward: TokenReward` – the token-level reward to apply to the sequences. We use a constant reward of 0 per-token for all experiments.\n* `top_p: Optional[float]` – filter for the `top_p` performing percent of the data. If `None`, no data will be filtered. 
Used with %BC models.\n* `bottom_p: Optional[float]` – filter for the `bottom_p` performing percent of the data. If `None`, no data will be filtered.\n \n**Returns:** `None`\n \n#\n \n#### **`size`**\n \n``` python\ndef size(self) -\u003e int\n```\n \n**Returns:** the size of the dataset.\n \n#\n \n#### **`get_item`**\n \n``` python\ndef get_item(self, idx: int) -\u003e DataPoint\n```\n \n**Inputs:**\n* `idx: int` – the dataset index.\n \n**Returns:** a `DataPoint` from the dataset.\n \n---\n \n### **`VDEnvironment`:**\n \n`VDEnvironment`, implemented in `src/visdial/visdial_env.py`, defines the Visual Dialogue environment, which our offline RL agents interact with at evaluation time. The environment involves connecting to a localhost server, which the Setup section describes how to spin up.\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, dataset: RL_Dataset, url: str, reward_shift: float=0.0, reward_scale: float=1.0, actor_stop: bool=False, yn_reward: float=-2.0, yn_reward_kind: str='none') -\u003e None\n```\n \n**Inputs:**\n* `dataset: RL_Dataset` – takes an `RL_Dataset`; specifically `VisDialListDataset`, as above. This dataset is used to select initial states.\n* `url: str` – the url for stepping the environment. Follow the instructions in the setup section for how to initialize the localhost webserver corresponding to this url.\n* `reward_shift: float=0.0` – shift the reward by this amount.\n* `reward_scale: float=1.0` – scale the reward by this amount.\n* `actor_stop: bool=False` – allow the actor to stop interaction early by generating a special `\u003cstop\u003e` token.\n* `yn_reward: float=-2.0` – the reward penalty that should be added for asking yes/no questions.\n* `yn_reward_kind: str='none'` – specifies the string match heuristic to be used for determining if a yes/no question was asked. Should be one of `['none', 'soft', 'hard', 'conservative']`.\n   * `'none'`: don't penalize yes/no questions. 
This corresponds to the `standard` reward in our paper.\n   * `'soft'`: penalize a question if the response contains `\"yes\"` or `\"no\"` as a substring.\n   * `'hard'`: penalize a question if the response matches exactly with the string `\"yes\"` or `\"no\"`. This corresponds to the `\"y/n\"` reward in our paper.\n   * `'conservative'`: penalize a question if the response satisfies one of several string matching heuristics. This corresponds to the `\"conservative y/n\"` reward in our paper.\n \n**Returns:** `None`\n \n#\n \n#### **`step`**\n \n``` python\ndef step(self, action: str) -\u003e Tuple[WordleObservation, float, bool]\n```\n \n**Inputs:**\n* `action: str` – the agent's action, given as a string.\n \n**Returns:** an (observation, reward, terminal) tuple.\n \n#\n \n#### **`reset`**\n \n``` python\ndef reset(self) -\u003e WordleObservation\n```\n \n**Returns:** an observation\n \n#\n \n#### **`is_terminal`**\n \n``` python\ndef is_terminal(self) -\u003e bool\n```\n \n**Returns:** a boolean indicating if the interaction has terminated.\n \n---\n \n## Visual Dialogue Training and Evaluation Scripts\n \nTraining scripts are in `scripts/train/vis_dial/`.\n \n| script      | description |\n| ----------- | ----------- |\n| `train_bc.py` | Train a BC agent. |\n| `train_chai.py` | Train a CHAI agent. |\n| `train_cql.py` | Train a CQL agent. |\n| `train_dt.py` | Train a decision transformer agent. |\n| `train_iql.py` | Train an ILQL agent. |\n| `train_psi.py` | Train a $\\psi$-learning agent. |\n| `train_utterance.py` | Train an utterance-level ILQL agent. |\n \nEvaluation scripts are in `scripts/eval/vis_dial/`.\n \n| script      | description |\n| ----------- | ----------- |\n| `eval_policy.py` | Evaluate an agent in the Visual Dialogue environment. |\n| `top_advantage.py` | Finds the questions which have the greatest and the smallest advantage under the model. |\n| `distill_policy_eval.py` | Prints out the result of `eval_policy.py` with error bars. 
|\n \n# Reddit Comment Task\n \nHere we outline how to load the Reddit comments data in our codebase and how to execute the environment. See the setup section above for how to set up the toxicity filter reward. The data and environment objects are loaded automatically by the config manager, but if you want to bypass the config system and use the task with your own codebase, here's how you should load, execute, and configure these objects. The same settings described below can all be modified in the configs as well.\n \n### **Loading the Reddit comment environment:**\n \nAn example of how to load the Reddit comment environment:\n \n``` python\nimport json\n \nfrom toxicity.toxicity_env import ToxicityEnvironment\nfrom toxicity.reddit_comments_base import RedditData\nfrom toxicity.reward_fs import toxicity_reward\nfrom utils.misc import convert_path\n \nidxs = json.load(open(convert_path('data/reddit_comments/train_idxs.json'), 'r'))\n \ndata = RedditData(\n   path=convert_path('data/reddit_comments/'),\n   indexes=idxs,\n   reward_f=toxicity_reward\n)\n \nenv = ToxicityEnvironment(\n   data=data,\n   reward_f=toxicity_reward\n)\n \nprint(env.reset())\n \n```\n \nThe above script corresponds to how we configured the environment for our toxicity reward experiments, but if you want to configure the environment differently, there are a few arguments you can modify. These arguments can also change the task or reward. Below we describe all the different configurable parameters that our reward functions, `RedditData`, `ToxicityListDataset`, and `ToxicityEnvironment` take.\n \n## Documentation\n \nWe document the parameters and methods for our different Reddit comment reward functions, `RedditData`, `ToxicityListDataset`, and `ToxicityEnvironment`, so that you know how to configure the environment yourself.\n \n### **reward functions:**\n \nHere we outline the 4 main reward functions we use for our Reddit comment task. 
Each of these rewards is implemented in `src/toxicity/reward_fs.py`.\n \n---\n \n#### **`toxicity_reward`**\n \n``` python\nfrom toxicity.reward_fs import toxicity_reward\n \nreward_f = toxicity_reward()\n```\n \n**Description:**\n \nThe \"toxicity\" reward from our paper, which queries the GPT-3 toxicity filter. It assigns a value of \"0\" to non-toxic comments, a value of \"1\" to moderately toxic comments, and a value of \"2\" to very toxic comments.\n \n#\n \n#### **`toxicity_noised_reward`**\n \n``` python\nfrom toxicity.reward_fs import toxicity_noised_reward\n \nreward_f = toxicity_noised_reward()\n```\n \n**Description:**\n \nThe \"noised toxicity\" reward from our paper, which is the same as `toxicity_reward` but with additional noise. Specifically, it re-assigns comments labeled as \"1\" (moderately toxic) to either \"0\" (non-toxic) or \"2\" (very toxic) with equal probability.\n \n#\n \n#### **`score_human_reward`**\n \n``` python\nfrom toxicity.reward_fs import score_human_reward\nfrom utils.misc import convert_path\n \nreward_f = score_human_reward(\n   reddit_path=convert_path('data/reddit_comments/'),\n   indexes=None\n)\n```\n \n**Description:**\n \nThe \"upvotes real\" reward from our paper, which gives a reward of +1 for positive upvote comments and -1 for negative upvote comments. This uses the ground truth upvotes in the data, so it only applies to comments in the dataset and cannot be used for evaluation. If you input a string not present in the data, it will error. The arguments to this function specify what data to load.\n \n**Inputs:**\n* `reddit_path: str` – a path to the data.\n* `indexes: List[int]` – a split of indexes in the data to use. 
If `None`, it considers all the data.\n \n#\n \n#### **`model_reward`**\n \n``` python\nimport torch\n \nfrom toxicity.reward_fs import model_reward\nfrom toxicity.reddit_comments_base import RedditData\nfrom toxicity.toxicity_dataset import ToxicityListDataset\nfrom toxicity.reward_model import RobertaBinaryRewardModel\nfrom data.rl_data import ConstantTokenReward\nfrom utils.misc import convert_path\n \ndata = RedditData(\n   path=convert_path('data/reddit_comments/'),\n   indexes=None,\n   reward_f=None\n)\n \nlistdata = ToxicityListDataset(\n   data=data,\n   max_len=512,\n   token_reward=ConstantTokenReward(0.0)\n)\n \nmodel = RobertaBinaryRewardModel(\n   data=listdata,\n   device='cuda',\n   roberta_kind='roberta-base',\n   freeze_roberta=False,\n   reward_cuttoff=0.0\n)\n \nmodel.load_state_dict(torch.load(convert_path('outputs/toxicity/upvote_reward/model.pkl'), map_location='cpu'))\n \nreward_f = model_reward(model=model)\n```\n \n**Description:**\n \nThe \"upvotes model\" reward from our paper, which gives a reward of +1 if the given model predicts that the comment will get a positive number of upvotes and a reward of -1 otherwise. The model checkpoint we used for our experiments is at: `outputs/toxicity/upvote_reward/model.pkl`\n \n**Inputs:**\n* `model: RewardModel` – the reward model implemented in `src/toxicity/reward_model.py`. The model should be first trained and loaded from a PyTorch checkpoint.\n \n---\n \n### **`RedditData`:**\n \n`RedditData`, implemented in `src/toxicity/reddit_comments_base.py`,  stores the raw Reddit comments data.\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, path: str, indexes: Optional[List[int]], reward_f: Optional[Callable[[str], float]], reward_cache: Optional[Cache]=None, reward_shift: float=0.0, reward_scale: float=1.0) -\u003e None\n```\n \n**Inputs:**\n \n* `path: str` – the path to the Reddit data.\n* `indexes: Optional[List[int]]` – a list of indexes to create a split of the data. 
Randomly selected training, validation, and test splits are in the json files:\n   * `data/reddit_comments/train_idxs.json`\n   * `data/reddit_comments/eval_idxs.json`\n   * `data/reddit_comments/test_idxs.json`\n* `reward_f: Optional[Callable[[str], float]]` – the reward function to use.\n* `reward_cache: Optional[Cache]=None` – a cache of reward values, so you don't have to recompute them every time.\n* `reward_shift: float=0.0` – shift the reward by this amount.\n* `reward_scale: float=1.0` – scale the reward by this amount.\n \n**Returns:** `None`\n \n#\n \n#### **`__len__`**\n \n``` python\ndef __len__(self) -\u003e int\n```\n \n**Returns:** the size of the dataset.\n \n#\n \n#### **`__getitem__`**\n \n``` python\ndef __getitem__(self, idx: int) -\u003e Scene\n```\n \n**Inputs:**\n* `idx: int` – the dataset index.\n \n**Returns:** an item from the dataset.\n \n---\n \n### **`ToxicityListDataset`:**\n \n`ToxicityListDataset`, implemented in `src/toxicity/toxicity_dataset.py`, wraps around `RedditData` and converts it into a `DataPoint` format that can be used to train offline RL agents.\n \n---\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, data: RedditData, max_len: Optional[int], token_reward: TokenReward, cuttoff: Optional[float]=None, resample_timeout: float=0.0, include_parent: bool=True) -\u003e None\n```\n \n**Inputs:**\n \n* `data: RedditData` – a Reddit comment data object that stores all the raw data.\n* `max_len: Optional[int]` – the maximum sequence length in the dataset, will truncate all token sequences to this length. If `None`, then sequences will not be truncated.\n* `token_reward: TokenReward` – the token-level reward to apply to the sequences. We use a constant reward of 0 per-token for all experiments.\n* `cuttoff: Optional[float]=None` – filter out all comments from the dataset with reward less than `cuttoff`. If `None`, no data will be filtered. 
Used with %BC models.\n* `resample_timeout: float=0.0` – when `cuttoff` is not equal to `None`, comments are stochastically sampled i.i.d. from the dataset, like an iterable, even though the dataset has a list-type interface. It uniformly re-samples from the dataset until it finds a comment with a reward that satisfies the cuttoff. In the case of the \"toxicity\" reward, this re-sampling can cause rate-limit errors on the GPT-3 API, so we allow you to add a `resample_timeout` to fix this issue: a timeout of roughly 0.05 should fix rate-limit issues.\n* `include_parent: bool=True` – whether to condition on the parent comment in the thread. If `False`, models will be trained to generate comments unconditionally.\n \n**Returns:** `None`\n \n#\n \n#### **`size`**\n \n``` python\ndef size(self) -\u003e int\n```\n \n**Returns:** the size of the dataset.\n \n#\n \n#### **`get_item`**\n \n``` python\ndef get_item(self, idx: int) -\u003e DataPoint\n```\n \n**Inputs:**\n* `idx: int` – the dataset index.\n \n**Returns:** a `DataPoint` from the dataset.\n \n---\n \n### **`ToxicityEnvironment`:**\n \n`ToxicityEnvironment`, implemented in `src/toxicity/toxicity_env.py`, defines the Reddit comment generation environment, which our offline RL agents interact with at evaluation time.\n \n#\n \n#### **`__init__`**\n \n``` python\ndef __init__(self, data: RedditData, reward_f: Optional[Callable[[str], float]], reward_shift: float=0.0, reward_scale: float=1.0, include_parent: bool=True) -\u003e None\n```\n \n**Inputs:**\n \n* `data: RedditData` – the dataset used to select initial state parent comments to condition on.\n* `reward_f: Optional[Callable[[str], float]]` – the reward function to use.\n* `reward_shift: float=0.0` – shift the reward by this amount.\n* `reward_scale: float=1.0` – scale the reward by this amount.\n* `include_parent: bool=True` – specifies whether to condition on the previous comment or post in the Reddit thread.\n \n**Returns:** `None`\n \n#\n \n#### 
**`step`**\n \n``` python\ndef step(self, action: str) -\u003e Tuple[WordleObservation, float, bool]\n```\n \n**Inputs:**\n* `action: str` – the agent's action, given as a string.\n \n**Returns:** an (observation, reward, terminal) tuple.\n \n#\n \n#### **`reset`**\n \n``` python\ndef reset(self) -\u003e WordleObservation\n```\n \n**Returns:** an observation\n \n#\n \n#### **`is_terminal`**\n \n``` python\ndef is_terminal(self) -\u003e bool\n```\n \n**Returns:** a boolean indicating if the interaction has terminated.\n \n---\n \n## Reddit comment Training and Evaluation Scripts\n \nTraining scripts are in `scripts/train/toxicity/`.\n \n| script      | description |\n| ----------- | ----------- |\n| `train_bc.py` | Train a BC agent. |\n| `train_iql.py` | Train an ILQL agent. |\n| `train_upvote_reward.py` | Train the upvote reward model. |\n \nEvaluation scripts are in `scripts/eval/toxicity/`.\n \n| script      | description |\n| ----------- | ----------- |\n| `eval_policy.py` | Evaluate an agent in the Reddit comments environment. |\n| `distill_policy_eval.py` | Prints out the result of `eval_policy.py` with error bars. |\n \n# Creating Your Own Tasks\n \nAll tasks – Wordle, Visual Dialogue, Reddit – have a corresponding environment and dataset implemented in the codebase, as described above. All offline RL algorithms in the codebase are trained, executed, and evaluated on one of these environments and datasets.\n\nYou can similarly define your own tasks and run them with any of these offline RL algorithms. This codebase implements a simple set of RL environment abstractions that make it possible to define your own environments and datasets that can plug-and-play with any of the offline RL algorithms.\n\nAll of the core abstractions are defined in `src/data/`. Here we outline what needs to be implemented in order to create your own tasks. For examples, see the implementations in `src/wordle/`, `src/visdial/`, and `src/toxicity/`.\n \n## 1. 
Create an environment and define observations:\n \nAll tasks must implement subclasses of: `Language_Observation` and `Language_Environment`, which are in `src/data/language_environment.py`.\n \n### **`Language_Observation`:**\nThis class represents the observations from the environment that will be input to your language model.\n \nA `Language_Observation` must define the following two functions.\n \n---\n \n#### **`to_sequence`**\n \n``` python\ndef to_sequence(self) -\u003e Tuple[List[Tuple[str, Optional[float]]], bool]:\n```\n \n**Description:**\n \nA function which converts the observation object into a standard format that can be input to the language model and used for training.\n \n**Returns:**\n1. a list of (utterance, reward) tuples. The tuples are meant to represent alternating environment interactions: your agent's utterance and the environment's response. Utterances corresponding to the environment response should have `reward=None`, and those corresponding to the agent's utterances should have a float reward.\n2. a boolean indicating whether this observation is the last one in the interaction.\n \n#\n \n#### **`__str__`**\n \n``` python\ndef __str__(self) -\u003e str:\n```\n \n**Description:**\n \nThis is only used to print the observation to the terminal. 
It should convert the observation into some kind of string that is interpretable by a user.\n \n**Returns:** a string.\n \n---\n \n### **`Language_Environment`:**\nThis class represents a gym-style environment for online interaction, which is only used for evaluation.\n \nA Language_Environment must define the following three functions.\n \n---\n \n#### **`step`**\n \n``` python\ndef step(self, action: str) -\u003e Tuple[Language_Observation, float, bool]:\n```\n \n**Description:**\n \nJust like a standard gym environment, given an action in the form of a string, step the environment forward.\n \n**Returns:** a tuple of (Language_Observation, reward, terminal).\n \n#\n \n#### **`reset`**\n \n``` python\ndef reset(self) -\u003e Language_Observation:\n```\n \n**Description:**\n \nThis resets the environment to an initial state.\n \n**Returns:** the corresponding initial `Language_Observation`\n \n#\n \n#### **``is_terminal``**\n \n``` python\ndef is_terminal(self) -\u003e bool:\n```\n \n**Description:**\n \nOutputs whether the environment has reached a terminal state.\n \n**Returns:** a boolean indicating if the environment has reached a terminal state.\n \n---\n \n### **2. 
Create a Dataset:**\n \nAll tasks must implement subclasses of either `List_RL_Dataset` or `Iterable_RL_Dataset` or both, which are defined in `src/data/rl_data.py`.\n \n### **`List_RL_Dataset`:**\nThis class represents a list dataset (or an indexable dataset of finite length) that can be used to train offline RL agents.\n \nA `List_RL_Dataset` must define the following two functions.\n \n---\n \n#### **`get_item`**\n \n``` python\ndef get_item(self, idx: int) -\u003e DataPoint\n```\n \n**Description:**\n \nThis gets an item from the dataset at a given index.\n \n**Returns:** a `DataPoint` object from the dataset.\n \n#\n \n#### **``size``**\n \n``` python\ndef size(self) -\u003e int\n```\n \n**Description:**\n \nReturns the size of the dataset.\n \n**Returns:** the dataset's size.\n \n---\n \n### **`Iterable_RL_Dataset`:**\nThis class represents an iterable dataset (or a non-indexable dataset that stochastically samples datapoints i.i.d.) that can be used to train offline RL agents.\n \nAn `Iterable_RL_Dataset` must define the following function.\n \n---\n \n#### **``sample_item``**\n \n``` python\ndef sample_item(self) -\u003e DataPoint\n```\n \n**Description:**\n \nSamples a datapoint from the dataset.\n \n**Returns:** a `DataPoint` object from the dataset.\n \n---\n \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsea-snell%2Fimplicit-language-q-learning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsea-snell%2Fimplicit-language-q-learning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsea-snell%2Fimplicit-language-q-learning/lists"}