https://github.com/clvoloshin/COBS
OPE tools based on the Empirical Study of Off-Policy Policy Evaluation paper.
- Host: GitHub
- URL: https://github.com/clvoloshin/COBS
- Owner: clvoloshin
- Created: 2019-11-14T19:48:38.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2022-08-09T17:19:51.000Z (over 2 years ago)
- Last Synced: 2024-08-04T02:09:10.099Z (4 months ago)
- Language: Python
- Size: 13.6 MB
- Stars: 60
- Watchers: 3
- Forks: 14
- Open Issues: 4
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-model-based-RL - ope-tools
- awesome-offline-rl - COBS: Caltech OPE Benchmarking Suite
README
# Caltech OPE Benchmarking Suite (COBS)
## Introduction
COBS is an Off-Policy Policy Evaluation (OPE) Benchmarking Suite. The goal is to provide fine experimental control to carefully tease out an OPE method's performance across many key conditions.
We'd like to make this repo as useful as possible for the community. We are committed to continual refactoring and code review to make sure COBS continues to serve its purpose. Help is always appreciated!
COBS is based on the paper Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning (https://arxiv.org/abs/1911.06854).
## Getting started
### Tutorial
To get started with the experimental tools, see [Tutorial.ipynb](https://github.com/clvoloshin/COBS/blob/master/Tutorial.ipynb). For standalone examples, see [example_tabular.py](https://github.com/clvoloshin/COBS/blob/master/example_tabular.py) and [example_nn.py](https://github.com/clvoloshin/COBS/blob/master/example_nn.py).
To run [example_tabular.py](https://github.com/clvoloshin/COBS/blob/master/example_tabular.py):
```
python3 example_tabular.py tabular_example_cfg.json
```

To run [example_nn.py](https://github.com/clvoloshin/COBS/blob/master/example_nn.py):
```
python3 example_nn.py nn_example_cfg.json
```

### Paper Reproducibility
We have migrated from TensorFlow to PyTorch and made COBS easier to use. For the original TF implementation and for reproducing the paper, please see the paper branch. Run paper.py using the instructions provided at the bottom of that file.
### Installation
Tested on Python 3.6+.
```
python3 -m venv cobs-env
source cobs-env/bin/activate
pip3 install -r requirements.txt
```

## Experiment Configuration
See an example experiment [configuration](https://github.com/clvoloshin/COBS/blob/master/cfgs/nn_example_cfg.json). The configuration contains two parts: the experiment section and the models section. The experiment section instantiates the environment and sets general parameters. The models section specifies which methods to run and their specific parameters.
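Configurations are plain JSON files passed to the example scripts. As a minimal sketch (standard library only; the example scripts may parse and validate the file differently), a config can be loaded and inspected like this:
```
import json

# Load an experiment configuration (path taken from the example above).
with open("cfgs/nn_example_cfg.json") as f:
    cfg = json.load(f)

experiment = cfg["experiment"]  # environment and general parameters
models = cfg["models"]          # which methods to run, and their parameters

print("gamma:", experiment["gamma"])
print("methods:", list(models.keys()))
```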
The experiment section looks like:
```
"experiment": {
"gamma": 0.98, # discount factor
"horizon": 5, # horizon of the environment
"base_policy": 0.8, # Probability of deviation from greedy for base policy.
# Note: This parameter means different things depending on the type of policy
"eval_policy": 0.2, # Probability of deviation from greedy for evaluation policy.
# Note: This parameter means different things depending on the type of policy
"stochastic_env": true, # Make environment have stochastic transitions
"stochastic_rewards": false, # Make environment have stochastic rewards
"sparse_rewards": false, # Make environment have sparse rewards
"num_traj": 8, # Number of trajectories to collect from base_policy/behavior_policy (pi_b)
"is_pomdp": false, # Make the environment a POMDP
"pomdp_horizon": 2, # POMDP horizon, if POMDP is true
"seed": 1000, # Seed
"experiment_number": 0, # Label for experiment. Used for distributed compute
"access": 0, # Credentials for AWS. Used for distributed compute
"secret": 0, # Credentials for AWS. Used for distributed compute
"to_regress_pi_b": {
"to_regress": false, # Should we regress pi_b? Is it unknown?
"model": "defaultCNN", # What model to fit pi_b with
# Note: To add your own, see later in the README.md
"max_epochs": 100, # Max number of fitting iterations
"batch_size": 32, # Minibatch size
"clipnorm": 1.0 # Gradient clip
},
"frameskip": 1, # (x_t, a, r, x_{t+frameskip}). Apply action "a" frameskip number of times
"frameheight": 1 # (x_{t:t+frameheight}, a, r, x_{t+1:t+1+frameheight}). State is consider a concatenation of frameheight number of states
},
```
and the models section (TODO: rename to methods section) looks like:
```
"models": {
"FQE": {
"model": "defaultCNN", # What model to fit FQE with
# Note: To add your own, see later in the README.md
"convergence_epsilon": 1e-4, # When to stop iterations
"max_epochs": 100, # Max number of fitting iterations
"batch_size": 32, # Minibatch size
"clipnorm": 1.0 # Gradient clip
},
"Retrace": {
"model": "defaultCNN",
"convergence_epsilon": 1e-4,
"max_epochs": 3,
"batch_size": 32,
"clipnorm": 1.0,
"lamb": 0.9 # Lambda, parameter for this family of method
},
...
```

## Environments
To add a new environment, implement an OpenAI gym-like environment and place it in [the envs directory](https://github.com/clvoloshin/COBS/tree/master/ope/envs). The environment should implement the reset, step, and (optionally) render functions. Each environment must also define two variables:
```
self.n_dim # The number of states (if discrete), otherwise set this to 1.
self.n_actions # The number of possible actions
```

See [Tutorial.ipynb](https://github.com/clvoloshin/COBS/blob/master/Tutorial.ipynb) for how to instantiate the environment during an experiment.
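For illustration, here is a minimal sketch of such an environment (a made-up two-state chain; the class name and dynamics are hypothetical, but it follows the gym-like interface and defines the two variables described above):
```
import numpy as np

class ToyChainEnv(object):
    """Tiny gym-like environment: two states, two actions."""

    def __init__(self):
        self.n_dim = 2        # number of states (discrete), otherwise set to 1
        self.n_actions = 2    # number of possible actions
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves toward the terminal state with high probability; action 0 stays put.
        if action == 1 and np.random.rand() < 0.9:
            self.state = 1
        reward = 1.0 if self.state == 1 else 0.0
        done = self.state == 1
        return self.state, reward, done, {}

    def render(self):
        print("state:", self.state)
```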
## Baselines
Current Direct Method Baselines: FQE, Retrace, Tree-Backup, Q^pi(lambda), Q-Reg, MRDR, IH, MBased.
### Direct Method
To add a new Direct Method, implement one of the [Direct Method classes](https://github.com/clvoloshin/COBS/blob/master/ope/algos/direct_method.py) and put the new method in the [algos directory](https://github.com/clvoloshin/COBS/blob/master/ope/algos).
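The two subsections below show how a fitted method is called from experiment.py. As a rough, hypothetical skeleton (the method names mirror those call sites; the exact interface required by the classes in direct_method.py may differ):
```
# Hypothetical skeleton; see ope/algos/direct_method.py for the real base classes.
class NewMethod(object):
    def fit(self, behavior_data, pi_e, cfg, model_name):
        # Fit the method on the logged behavior data for evaluation policy pi_e.
        ...

    def get_Qs_for_data(self, behavior_data, cfg):
        # Q-function-based variant: return Q(s, a) for every state in the
        # data and every action in the action space.
        ...

    def evaluate(self, behavior_data, cfg):
        # Weight-function-based variant: return the value estimate directly.
        ...
```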
#### Q Function Based
Suppose your new method is called NewMethod and it works by fitting a Q function. Modify line 149 of [experiment.py](https://github.com/clvoloshin/COBS/blob/master/ope/experiment_tools/experiment.py) by adding:
```
...
elif 'NewMethod' == model:
    new_method = NewMethod() ## Instantiates the method
    new_method.fit(behavior_data, pi_e, cfg, cfg.models[model]['model']) ## Fits the method
    new_method_Qs = new_method.get_Qs_for_data(behavior_data, cfg) ## Gets Q(s, a) for each s in the data and each a in the action space
    out = self.estimate(new_method_Qs, behavior_data, gamma, model, true) ## Gets the Direct and Hybrid estimates and their error
    dic.update(out)
...
```

#### Weight Function Based
Suppose your new method is called NewMethod and it works by fitting a weight function. Modify line 149 of [experiment.py](https://github.com/clvoloshin/COBS/blob/master/ope/experiment_tools/experiment.py) by adding:
```
...
elif 'NewMethod' == model:
    new_method = NewMethod() ## Instantiates the method
    new_method.fit(behavior_data, pi_e, cfg, cfg.models[model]['model']) ## Fits the method
    new_method_output = new_method.evaluate(behavior_data, cfg) ## Evaluates the method
    dic.update({'NewMethod': [new_method_output, (new_method_output - true)**2]}) ## Updates the results
...
```

### Hybrid Method
Current Hybrid Method Baselines: DR, WDR, MAGIC
TODO: How to add your own.
### IPS Method
Current IPS Method Baselines: Naive, IS, Per-Decision IS, WIS, Per-Decision WIS
TODO: How to add your own.
## Adding a new model to the configuration
Add your own NN architecture in a new file in the [models directory](https://github.com/clvoloshin/COBS/tree/master/ope/models). Then modify the get_model_from_name function in [factory.py](https://github.com/clvoloshin/COBS/blob/master/ope/experiment_tools/factory.py):
```
from ope.models.YourNN import YourNN

def get_model_from_name(name):
    ...
    elif name == 'YourNN':
        return YourNN
    ...
```
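For illustration, a hypothetical ope/models/YourNN.py might look like the sketch below (it assumes a plain PyTorch module; the constructor signature and any extra methods COBS expects from a model class are assumptions here, so mirror an existing model such as defaultCNN):
```
# ope/models/YourNN.py (hypothetical sketch)
import torch.nn as nn

class YourNN(nn.Module):
    """Small MLP mapping a state vector to one output per action."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)
```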
You can now add your own NN as a method's model in the configuration:
```
"SomeMethod": {
"model": "YourNN", # YourNN model
...other params....
},
```

## Policies
There are currently two available policy types.
1. Basic Policy: pi(.|s) = [prob(a=0), prob(a=1),..., prob(a=n)]
2. Epsilon-Greedy: pi(.|s) = Greedy(s) with probability 1-e, and a uniformly random action from {1,...,n} otherwise
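As a small illustration of the second policy type (plain numpy, not code from COBS), the epsilon-greedy action distribution can be computed as:
```
import numpy as np

def epsilon_greedy_probs(q_values, eps):
    """pi(.|s): mass eps spread uniformly over all actions, plus 1-eps on the greedy action."""
    n = len(q_values)
    probs = np.full(n, eps / n)
    probs[np.argmax(q_values)] += 1.0 - eps
    return probs

print(epsilon_greedy_probs(np.array([0.1, 0.7, 0.2]), eps=0.2))  # ~[0.067, 0.867, 0.067]
```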
If you use COBS, please use the following BibTeX entry.

```
@inproceedings{
voloshin2021empirical,
title={Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning},
author={Cameron Voloshin and Hoang Minh Le and Nan Jiang and Yisong Yue},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
year={2021},
url={https://openreview.net/forum?id=IsK8iKbL-I}
}
```