https://github.com/banditml/offline-policy-evaluation

# Offline policy evaluation
[![PyPI version](https://badge.fury.io/py/offline-evaluation.svg)](https://badge.fury.io/py/offline-evaluation) [![](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black) [![Downloads](https://static.pepy.tech/personalized-badge/offline-evaluation?period=total&units=international_system&left_color=black&right_color=brightgreen&left_text=Downloads)](https://pepy.tech/project/offline-evaluation)

Implementations and examples of common offline policy evaluation methods in Python. For more information on offline policy evaluation, see this [tutorial](https://edoconti.medium.com/offline-policy-evaluation-run-fewer-better-a-b-tests-60ce8f93fa15).

## Installation
`pip install offline-evaluation`

## Usage
```python
import pandas as pd
from ope.methods import doubly_robust
```

Get some historical logs generated by a previous (logging) policy:
```python
df = pd.DataFrame([
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20},
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10},
])
```
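Because these rows were produced by the logging policy itself, estimating that policy's value needs no reweighting: a plain average of the observed rewards is enough. A minimal sketch in plain Python (no pandas, and not calling the `ope` package) shows the arithmetic; the result matches the `expected_reward_logging_policy` value of 3.33 reported by the evaluation in this section.

```python
# Observed rewards from the six logged rows above.
rewards = [0, 20, 10, 20, -20, -10]

# The logs were generated by the logging policy, so its expected reward
# can be estimated directly as the mean observed reward.
logging_value = sum(rewards) / len(rewards)
print(round(logging_value, 2))  # 3.33
```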
Define a function that computes `P(action | context)` under the new policy:
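```python
def action_probabilities(context):
    # Mostly block likely fraud (p_fraud > 0.10), mostly allow everything else,
    # keeping epsilon probability on the other action.
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}
```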
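The new policy never generated these logs, so estimators such as inverse propensity scoring and doubly robust reweight each logged row by the ratio P_new(action | context) / P_log(action | context), where P_log is the logged `action_prob`. A minimal sketch of that weighting step on the six rows above, written in plain Python for illustration and not calling the `ope` package:

```python
def action_probabilities(context):
    # New policy from the README: mostly block when p_fraud > 0.10, mostly allow otherwise.
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}

# (p_fraud, logged action, logging-policy propensity) for the six logged rows.
logged = [
    (0.08, "blocked", 0.90),
    (0.03, "allowed", 0.90),
    (0.02, "allowed", 0.90),
    (0.01, "allowed", 0.90),
    (0.09, "allowed", 0.10),
    (0.40, "allowed", 0.10),
]

# Importance weight = P_new(action | context) / P_log(action | context).
weights = [
    action_probabilities({"p_fraud": p})[action] / prob
    for p, action, prob in logged
]
print([round(w, 2) for w in weights])  # [0.11, 1.0, 1.0, 1.0, 9.0, 1.0]
```

The fifth row dominates: the logging policy allowed that transaction with probability 0.10, the new policy would allow it with probability 0.90, and it lost 20, so its reward counts nine times over in any importance-weighted estimate.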
Conduct the evaluation:
```python
doubly_robust.evaluate(df, action_probabilities)
# {'expected_reward_logging_policy': 3.33, 'expected_reward_new_policy': -28.47}
```
This means the new policy is estimated to perform much worse than the logging policy (an expected reward of -28.47 versus 3.33). Instead of A/B testing this new policy online, it would be better to evaluate other candidate policies offline first.

See the examples for more detailed tutorials.

## Supported methods

- [x] Inverse propensity scoring
- [x] Direct method
- [x] Doubly robust ([paper](https://arxiv.org/abs/1503.02834))
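The three supported methods can be sketched in plain Python on the six logged rows from the Usage section. This is a minimal illustration under stated assumptions, not the `ope.methods` implementation: the function names are illustrative, and the reward model `fit_mean_reward_model` is hypothetical and deliberately crude (it predicts the average logged reward of each action and ignores the context). Inverse propensity scoring reweights each logged reward by P_new(action | context) / `action_prob`; the direct method averages the model's predicted reward under the new policy's action distribution; doubly robust adds an importance-weighted correction of the model's errors to the direct-method estimate.

```python
from collections import defaultdict

# Logged rows from the Usage section.
LOGS = [
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20},
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10},
]

def action_probabilities(context):
    """New policy from the Usage section: P_new(action | context)."""
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}

def fit_mean_reward_model(logs):
    """Hypothetical reward model: mean logged reward per action, ignoring context."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in logs:
        totals[row["action"]] += row["reward"]
        counts[row["action"]] += 1
    return {action: totals[action] / counts[action] for action in counts}

def importance_weight(row, policy):
    """P_new(logged action | context) / P_log(logged action | context)."""
    return policy(row["context"])[row["action"]] / row["action_prob"]

def ips_estimate(logs, policy):
    """Inverse propensity scoring: mean of importance-weighted logged rewards."""
    return sum(importance_weight(r, policy) * r["reward"] for r in logs) / len(logs)

def direct_method_estimate(logs, policy, model):
    """Direct method: mean over contexts of the model's reward under the new policy."""
    total = 0.0
    for r in logs:
        total += sum(p * model.get(a, 0.0) for a, p in policy(r["context"]).items())
    return total / len(logs)

def doubly_robust_estimate(logs, policy, model):
    """Direct method plus an importance-weighted correction of the model's residuals."""
    correction = sum(
        importance_weight(r, policy) * (r["reward"] - model.get(r["action"], 0.0))
        for r in logs
    ) / len(logs)
    return direct_method_estimate(logs, policy, model) + correction

model = fit_mean_reward_model(LOGS)  # {"blocked": 0.0, "allowed": 4.0}
print(round(ips_estimate(LOGS, action_probabilities), 2))                   # -23.33
print(round(direct_method_estimate(LOGS, action_probabilities, model), 2))  # 3.07
print(round(doubly_robust_estimate(LOGS, action_probabilities, model), 2))  # -28.93
```

With this crude reward model the direct method barely moves away from the logging policy's value, while inverse propensity scoring and doubly robust are pulled down by the single heavily weighted row in which an action logged with probability 0.10 lost 20. The sketch's doubly robust value differs from the -28.47 reported by the package, which is expected if the package fits a different reward model.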