https://github.com/banditml/offline-policy-evaluation
Implementations and examples of common offline policy evaluation methods in Python.
counterfactual-learning counterfactual-policy-evaluation doubly-robust importance-sampling off-policy-evaluation offline-policy-evaluation
- Host: GitHub
- URL: https://github.com/banditml/offline-policy-evaluation
- Owner: banditml
- License: other
- Created: 2020-03-10T03:09:14.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-02-11T00:33:53.000Z (over 3 years ago)
- Last Synced: 2025-03-30T01:07:43.663Z (over 1 year ago)
- Topics: counterfactual-learning, counterfactual-policy-evaluation, doubly-robust, importance-sampling, off-policy-evaluation, offline-policy-evaluation
- Language: Python
- Homepage:
- Size: 1.17 MB
- Stars: 222
- Watchers: 6
- Forks: 25
- Open Issues: 9
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# Offline policy evaluation
[PyPI version](https://badge.fury.io/py/offline-evaluation) [Code style: black](https://github.com/ambv/black) [Downloads](https://pepy.tech/project/offline-evaluation)
Implementations and examples of common offline policy evaluation methods in Python. For more information on offline policy evaluation, see this [tutorial](https://edoconti.medium.com/offline-policy-evaluation-run-fewer-better-a-b-tests-60ce8f93fa15).
## Installation
`pip install offline-evaluation`
## Usage
```
import pandas as pd  # needed for the DataFrame of logs below

from ope.methods import doubly_robust
```
Get some historical logs generated by a previous policy:
```
df = pd.DataFrame([
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20},
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10},
])
```
Define a function that computes `P(action | context)` under the new policy:
```
def action_probabilities(context):
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}
```
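Before running an evaluation, it can help to confirm that the policy function returns a valid probability distribution for every context. The sketch below is one way to do that; `check_policy` is a hypothetical helper written for this example, not part of the library.

```python
import math


def action_probabilities(context):
    # Same epsilon-greedy policy as above: mostly block when p_fraud > 0.10.
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}


def check_policy(policy, contexts):
    """Hypothetical helper: check that each output is a valid distribution."""
    for context in contexts:
        probs = policy(context)
        assert all(p >= 0 for p in probs.values()), probs
        assert math.isclose(sum(probs.values()), 1.0), probs
    return True


low_risk = action_probabilities({"p_fraud": 0.02})
high_risk = action_probabilities({"p_fraud": 0.40})
print(low_risk)
print(high_risk)
print(check_policy(action_probabilities, [{"p_fraud": 0.02}, {"p_fraud": 0.40}]))
```

The low-risk context should put most of its probability on `allowed`, and the high-risk context most of it on `blocked`.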
Conduct the evaluation:
```
doubly_robust.evaluate(df, action_probabilities)
> {'expected_reward_logging_policy': 3.33, 'expected_reward_new_policy': -28.47}
```
The estimated expected reward under the new policy (-28.47) is far below that of the logging policy (3.33), so the new policy looks substantially worse on these logs. Instead of A/B testing it online, it would be better to evaluate some other candidate policies offline first.
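To give a rough sense of where an estimate like this comes from, the sketch below computes an inverse propensity scoring (IPS) estimate by hand on the same six log rows, in plain Python rather than with the library. The names `logs` and `ips_estimate` are made up for this example, and the library's own estimators may differ in their details, so the result is not expected to match the doubly robust number above.

```python
# Hand-rolled IPS sketch; an illustration, not the library's implementation.
logs = [
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20},
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10},
]


def action_probabilities(context):
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}


def ips_estimate(logs, policy):
    """Mean of reward * P_new(action | context) / P_logging(action | context)."""
    total = 0.0
    for row in logs:
        new_prob = policy(row["context"])[row["action"]]
        weight = new_prob / row["action_prob"]  # importance weight
        total += weight * row["reward"]
    return total / len(logs)


# The logging policy's value is simply the average logged reward.
logging_value = sum(row["reward"] for row in logs) / len(logs)
new_value = ips_estimate(logs, action_probabilities)
print(round(logging_value, 2), round(new_value, 2))
```

On these rows the sketch gives about 3.33 for the logging policy and about -23.33 for the new policy. Most of the gap comes from the fifth row: an allowed transaction with reward -20 that the logging policy chose with probability 0.10 but the new policy would choose with probability 0.90, giving it an importance weight of 9.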
See the examples in the repository for more detailed tutorials.
## Supported methods
- [x] Inverse propensity scoring
- [x] Direct method
- [x] Doubly robust ([paper](https://arxiv.org/abs/1503.02834))
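
As an informal illustration of how the direct method and the doubly robust estimator relate, the sketch below implements both from scratch on the six log rows from the usage example. It assumes a deliberately crude reward model (the mean logged reward per action, ignoring context); the function names are hypothetical, and the library fits its own reward model, so these numbers are not expected to match its output.

```python
# From-scratch sketch of the direct method (DM) and doubly robust (DR) estimators.
# Assumption: the reward model is just the mean logged reward per action.
from collections import defaultdict

logs = [
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20},
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10},
]


def action_probabilities(context):
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}


def fit_mean_reward_model(logs):
    """Return {action: mean logged reward}; a stand-in for a real reward model."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in logs:
        totals[row["action"]] += row["reward"]
        counts[row["action"]] += 1
    return {action: totals[action] / counts[action] for action in totals}


def direct_method_estimate(logs, policy, model):
    """Average the model's predicted reward under the new policy."""
    total = 0.0
    for row in logs:
        probs = policy(row["context"])
        total += sum(p * model.get(a, 0.0) for a, p in probs.items())
    return total / len(logs)


def doubly_robust_estimate(logs, policy, model):
    """DM prediction plus an importance-weighted correction of the model's error."""
    total = 0.0
    for row in logs:
        probs = policy(row["context"])
        baseline = sum(p * model.get(a, 0.0) for a, p in probs.items())
        weight = probs[row["action"]] / row["action_prob"]
        residual = row["reward"] - model.get(row["action"], 0.0)
        total += baseline + weight * residual
    return total / len(logs)


model = fit_mean_reward_model(logs)
dm_value = direct_method_estimate(logs, action_probabilities, model)
dr_value = doubly_robust_estimate(logs, action_probabilities, model)
print(round(dm_value, 2), round(dr_value, 2))
```

With this context-blind reward model the direct method gives about 3.07, while the doubly robust estimate is about -28.93. The importance-weighted residual term is what corrects the model's failure to predict the large negative reward on the fifth row.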