https://github.com/sukiboo/rl_personalization_challenge
Challenge of solving a personalization task with RL methods.
- Host: GitHub
- URL: https://github.com/sukiboo/rl_personalization_challenge
- Owner: sukiboo
- License: mit
- Created: 2022-02-14T23:29:01.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2022-03-26T23:51:10.000Z (over 3 years ago)
- Last Synced: 2025-01-21T12:35:53.647Z (9 months ago)
- Topics: contextual-bandits, openai-gym, personalization, reinforcement-learning, simulation
- Language: Python
- Homepage:
- Size: 16.6 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
# Reinforcement Learning Personalization Challenge
In this challenge your goal is to train an RL agent to solve a personalization task that is simulated as a synthetic contextual bandit. In the given environment the state space is continuous and is represented by a `100`-dimensional hypercube, and the action space is discrete and consists of `100` fixed `100`-dimensional vectors. The reward signal is a bit convoluted in nature, but it is intentionally designed this way to mimic human behavioral preferences (*and we can discuss what it actually means or how it is designed in more detail*).
The rationale behind such an environment is the following: the set of available actions represents the *possible recommendations*, and the observed states are parameterized representations of the *persons* to whom the recommendations are provided; for each state-action pair, the received reward value represents the *suitability* of the provided recommendation for the given person, with `1` being the best recommendation and `-1` being the worst.
## Setup
Install the dependencies with `pip install -r requirements.txt`, then run with `python main.py`.

## Files
Essentially this simple repository consists of the following files:
* `environment.py` --- contains the class `SampleEnv` that creates an OpenAI Gym contextual bandit environment
* `main.py` --- trains a policy gradient agent, serving as a basic baseline --- **modify this file to implement and train your agent**

## Environment
The generated `SampleEnv` environment inherits from `gym.Env` and, as such, has the following methods:
* `reset()` --- observe a new state
* `step(action)` --- take an action and return the result

The above methods are technically sufficient to solve the environment.
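For illustration, a minimal interaction loop could look like the sketch below. It assumes `SampleEnv` can be constructed with default arguments and that `step` follows the classic Gym signature of returning an observation, a reward, a done flag, and an info dict; check `environment.py` for the actual interface.

```python
import numpy as np
from environment import SampleEnv  # the class provided by this repository

env = SampleEnv()       # assumed default constructor
state = env.reset()     # a 100-dimensional state vector

for _ in range(10):
    action = np.random.randint(100)  # a random action index, purely for illustration
    # classic Gym step signature assumed: (observation, reward, done, info)
    state, reward, done, info = env.step(action)
    print(f'reward: {reward:+.4f}')
    if done:
        state = env.reset()
```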
Other useful methods include
* `evaluate_agent(agent)` --- compute the *deterministic* performance of the agent's policy on the environment
* `restart()` --- fully recreate the environment; should be called between the training of different agents for reproducibility
* `observe(num=1)` --- observe new states; identical to `reset` but can sample multiple states (`num`) simultaneously
* `compute_reward(s,a_ind)` --- compute the *normalized* reward for a state `s` and an action index `a_ind`
* `compute_reward_raw(s,a)` --- compute the *un-normalized* reward value of a state-action pair `(s,a)`
* `print_action_histogram()` --- print the histogram of the optimal actions; ideally an agent should provide a similar histogram
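Put together, these helpers allow for quick sanity checks like the sketch below. It is illustrative only: since the action space has just `100` elements, the optimal action for a state can be brute-forced with `compute_reward`, which should return `1` under the normalization described in the Results section; the exact signatures should be verified against `environment.py`.

```python
import numpy as np
from environment import SampleEnv

env = SampleEnv()

# sample a small batch of states at once instead of calling reset() repeatedly
states = env.observe(num=5)

# brute-force the optimal action per state (only 100 actions, so this is cheap)
for s in states:
    best_action = max(range(100), key=lambda a_ind: env.compute_reward(s, a_ind))
    print(best_action, env.compute_reward(s, best_action))  # expected to be 1 after normalization

# distribution of optimal actions that a good agent should roughly reproduce
env.print_action_histogram()

# fully recreate the environment before training the next agent, for reproducibility
env.restart()
```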
## Results
By default the reward values returned by the environment are *normalized*, i.e. the optimal reward for any state `s` is `1` and the average reward is `0`.
Hence any sensible agent should achieve a positive return, and the optimal agent has a return of `1`.
For example, the current baseline agent achieves a performance score of `0.2318`.

The intended outcome is to train an agent that demonstrates *good* performance, e.g. `> 0.8` or so.
If you manage to obtain such an agent, please let me know!
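Finally, for anyone who prefers a self-contained starting point over reading `main.py`, here is a very rough sketch of a softmax policy-gradient agent for this kind of contextual bandit. It is *not* the repository's baseline implementation; the training loop again assumes the classic Gym `step` signature, and the `act` method is only a guess at the interface that `evaluate_agent` might expect.

```python
import numpy as np
from environment import SampleEnv

class SoftmaxPolicyAgent:
    """Minimal REINFORCE-style agent for a contextual bandit (illustrative sketch)."""

    def __init__(self, state_dim=100, num_actions=100, lr=0.01):
        self.W = np.zeros((num_actions, state_dim))  # linear logits, one row per action
        self.lr = lr

    def policy(self, state):
        logits = self.W @ state
        logits -= logits.max()                       # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()

    def act(self, state):
        return np.random.choice(len(self.W), p=self.policy(state))

    def update(self, state, action, reward):
        # gradient of log pi(action|state) for a linear softmax policy
        probs = self.policy(state)
        grad = -np.outer(probs, state)
        grad[action] += state
        # normalized rewards average to 0, so the raw reward works as an advantage
        self.W += self.lr * reward * grad

if __name__ == '__main__':
    env = SampleEnv()
    agent = SoftmaxPolicyAgent()
    for _ in range(10000):
        s = env.reset()
        a = agent.act(s)
        _, r, _, _ = env.step(a)                     # classic Gym signature assumed
        agent.update(s, a, r)
    print(env.evaluate_agent(agent))                 # assumed to accept an object with act()
```

Whether this reaches the `0.2318` baseline, let alone `0.8`, will depend on tuning that is outside the scope of this sketch.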