https://github.com/hartikainen/easy21
Reinforcement learning agents and environment for Easy21, a modified version of Blackjack
- Host: GitHub
- URL: https://github.com/hartikainen/easy21
- Owner: hartikainen
- Archived: true
- Created: 2017-04-11T02:21:34.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-05-07T05:04:05.000Z (about 8 years ago)
- Last Synced: 2025-01-31T13:45:20.109Z (4 months ago)
- Topics: easy21, monte-carlo, policy-gradient, reinforcement-learning, sarsa
- Language: Python
- Homepage:
- Size: 2.77 MB
- Stars: 14
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# easy21
This repository implements the assignments for the reinforcement learning course taught by David Silver [1]. It provides a reinforcement learning environment for Easy21, a simple card game presented in [2], and four different agents: Monte-Carlo, Sarsa(λ), linear value function approximation, and neural network policy gradient.
## Setting up the environment
To set up the Python environment you need Python 3.x, pip, and virtualenv:
```
git clone https://github.com/hartikainen/easy21.git
cd easy21
pip install virtualenv
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Running the game
To run the game and test the agents, run the `easy21.py` file as follows:
```
python easy21.py [-h] [-v VERBOSE] [-a {mc,sarsa,lfa}]
[--num-episodes NUM_EPISODES] [--lmbd LMBD] [--gamma GAMMA]
[--plot-v PLOT_V] [--dump-q DUMP_Q]
[--plot-lambda-mse PLOT_LAMBDA_MSE]
[--plot-learning-curve PLOT_LEARNING_CURVE]
```
See `easy21.py` for more information about running the game and testing the agents. All the agents are found in the `agents/` folder.
## Easy21 Environment
The Easy21 environment is implemented in the `Easy21Env` class found in `environment.py`. The environment keeps track of the game state (dealer card and sum of player cards) and exposes a `step` method which, given an action (hit or stick), updates its state and returns the observed state (in our case the observation is equivalent to the game state) and the reward.
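For illustration, here is a minimal, self-contained sketch of an Easy21-style environment with the interface described above. It follows the game rules from the assignment [2] (cards valued 1-10, red cards subtract with probability 1/3, black cards add with probability 2/3, the dealer sticks on 17 or more), but it is not the repository's `Easy21Env`; in particular the `reset` method and the `(observation, reward, done)` return signature are assumptions.
```
import random

def draw(black_only=False):
    """Draw a card valued 1-10: black (added) with prob 2/3, red (subtracted) with prob 1/3."""
    value = random.randint(1, 10)
    return value if black_only or random.random() < 2 / 3 else -value

class SketchEasy21Env:
    def reset(self):
        # Both the dealer and the player start with one black card.
        self.dealer = draw(black_only=True)
        self.player = draw(black_only=True)
        return (self.dealer, self.player)

    def step(self, action):
        """Apply 'hit' or 'stick' and return (observation, reward, done)."""
        if action == "hit":
            self.player += draw()
            if not 1 <= self.player <= 21:
                return (self.dealer, self.player), -1, True   # player goes bust
            return (self.dealer, self.player), 0, False
        # On 'stick', the dealer hits until reaching 17 or going bust.
        while 1 <= self.dealer < 17:
            self.dealer += draw()
        if not 1 <= self.dealer <= 21 or self.player > self.dealer:
            return (self.dealer, self.player), 1, True
        return (self.dealer, self.player), (-1 if self.player < self.dealer else 0), True
```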
## Monte-Carlo Control in Easy21
Monte-Carlo control for Easy21 is implemented in `agents/monte_carlo.py`. The default implementation uses a time-varying scalar step-size α_t = 1/N(s_t, a_t) and an ε-greedy exploration strategy with ε_t = N_0 / (N_0 + N(s_t)), where N_0 = 100, N(s) is the number of times state s has been visited, N(s, a) is the number of times action a has been selected from state s, and t is the time-step.
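As a rough sketch of these schedules (not the code in `agents/monte_carlo.py`), an every-visit Monte-Carlo control update against an environment with the `reset`/`step` interface sketched above could look like this, assuming an undiscounted return (γ = 1):
```
import random
from collections import defaultdict

N0 = 100
ACTIONS = ("hit", "stick")
Q = defaultdict(float)        # Q[(state, action)] estimates
N_s = defaultdict(int)        # N(s): visits per state
N_sa = defaultdict(int)       # N(s, a): selections per state-action pair

def epsilon_greedy(state):
    eps = N0 / (N0 + N_s[state])
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def mc_episode(env):
    """Play one episode, then update Q towards the observed return."""
    episode, state, done = [], env.reset(), False
    while not done:
        action = epsilon_greedy(state)
        N_s[state] += 1
        N_sa[(state, action)] += 1
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    G = 0.0
    for state, action, reward in reversed(episode):
        G += reward                       # undiscounted return from this step onwards
        Q[(state, action)] += (G - Q[(state, action)]) / N_sa[(state, action)]
```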
The figure below presents the optimal value function V\* against the game state (player sum and dealer hand).
## TD Learning in Easy21
File `agents/sarsa.py` implements Sarsa(λ) control for Easy21. It uses the same step-size and exploration schedules as the Monte-Carlo agent described in the previous section. The agent is tested with parameter values λ ∈ {0, 0.1, 0.2, ..., 1}, each run for 20,000 episodes. The first figure below presents the learning curve, i.e. the mean-squared error against the 'true' Q values as a function of the episode number, for each lambda. The next two figures plot the function V\* (same as in the Monte-Carlo section) for λ=0.0 and λ=1.0.
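A backward-view Sarsa(λ) episode with accumulating eligibility traces and the same N_0-based schedules might look roughly like the following sketch (not the contents of `agents/sarsa.py`; γ = 1 and the same environment interface as above are assumed):
```
import random
from collections import defaultdict

N0, GAMMA = 100, 1.0
ACTIONS = ("hit", "stick")
Q, N_s, N_sa = defaultdict(float), defaultdict(int), defaultdict(int)

def epsilon_greedy(state):
    eps = N0 / (N0 + N_s[state])
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_lambda_episode(env, lmbd):
    E = defaultdict(float)                     # eligibility traces, reset per episode
    state, done = env.reset(), False
    action = epsilon_greedy(state)
    while not done:
        N_s[state] += 1
        N_sa[(state, action)] += 1
        next_state, reward, done = env.step(action)
        next_action = None if done else epsilon_greedy(next_state)
        target = reward if done else reward + GAMMA * Q[(next_state, next_action)]
        delta = target - Q[(state, action)]
        E[(state, action)] += 1.0
        for sa in E:                           # update every traced state-action pair
            Q[sa] += (1.0 / N_sa[sa]) * delta * E[sa]
            E[sa] *= GAMMA * lmbd
        state, action = next_state, next_action
```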


### Bootstrapping in Easy21
As in other settings, bootstrapping in Easy21 reduces the variance of the learned policy at the cost of increased bias. The Sarsa(λ) agent learns a reasonable policy faster (i.e. in a smaller number of episodes) than the Monte-Carlo agent. In a game as simple as Easy21, however, it is feasible to run enough episodes for the Monte-Carlo agent to converge to the optimal unbiased policy.
Episodes in Easy21 last longer on average than in the traditional Blackjack game because of the subtractive effect of red cards. Because of this, bootstrapping is likely to be more useful in Easy21 than it would be in traditional Blackjack.
## Linear Function Approximation in Easy21
File `agents/function_approximation.py` implements a value function approximator with coarse coding for Easy21, using a binary feature vector φ(state, action) with 36 (3\*6\*2) features. Each binary feature takes the value 1 if (state, action) lies in the state-space cuboid corresponding to that feature and the action matches the action corresponding to that feature, and 0 otherwise. The cuboids are defined in the variable `CUBOID_INTERVALS` in `agents/function_approximation.py`.

Similarly to the Sarsa(λ) agent in the previous section, we run tests with 20,000 episodes for parameter values λ ∈ {0, 0.1, 0.2, ..., 1}, with a constant step-size α = 0.01 and exploration value ε = 0.05. The figures below plot the learning curve for each lambda, and the function V\* for λ=0.0 and λ=1.0.
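As a sketch of this coarse coding, the feature vector could be computed as below. The interval boundaries are taken from the assignment spec [2] and are assumptions here; the repository's `CUBOID_INTERVALS` variable is the authoritative definition.
```
import numpy as np

DEALER_INTERVALS = [(1, 4), (4, 7), (7, 10)]
PLAYER_INTERVALS = [(1, 6), (4, 9), (7, 12), (10, 15), (13, 18), (16, 21)]
ACTIONS = ("hit", "stick")

def features(state, action):
    """Return the 3*6*2 = 36-dimensional binary feature vector phi(state, action)."""
    dealer, player = state
    phi = np.zeros(len(DEALER_INTERVALS) * len(PLAYER_INTERVALS) * len(ACTIONS))
    i = 0
    for d_lo, d_hi in DEALER_INTERVALS:
        for p_lo, p_hi in PLAYER_INTERVALS:
            for a in ACTIONS:
                if d_lo <= dealer <= d_hi and p_lo <= player <= p_hi and a == action:
                    phi[i] = 1.0
                i += 1
    return phi
```
With linear approximation, Q(s, a) = φ(s, a) · w, and each Sarsa(λ) step updates the weights as w ← w + α δ e, where the eligibility trace e is kept over the 36 features instead of over state-action pairs.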



### Notes for function approximation
Approximating the state and action space reduces the time and space complexity of the algorithm\*\*, because the agent needs to learn far fewer parameters than there are state-action pairs. However, this comes at the cost of reduced accuracy of the learned action-value function Q (and thus of the value function V and the policy π). The overlapping regions of the cuboid intervals seem to result in more extreme values in some states; this happens because each (state, action) pair in the reconstructed Q function can be affected by multiple features through the approximation weights.
### The effect of constant step-size α
One thing to notice is the effect of the constant step-size used for the linear function approximation. Because the step-size is kept constant during learning, some regions of the value function receive much less training than others, resulting in an incorrect value function in the rarely visited extreme regions. This effect is tested with cuboid intervals corresponding to "identity" features, i.e. features that should result in exactly the same Q as tabular Sarsa(λ). Even after 50k episodes, the agent has spots in the state space (low dealer and low player; low dealer and high player) where it does not match the Sarsa(λ) results. This effect is presented in the figure below (for λ=0).
When using a dynamic step-size α, as in the Sarsa(λ) section above, this effect disappears and the function approximation produces the expected result, as shown in the figure below (again, for λ=0).

\*\* The actual running time is, in fact, worse than with Sarsa(λ), because my function approximation implementation does not fully utilize numpy vectorization.
## Policy Gradient in Easy21
File `agents/policy_gradient.py` implements a simple two-layer feed-forward neural network and a policy gradient agent that uses it. The network has 20 hidden neurons with ReLU non-linearities and a learning rate α = 5×10⁻⁴. The weights are initialized from a Gaussian distribution with mean μ = 0 and standard deviation σ = 10⁻⁴, and the biases are initially set to 0. The figure below shows the running average (over a 1000-episode window) of the reward received by the policy gradient agent over a total of 100,000 episodes.
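A rough numpy sketch of this kind of network and a REINFORCE-style update is shown below. The raw (dealer, player) input encoding, the per-step update, and the use of the plain episode return G are assumptions and not necessarily what `agents/policy_gradient.py` does.
```
import numpy as np

rng = np.random.default_rng(0)
H, ALPHA, SIGMA = 20, 5e-4, 1e-4                              # hidden size, learning rate, init std
W1 = rng.normal(0.0, SIGMA, size=(2, H)); b1 = np.zeros(H)    # input: (dealer, player)
W2 = rng.normal(0.0, SIGMA, size=(H, 2)); b2 = np.zeros(2)    # output logits: (hit, stick)

def policy(state):
    """Forward pass: return action probabilities, hidden activation, and input."""
    x = np.asarray(state, dtype=float)
    h = np.maximum(0.0, x @ W1 + b1)                          # ReLU hidden layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())
    return p / p.sum(), h, x

def reinforce_update(trajectory, G):
    """Ascend grad log pi(a|s) * G for every (state, action index) pair of the episode."""
    global W1, b1, W2, b2
    for state, action in trajectory:
        p, h, x = policy(state)
        dlogits = -p
        dlogits[action] += 1.0                                # d log pi(a|s) / d logits
        dh = (W2 @ dlogits) * (h > 0)                         # backprop through ReLU
        W2 += ALPHA * G * np.outer(h, dlogits)
        b2 += ALPHA * G * dlogits
        W1 += ALPHA * G * np.outer(x, dh)
        b1 += ALPHA * G * dh
```
In Easy21 with γ = 1 the return G is simply the terminal reward of the episode, so after each episode the update would be applied to the list of (state, action index) pairs visited along the way.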
[1] http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
[2] http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/Easy21-Johannes.pdf