https://github.com/stonet2000/tictactoe-temporal-difference-learning
I'm learning RL
- Host: GitHub
- URL: https://github.com/stonet2000/tictactoe-temporal-difference-learning
- Owner: StoneT2000
- Created: 2021-01-21T09:23:10.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2021-01-21T19:45:26.000Z (almost 5 years ago)
- Last Synced: 2025-02-05T13:52:40.458Z (11 months ago)
- Language: Python
- Size: 6.84 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# TicTacToe Temporal Difference Learning
Train an agent to play Tic Tac Toe using Temporal Difference Learning
The agent has a few parameters:
```python
TDAgent(**game_settings,
include_end_game_bias=True,
step_size_param=0.7,
explore_ratio=0.5)
```
- `step_size_param` is the step size (learning rate) used when updating the value-function table that drives the agent's decisions.
- `explore_ratio` is the probability that, instead of choosing the action that maximizes the value function, the agent chooses the action leading to the least-explored state.
- `include_end_game_bias`, when set to `True`, hardcodes the true values of end-game states (e.g. three X's or O's in a row).
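As a rough illustration of how these parameters fit together, here is a minimal tabular TD(0)-style sketch. It is not the repository's actual implementation; names such as `value_table`, `visit_counts`, and `next_state_fn` are hypothetical.

```python
import random

# Minimal illustrative sketch of a tabular TD(0)-style agent; none of these
# names are taken from this repository.
step_size_param = 0.7   # learning rate for value-table updates
explore_ratio = 0.5     # probability of exploring instead of exploiting

value_table = {}    # board state -> estimated value (default 0.5 = unknown)
visit_counts = {}   # board state -> number of times visited

def choose_action(state, legal_actions, next_state_fn):
    """Pick a move; `next_state_fn(state, action)` returns the resulting state."""
    if random.random() < explore_ratio:
        # Explore: the action whose resulting state has been visited the least.
        return min(legal_actions,
                   key=lambda a: visit_counts.get(next_state_fn(state, a), 0))
    # Exploit: the action whose resulting state has the highest estimated value.
    return max(legal_actions,
               key=lambda a: value_table.get(next_state_fn(state, a), 0.5))

def td_update(state, next_state):
    # TD(0) update: V(s) <- V(s) + alpha * (V(s') - V(s)).
    # With include_end_game_bias, terminal states would instead be fixed at
    # their true values (1.0 for a win, 0.0 for a loss) rather than learned.
    v_s = value_table.get(state, 0.5)
    v_next = value_table.get(next_state, 0.5)
    value_table[state] = v_s + step_size_param * (v_next - v_s)
    visit_counts[next_state] = visit_counts.get(next_state, 0) + 1
```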
Run `src/run.py` to train and test an agent against a random agent
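For intuition, training against a random opponent looks roughly like the self-contained sketch below. This is a simplified stand-in, not the contents of `src/run.py`: exploration is reduced to random moves, and draws are scored like losses, purely for brevity.

```python
import random

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, m in enumerate(board) if m == ' ']

value = {}                 # afterstate (board as a string) -> value for 'O'
alpha, explore = 0.7, 0.5  # step size and exploration probability

def after_value(board, move):
    b = board[:]
    b[move] = 'O'
    if winner(b) == 'O':
        return 1.0         # end-game value fixed to its true value
    return value.get(''.join(b), 0.5)

for episode in range(20000):
    board, last_after, player = [' '] * 9, None, 'X'   # random 'X' moves first
    while True:
        w = winner(board)
        if w is not None or not legal_moves(board):
            if last_after is not None:
                # Back up the final outcome (win = 1, loss/draw = 0, simplified).
                old = value.get(last_after, 0.5)
                value[last_after] = old + alpha * ((1.0 if w == 'O' else 0.0) - old)
            break
        if player == 'X':
            board[random.choice(legal_moves(board))] = 'X'
        else:
            moves = legal_moves(board)
            if random.random() < explore:
                move = random.choice(moves)            # simplified exploration
            else:
                move = max(moves, key=lambda m: after_value(board, m))
            if last_after is not None:
                # TD(0): V(s) <- V(s) + alpha * (V(s') - V(s))
                old = value.get(last_after, 0.5)
                value[last_after] = old + alpha * (after_value(board, move) - old)
            board[move] = 'O'
            last_after = ''.join(board)
        player = 'O' if player == 'X' else 'X'
```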
## Interesting Behaviors
After enough episodes of training, the TD agent learns that, on the first move, placing an "O" in a corner is of higher value than the other positions.
Interestingly, since the opponent the TD agent trains against is random, the TD agent doesn't realize that placing an "O" anywhere but a corner is generally a losing position. It goes to show that with a bad teacher (the random agent), the student (the TD agent) cannot learn the best strategies.