https://github.com/neymarl/pacman-rl
Implement some reinforcement learning algorithms, test and visualize on Pacman.
- Host: GitHub
- URL: https://github.com/neymarl/pacman-rl
- Owner: NeymarL
- Created: 2018-09-26T03:12:34.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-12-03T07:40:15.000Z (almost 7 years ago)
- Last Synced: 2025-04-23T09:40:58.976Z (6 months ago)
- Topics: actor-critic, pacman, policy, policy-gradient, q-learning, reinforcement-learning, sarsa-lambda
- Language: Python
- Homepage: https://www.52coding.com.cn/
- Size: 7.26 MB
- Stars: 27
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Pacman-RL
Implement some reinforcement learning algorithms, test and visualize on Pacman under [OpenAI's Gym](https://gym.openai.com/) environment.
## Requirements
* Python 3.6+
* gym
* matplotlib
* tensorflow
* keras
* mujoco_py (if you want to save replay)
* torch
* torchvision

## Run
* Run `python run.py --controller MC train` for training using Monte-Carlo control. The weight file will be saved as `weights/mc.h5`.
* Run `python run.py --controller MC --render --show_plot --evaluate_episodes 10 evaluate` for evaluation using Monte-Carlo control. It will render the Pacman environment and show the dynamic Q-value and reward plot at the same time.

```
Full usage: run.py [-h]
[--controller {MC,Sarsa,Sarsa_lambda,Q_learning,REINFORCE,ActorCritic,A3C,PPO}]
[--render] [--save_replay] [--save_plot] [--show_plot]
[--num_episodes NUM_EPISODES] [--batch_size BATCH_SIZE]
[--eva_interval EVA_INTERVAL]
[--evaluate_episodes EVALUATE_EPISODES] [--lr LR]
[--epsilon EPSILON] [--gamma GAMMA] [--lam LAM] [--forward]
[--max_workers MAX_WORKERS] [--t_max T_MAX]
                     {train,evaluate}

positional arguments:
  {train,evaluate}      what to do

optional arguments:
-h, --help show this help message and exit
--controller {MC,Sarsa,Sarsa_lambda,Q_learning,REINFORCE,ActorCritic,A3C,PPO}
choose an algorithm (controller)
--render set to render the env when evaluate
--save_replay set to save replay
--save_plot set to save Q-value plot when evaluate
--show_plot set to show Q-value plot when evaluate
--num_episodes NUM_EPISODES
set to run how many episodes
--batch_size BATCH_SIZE
set the batch size
--eva_interval EVA_INTERVAL
set how many episodes evaluate once
--evaluate_episodes EVALUATE_EPISODES
set evaluate how many episodes
--lr LR set learning rate
--epsilon EPSILON set epsilon when use epsilon-greedy
--gamma GAMMA set reward decay rate
--lam LAM set lambda if use sarsa(lambda) algorithm
--forward set to use forward-view sarsa(lambda)
--rawpixels set to use raw pixels as input (only valid to PPO)
--max_workers MAX_WORKERS
set max workers to train
--t_max T_MAX set simulate how many timesteps until update param
```

## Reinforcement Learning Algorithms
### Monte-Carlo Control
* Policy evaluation (code sketch below)
  * $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t - Q(S_t, A_t) \right)$
* Policy improvement: 𝜀-greedy with 𝜀 decay
* Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)
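A minimal, framework-free sketch of this update loop, assuming a plain table `q_table` mapping `(state, action)` to a float (the README's fc-layer approximator is replaced by a table here purely for brevity; none of these names come from the repo):

```python
import random
from collections import defaultdict

def epsilon_greedy(q_table, state, n_actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_table[(state, a)])

def mc_update(q_table, episode, gamma=0.99, alpha=0.01):
    """Monte-Carlo control: move Q(s, a) toward the sampled return G_t."""
    g = 0.0
    for state, action, reward in reversed(episode):  # episode = [(s, a, r), ...]
        g = reward + gamma * g                       # discounted return from step t
        q_table[(state, action)] += alpha * (g - q_table[(state, action)])

q_table = defaultdict(float)  # Q-values default to 0 for unseen (state, action) pairs
```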
### Sarsa(0)
* Policy evaluation (code sketch below)
  * $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$
* Policy improvement: 𝜀-greedy with 𝜀 decay
* Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)
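The same idea with a one-step bootstrapped target, again sketched against an assumed dict-like `q_table` rather than the repo's fc layer:

```python
def sarsa_update(q_table, s, a, r, s_next, a_next, alpha=0.01, gamma=0.99):
    """Sarsa(0): bootstrap from the action actually taken in the next state (on-policy)."""
    td_target = r + gamma * q_table[(s_next, a_next)]
    q_table[(s, a)] += alpha * (td_target - q_table[(s, a)])
```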
### Sarsa(𝝀)
**Forward-view**
* Policy evaluation
  * $n$-step return: $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
  * $\lambda$-return: $q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$
  * Update: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{\lambda} - Q(S_t, A_t) \right)$
* Policy improvement: 𝜀-greedy with 𝜀 decay
* Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)

**Backward-view**
* Policy evaluation (code sketch below)
  * TD error: $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$
  * Update for all $(s, a)$: $Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, E_t(s, a)$
* Accumulating eligibility trace: $E_t(s, a) = \gamma \lambda \, E_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a)$
* Policy improvement: 𝜀-greedy with 𝜀 decay
* Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)
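A sketch of the backward-view update with an accumulating trace, assuming `traces` is a `defaultdict(float)` that is reset at the start of each episode (illustrative only, not the repo's implementation):

```python
from collections import defaultdict

def sarsa_lambda_update(q_table, traces, s, a, r, s_next, a_next,
                        alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view Sarsa(lambda): broadcast the TD error along the eligibility trace."""
    delta = r + gamma * q_table[(s_next, a_next)] - q_table[(s, a)]
    traces[(s, a)] += 1.0                       # accumulating eligibility trace
    for key in list(traces):
        q_table[key] += alpha * delta * traces[key]
        traces[key] *= gamma * lam              # decay every trace each step

traces = defaultdict(float)                     # reset to all zeros at episode start
```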
### Q-learning
* Policy evaluation (code sketch below)
  * $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right)$
* Policy improvement: 𝜀-greedy with 𝜀 decay
* Q-value function approximation: A fully connected layer (input layer and output layer with no hidden layer)
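For comparison with Sarsa(0), the off-policy variant bootstraps from the greedy next action; a generic table-based sketch (not the repo's network-based code):

```python
def q_learning_update(q_table, s, a, r, s_next, n_actions, alpha=0.01, gamma=0.99):
    """Q-learning: bootstrap from the best next action, regardless of the one taken."""
    best_next = max(q_table[(s_next, a2)] for a2 in range(n_actions))
    q_table[(s, a)] += alpha * (r + gamma * best_next - q_table[(s, a)])
```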
### REINFORCE
**Monte-Carlo policy gradient**
* Use the return $G_t$ as an unbiased sample of $Q^{\pi_\theta}(S_t, A_t)$ (code sketch below): $\Delta\theta_t = \alpha \, \nabla_\theta \log \pi_\theta(S_t, A_t) \, G_t$
* Policy function approximation: softmax policy with a fc layer

**Note**: You should pick a very small `lr` to train a decent model, e.g. `lr = 0.00001`.
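A NumPy sketch of the Monte-Carlo policy-gradient step for a linear softmax policy; the parameter matrix `theta` of shape `(n_features, n_actions)` is an assumption standing in for the repo's fc-layer policy:

```python
import numpy as np

def softmax_policy(theta, state):
    """Action probabilities of a linear softmax policy; theta: (n_features, n_actions)."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=1e-5):
    """REINFORCE: ascend grad log pi(a|s) weighted by the return G_t."""
    g = 0.0
    for state, action, reward in reversed(episode):   # episode = [(s, a, r), ...]
        g = reward + gamma * g
        probs = softmax_policy(theta, state)
        grad_log_pi = -np.outer(state, probs)          # -s * pi(b|s) for every action b
        grad_log_pi[:, action] += state                # +s for the action actually taken
        theta += lr * g * grad_log_pi                  # note the very small learning rate
    return theta
```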
### Advantage Actor-Critic
* Actor
  * Softmax policy with a fc layer
  * Use the advantage function to estimate the policy gradient: $\Delta\theta = \alpha \, \nabla_\theta \log \pi_\theta(S_t, A_t) \, A(S_t, A_t)$, where $A(S_t, A_t) = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
* Critic
  * TD policy evaluation: $V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$
  * Value function approximation: a fully connected layer (input layer and output layer with no hidden layer)
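A one-step NumPy sketch of the coupled actor and critic updates; the linear parameters `theta` and `w` are assumptions standing in for the repo's fc layers:

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=1e-4, lr_critic=1e-3):
    """One-step advantage actor-critic with linear function approximation.

    theta: policy parameters, shape (n_features, n_actions)
    w:     state-value parameters, shape (n_features,)
    """
    td_error = r + gamma * (0.0 if done else s_next @ w) - s @ w   # advantage estimate

    # Critic: TD(0) policy evaluation
    w += lr_critic * td_error * s

    # Actor: policy gradient weighted by the advantage (the TD error)
    logits = s @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_log_pi = -np.outer(s, probs)
    grad_log_pi[:, a] += s
    theta += lr_actor * td_error * grad_log_pi
    return theta, w
```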
### Asynchronous Advantage Actor-Critic (A3C)


### Trust Region Policy Optimization (TRPO)

**Note**: TRPO is run with OpenAI [Spinning Up](https://github.com/openai/spinningup); it is not implemented in this repo.

### Proximal Policy Optimization (PPO)

Run with:
```bash
python run.py --controller PPO --max_workers 6 --gamma 0.99 --evaluate_episodes 50 --batch_size 20 --epsilon 0.2 --lam 0.97 --eva_interval 100 train
```
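The core of PPO is the clipped surrogate objective; a short PyTorch-style sketch is below. This is an illustration only: how the repo wires `--epsilon` and `--lam` into its PPO controller is not shown here, though 0.2 and 0.97 are typical values for the clip range and GAE lambda.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize.

    new_log_probs / old_log_probs: log pi(a|s) under the current / old policy.
    advantages: advantage estimates, e.g. from GAE.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)                 # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```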