https://github.com/swamikannan/cliffwalk
Cliffwalk to compare SARSA and Q-Learning
- Host: GitHub
- URL: https://github.com/swamikannan/cliffwalk
- Owner: SwamiKannan
- License: mit
- Created: 2022-09-05T05:13:12.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-10-25T14:40:40.000Z (over 3 years ago)
- Last Synced: 2025-06-06T11:04:31.893Z (about 1 year ago)
- Topics: cliffwalking, python3, q-learning, q-learning-algorithm, q-learning-vs-sarsa, sarsa-learning
- Language: Jupyter Notebook
- Homepage:
- Size: 1.7 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
- License: LICENSE
README
This is one of a series of projects in which I solve RL environments by building RL algorithms from scratch using Python, PyTorch and TensorFlow.
# Exercise
Compare the SARSA and Q-learning algorithms using the GridWorld Cliff Walking environment.
# CliffWalk

## Environment:
This is a simple implementation of the Gridworld Cliff reinforcement learning task.
Adapted from Example 6.6 (page 106) from Reinforcement Learning: An Introduction by Sutton and Barto: http://incompleteideas.net/book/bookdraft2018jan1.pdf
With inspiration from: https://github.com/dennybritz/reinforcement-learning/blob/master/lib/envs/cliff_walking.py
The board is a 4x12 matrix, with (using NumPy matrix indexing):
- [3, 0] as the start at bottom-left
- [3, 11] as the goal at bottom-right
- [3, 1..10] as the cliff at bottom-center
Each time step incurs -1 reward, and stepping into the cliff incurs -100 reward and a reset to the start. An episode terminates when the agent reaches the goal.
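The rules above can be sketched as a single transition function. This is a minimal illustration, not the repository's implementation: the names `step`, `is_cliff`, and `ACTIONS`, the action ordering, and the choice to keep the agent in place when a move would leave the grid are all assumptions made for the example.

```python
N_ROWS, N_COLS = 4, 12
START, GOAL = (3, 0), (3, 11)

# Actions as (row delta, column delta): up, right, down, left.
# The ordering is an assumption for this sketch.
ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def is_cliff(state):
    """True for the cells [3, 1..10] along the bottom row."""
    row, col = state
    return row == 3 and 1 <= col <= 10


def step(state, action):
    """Return (next_state, reward, done) for one move on the grid."""
    d_row, d_col = ACTIONS[action]
    # Assumption: a move that would leave the grid leaves the agent at the edge.
    row = min(max(state[0] + d_row, 0), N_ROWS - 1)
    col = min(max(state[1] + d_col, 0), N_COLS - 1)
    next_state = (row, col)
    if is_cliff(next_state):
        # Falling off the cliff costs -100 and resets to the start;
        # the episode continues.
        return START, -100, False
    # Every other transition costs -1; the episode ends at the goal.
    return next_state, -1, next_state == GOAL
```

For example, `step(START, 1)` moves right from the start into the cliff and returns the start state with a reward of -100, while `step((2, 11), 2)` moves down into the goal and ends the episode.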
From Sutton and Barto's Reinforcement Learning: An Introduction textbook:
Example 6.6: Cliff Walking. This gridworld example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. Consider the gridworld shown below. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is -1 on all transitions except those into the region marked “The Cliff”. Stepping into this region incurs a reward of -100 and sends the agent instantly back to the start.
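The on-policy versus off-policy distinction comes down to which next-state value each algorithm bootstraps from. The sketch below shows the two standard tabular update rules side by side. It assumes a NumPy Q-table of shape (4, 12, 4) indexed by (row, column, action) and a step size `alpha` of 0.5. Both are illustrative choices, and the function names are hypothetical rather than taken from the repository. `gamma` is 1.0 because the task is undiscounted.

```python
import numpy as np


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """On-policy: bootstrap from the action the agent will actually take next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """Off-policy: bootstrap from the greedy action in the next state."""
    target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```

When the next action is exploratory and leads toward the cliff, the SARSA target includes that action's low value, while the Q-learning target ignores it and uses the best available action instead. In this sketch the goal state's values are never updated and stay at zero, so the target on the final transition reduces to the reward alone.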