https://github.com/phamduyaaaa/play-all-toytext-with-q-learning
Q-Learning applied to Gymnasium's Toy Text environments: FrozenLake, CliffWalking, BlackJack, and Taxi.
- Host: GitHub
- URL: https://github.com/phamduyaaaa/play-all-toytext-with-q-learning
- Owner: phamduyaaaa
- Created: 2024-12-05T11:20:44.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-03-04T03:19:29.000Z (7 months ago)
- Last Synced: 2025-03-24T03:02:28.770Z (7 months ago)
- Topics: gymnasium, q-learning-algorithm
- Language: Python
- Homepage:
- Size: 14.8 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🎮 Play-All-ToyText with Q-Learning
Welcome to the **Play-All-ToyText with Q-Learning** project! 🚀
In this project, I've applied the Q-Learning algorithm to solve problems in popular ToyText environments like FrozenLake, CliffWalking, Blackjack, and Taxi. The goal is to train agents using Q-Learning to optimize policies and maximize rewards in these environments.
---
## 🔍 Introduction

In ToyText environments, agents learn to select actions that maximize rewards in games such as:
- **FrozenLake** 🌊
- **CliffWalking** 🧗‍♂️
- **BlackJack** 🃏
- **Taxi** 🚕

The objective of this project is to apply the **Q-Learning** algorithm to optimize the agents' policies and achieve the highest possible rewards.
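As a rough sketch of the setup (illustrative, not necessarily the exact code in this repository), a tabular agent for a ToyText environment only needs a 2-D Q-table indexed by state and action:

```python
import gymnasium as gym
import numpy as np

# ToyText environments have small, discrete state and action spaces,
# so the Q-table is just a (num_states x num_actions) array of zeros.
env = gym.make("FrozenLake-v1")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
```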
---
### Q-learning Update Rule
The Q-learning update for the Q-table is expressed as:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$
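In code, this update is a single in-place write to the Q-table. A minimal NumPy sketch (variable names and hyperparameter values here are illustrative, not taken from the repository):

```python
import numpy as np

alpha, gamma = 0.1, 0.99  # illustrative learning rate and discount factor

def q_update(q_table, s, a, r, s_next):
    """Apply one tabular Q-learning update, mirroring the formula above."""
    td_target = r + gamma * np.max(q_table[s_next])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_table[s, a]             # temporal-difference error
    q_table[s, a] += alpha * td_error                # step Q(s, a) toward the target
```

---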
### Explanation of Terms
- **$Q(s, a)$:** The current Q-value for performing action $a$ in state $s$.
- **$\alpha$ (Learning Rate):** A scalar that controls how much the new information influences the update. Values range from $0$ to $1$.
- **$r$ (Reward):** The immediate reward received after performing action $a$ in state $s$.
- **$\gamma$ (Discount Factor):** A scalar between $0$ and $1$ that determines the importance of future rewards. A higher $\gamma$ emphasizes long-term rewards.
- **$\max_{a'} Q(s', a')$:** The maximum Q-value for the next state $s'$ across all possible actions $a'$.
- **$(s, a)$:** The current state and action pair.
- **$(s', a')$:** The next state and the set of possible actions.

---
### Key Concepts
1. **Temporal Difference (TD) Error:**
   The difference between the TD target and the current Q-value estimate:

   $$
   \text{TD Error} = r + \gamma \max_{a'} Q(s', a') - Q(s, a)
   $$

2. **Q-value Update:**
   The Q-value for the current state-action pair $(s, a)$ is updated by the TD error scaled by the learning rate $\alpha$. This balances learning from new experiences against relying on existing knowledge.

3. **Learning Dynamics:**
   - The update incorporates both the immediate reward $r$ and the discounted future reward $\gamma \max_{a'} Q(s', a')$.
   - Over time, the Q-table converges to the optimal values, assuming sufficient exploration (e.g., epsilon-greedy, sketched below) and a properly tuned learning rate.
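The "sufficient exploration" condition is commonly met with an epsilon-greedy policy. Here is a hedged sketch (this repository may use a different exploration scheme or decay schedule):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_table, s, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))  # explore: uniform random action
    return int(np.argmax(q_table[s]))               # exploit: best-known action
```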
---

## 🌐 Environments
| Environment | Demo | Plot (Results) |
|-------------|------|----------------|
| [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) | *(demo GIF)* | *(results plot)* |
| [CliffWalking-v0](https://www.gymlibrary.dev/environments/toy_text/cliff_walking/) | *(demo GIF)* | *(results plot)* |
| [BlackJack-v1](https://www.gymlibrary.dev/environments/toy_text/blackjack/) | *(demo GIF)* | *(results plot)* |
| [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/) | *(demo GIF)* | *(results plot)* |
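All four environments can be trained with essentially the same loop. Below is a self-contained sketch for FrozenLake-v1 using the Gymnasium API; the hyperparameters and episode count are common defaults, not necessarily those used in this repository:

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative hyperparameters
rng = np.random.default_rng(0)

for episode in range(10_000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(env.action_space.n))
        else:
            a = int(np.argmax(q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        # tabular Q-learning update (the rule from above)
        q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
        s, done = s_next, terminated or truncated
```

Swapping in CliffWalking-v0 or Taxi-v3 mostly means changing the `gym.make` call; Blackjack-v1 returns tuple observations, which would first need to be mapped to table indices.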
---