https://github.com/alihassanml/reniforcement-learning-maze
- Host: GitHub
- URL: https://github.com/alihassanml/reniforcement-learning-maze
- Owner: alihassanml
- License: mit
- Created: 2025-02-19T16:28:54.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-19T16:37:26.000Z (8 months ago)
- Last Synced: 2025-02-19T17:35:22.042Z (8 months ago)
- Language: Jupyter Notebook
- Size: 23.6 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
### **Markov Decision Processes (MDPs) in Reinforcement Learning**
Markov Decision Processes (MDPs) provide the mathematical framework for decision-making in environments with uncertainty. They are widely used in **reinforcement learning (RL)** to model sequential decision-making problems.
---
## **1. Components of MDPs**
An MDP is defined by a **5-tuple** \( (S, A, P, R, \gamma) \):

- **\( S \) (State Space):** The set of all possible states the agent can be in.
- **\( A \) (Action Space):** The set of all possible actions the agent can take.
- **\( P(s' | s, a) \) (Transition Probability):** The probability of moving to state \( s' \) given that the agent was in state \( s \) and took action \( a \).
- **\( R(s, a) \) (Reward Function):** The reward received after taking action \( a \) in state \( s \).
- **\( \gamma \) (Discount Factor):** A factor \( (0 \leq \gamma \leq 1) \) that determines the importance of future rewards.
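As a concrete illustration, here is a minimal Python sketch of the five components for a toy one-dimensional corridor; the arrays `P` and `R`, the state layout, and \( \gamma = 0.9 \) are assumptions chosen for clarity, not taken from the repository's maze environment.

```python
import numpy as np

# Toy 1-D "corridor" MDP with 4 states and 2 actions (0 = left, 1 = right).
# The layout and names (P, R, gamma) are illustrative only.
n_states, n_actions = 4, 2
gamma = 0.9                                  # discount factor

# P[s, a, s'] = probability of landing in s' after taking a in s.
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0             # deterministic step left
    P[s, 1, min(s + 1, n_states - 1)] = 1.0  # deterministic step right

# R[s, a] = immediate reward; stepping right from the second-to-last state pays +1.
R = np.zeros((n_states, n_actions))
R[n_states - 2, 1] = 1.0
```

---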
## **2. The Markov Property**
The Markov property states that the future state \( s' \) depends only on the current state \( s \) and action \( a \), not on past states or actions:

\[
P(s' | s, a, s_{t-1}, a_{t-1}, \ldots) = P(s' | s, a)
\]

This makes MDPs **memoryless**, simplifying decision-making.
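In code, the memoryless structure shows up as a transition sampler whose signature takes only the current state and action. The helper below is an illustrative sketch that reuses the toy `P` and `n_states` arrays defined above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_state(s, a):
    """Sample s' from P(s' | s, a); no history argument is needed,
    which is exactly what the Markov property guarantees."""
    return int(rng.choice(n_states, p=P[s, a]))
```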
---
## **3. Policy and Value Functions**
- **Policy \( \pi(a | s) \):** Defines the agent's strategy by mapping states to actions.
- **State-Value Function \( V^\pi(s) \):** Expected return from state \( s \) following policy \( \pi \):

\[
V^\pi(s) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, \pi \right]
\]

- **Action-Value Function \( Q^\pi(s, a) \):** Expected return for taking action \( a \) in state \( s \) and following policy \( \pi \):

\[
Q^\pi(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a, \pi \right]
\]
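A direct way to read these definitions is through sampled returns: the discounted sum inside the expectation can be computed for any trajectory of rewards, and averaging such returns from a fixed start state (or state-action pair) gives a Monte Carlo estimate of \( V^\pi \) (or \( Q^\pi \)). A small illustrative helper, assuming the same \( \gamma = 0.9 \) as in the sketch above:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = sum_t gamma^t * r_t for one sampled reward sequence."""
    G = 0.0
    for r in reversed(rewards):      # backward accumulation: G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
    return G

print(discounted_return([0.0, 0.0, 1.0]))   # 0.9**2 * 1 = 0.81
```

---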
## **4. Bellman Equations**
MDPs satisfy the **Bellman equation**, which expresses the recursive relationship of value functions.

### **For State-Value Function:**
\[
V^\pi(s) = \sum_{a} \pi(a | s) \sum_{s'} P(s' | s, a) \left[ R(s, a) + \gamma V^\pi(s') \right]
\]

### **For Action-Value Function:**
\[
Q^\pi(s, a) = \sum_{s'} P(s' | s, a) \left[ R(s, a) + \gamma \sum_{a'} \pi(a' | s') Q^\pi(s', a') \right]
\]
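Because both equations are fixed-point conditions, \( V^\pi \) can be computed by repeatedly applying the state-value backup until it stops changing (iterative policy evaluation). The sketch below assumes the toy `P` and `R` arrays defined earlier and an illustrative uniform-random policy.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Fixed-point iteration of
    V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a) + gamma V(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)          # expected backup per (s, a), shape (S, A)
        V_new = (pi * Q).sum(axis=1)     # average over actions under pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

pi_uniform = np.full((4, 2), 0.5)        # uniform-random policy on the toy corridor
print(policy_evaluation(P, R, pi_uniform))
```

---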
## **5. Solving MDPs**
To find the optimal policy \( \pi^* \), we act greedily with respect to the optimal action-value function \( Q^* \):

\[
\pi^*(s) = \arg\max_{a} Q^*(s, a)
\]

### **Solution Methods:**
1. **Dynamic Programming (DP)**
   - Policy Iteration
   - Value Iteration
2. **Monte Carlo Methods** (for model-free learning)
3. **Temporal Difference (TD) Learning**
   - SARSA (On-policy)
   - Q-learning (Off-policy)
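For small MDPs with a known model, value iteration (a dynamic-programming method from the list above) applies the Bellman optimality backup \( V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\left[R(s,a) + \gamma V(s')\right] \) until convergence and then reads off the greedy policy. A minimal sketch, again assuming the toy `P` and `R` arrays from earlier:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Bellman optimality backups until convergence; returns (V*, greedy policy)."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)              # action values under the current estimate
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # pi*(s) = argmax_a Q*(s, a)
        V = V_new

V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```

---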
## **6. MDPs in Reinforcement Learning**
- **Model-Based RL:** Uses an explicit MDP model (transition probabilities).
- **Model-Free RL:** Learns from experience (e.g., Q-learning, Deep Q Networks).
- **Deep RL:** Uses neural networks to approximate value functions in large MDPs.
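To contrast with the model-based sketches above, tabular Q-learning estimates \( Q^* \) from sampled transitions alone: the update rule never reads \( P \) or \( R \) directly, only the observed reward and next state. The sketch below is illustrative, uses the toy MDP from earlier purely as a simulator, and its hyperparameters (`alpha`, `eps`, episode count) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def q_learning(episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular off-policy TD: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0
        for _ in range(20):                                  # cap episode length
            # epsilon-greedy behaviour policy
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            # the simulated environment returns only (r, s') to the agent
            s_next = int(rng.choice(n_states, p=P[s, a]))
            r = R[s, a]
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
            if s == n_states - 1:                            # treat the right end as terminal
                break
    return Q

print(q_learning().argmax(axis=1))                           # learned greedy action per state
```

---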
## **Conclusion**
MDPs form the foundation of **reinforcement learning** by providing a structured way to model decision-making. Understanding MDPs is crucial for designing RL algorithms that learn optimal strategies in uncertain environments.