https://github.com/alihassanml/reniforcement-learning-maze
- Host: GitHub
- URL: https://github.com/alihassanml/reniforcement-learning-maze
- Owner: alihassanml
- License: mit
- Created: 2025-02-19T16:28:54.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-19T16:37:26.000Z (8 months ago)
- Last Synced: 2025-02-19T17:35:22.042Z (8 months ago)
- Language: Jupyter Notebook
- Size: 23.6 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
### **Markov Decision Processes (MDPs) in Reinforcement Learning**
Markov Decision Processes (MDPs) provide the mathematical framework for decision-making in environments with uncertainty. They are widely used in **reinforcement learning (RL)** to model sequential decision-making problems.
---
## **1. Components of MDPs**
An MDP is defined by a **5-tuple** \( (S, A, P, R, \gamma) \):

- **\( S \) (State Space):** The set of all possible states the agent can be in.
- **\( A \) (Action Space):** The set of all possible actions the agent can take.
- **\( P(s' | s, a) \) (Transition Probability):** The probability of moving to state \( s' \) given that the agent was in state \( s \) and took action \( a \).
- **\( R(s, a) \) (Reward Function):** The reward received after taking action \( a \) in state \( s \).
- **\( \gamma \) (Discount Factor):** A factor \( (0 \leq \gamma \leq 1) \) that determines the importance of future rewards.
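As a concrete illustration, here is a minimal Python sketch of the five components for a toy one-dimensional corridor; the arrays `P` and `R`, the state layout, and \( \gamma = 0.9 \) are assumptions chosen for clarity, not taken from the repository's maze environment.

```python
import numpy as np

# Toy 1-D "corridor" MDP with 4 states and 2 actions (0 = left, 1 = right).
# The layout and names (P, R, gamma) are illustrative only.
n_states, n_actions = 4, 2
gamma = 0.9                                  # discount factor

# P[s, a, s'] = probability of landing in s' after taking a in s.
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0             # deterministic step left
    P[s, 1, min(s + 1, n_states - 1)] = 1.0  # deterministic step right

# R[s, a] = immediate reward; stepping right from the second-to-last state pays +1.
R = np.zeros((n_states, n_actions))
R[n_states - 2, 1] = 1.0
```

---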
## **2. The Markov Property**
The Markov property states that the future state \( s' \) depends only on the current state \( s \) and action \( a \), not on past states or actions:

\[
P(s' | s, a, s_{t-1}, a_{t-1}, \ldots) = P(s' | s, a)
\]

This makes MDPs **memoryless**, simplifying decision-making.
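In code, the memoryless structure shows up as a transition sampler whose signature takes only the current state and action. The helper below is an illustrative sketch that reuses the toy `P` and `n_states` arrays defined above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_state(s, a):
    """Sample s' from P(s' | s, a); no history argument is needed,
    which is exactly what the Markov property guarantees."""
    return int(rng.choice(n_states, p=P[s, a]))
```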
---
## **3. Policy and Value Functions**
- **Policy \( \pi(a | s) \):** Defines the agent's strategy by mapping states to actions.
- **State-Value Function \( V^\pi(s) \):** Expected return from state \( s \) following policy \( \pi \):

\[
V^\pi(s) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, \pi \right]
\]

- **Action-Value Function \( Q^\pi(s, a) \):** Expected return for taking action \( a \) in state \( s \) and following policy \( \pi \):

\[
Q^\pi(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a, \pi \right]
\]
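A direct way to read these definitions is through sampled returns: the discounted sum inside the expectation can be computed for any trajectory of rewards, and averaging such returns from a fixed start state (or state-action pair) gives a Monte Carlo estimate of \( V^\pi \) (or \( Q^\pi \)). A small illustrative helper, assuming the same \( \gamma = 0.9 \) as in the sketch above:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = sum_t gamma^t * r_t for one sampled reward sequence."""
    G = 0.0
    for r in reversed(rewards):      # backward accumulation: G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
    return G

print(discounted_return([0.0, 0.0, 1.0]))   # 0.9**2 * 1 = 0.81
```

---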
## **4. Bellman Equations**
MDPs satisfy the **Bellman equation**, which expresses the recursive relationship of value functions.

### **For State-Value Function:**
\[
V^\pi(s) = \sum_{a} \pi(a | s) \sum_{s'} P(s' | s, a) \left[ R(s, a) + \gamma V^\pi(s') \right]
\]

### **For Action-Value Function:**
\[
Q^\pi(s, a) = \sum_{s'} P(s' | s, a) \left[ R(s, a) + \gamma \sum_{a'} \pi(a' | s') Q^\pi(s', a') \right]
\]
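Because both equations are fixed-point conditions, \( V^\pi \) can be computed by repeatedly applying the state-value backup until it stops changing (iterative policy evaluation). The sketch below assumes the toy `P` and `R` arrays defined earlier and an illustrative uniform-random policy.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Fixed-point iteration of
    V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a) + gamma V(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)          # expected backup per (s, a), shape (S, A)
        V_new = (pi * Q).sum(axis=1)     # average over actions under pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

pi_uniform = np.full((4, 2), 0.5)        # uniform-random policy on the toy corridor
print(policy_evaluation(P, R, pi_uniform))
```

---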
## **5. Solving MDPs**
To find the optimal policy \( \pi^* \), we act greedily with respect to the optimal action-value function \( Q^* \):

\[
\pi^*(s) = \arg\max_{a} Q^*(s, a)
\]

### **Solution Methods:**
1. **Dynamic Programming (DP)**
   - Policy Iteration
   - Value Iteration
2. **Monte Carlo Methods** (for model-free learning)
3. **Temporal Difference (TD) Learning**
   - SARSA (On-policy)
   - Q-learning (Off-policy)
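For small MDPs with a known model, value iteration (a dynamic-programming method from the list above) applies the Bellman optimality backup \( V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\left[R(s,a) + \gamma V(s')\right] \) until convergence and then reads off the greedy policy. A minimal sketch, again assuming the toy `P` and `R` arrays from earlier:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Bellman optimality backups until convergence; returns (V*, greedy policy)."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)              # action values under the current estimate
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # pi*(s) = argmax_a Q*(s, a)
        V = V_new

V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```

---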
## **6. MDPs in Reinforcement Learning**
- **Model-Based RL:** Uses an explicit MDP model (transition probabilities).
- **Model-Free RL:** Learns from experience (e.g., Q-learning, Deep Q Networks).
- **Deep RL:** Uses neural networks to approximate value functions in large MDPs.
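To contrast with the model-based sketches above, tabular Q-learning estimates \( Q^* \) from sampled transitions alone: the update rule never reads \( P \) or \( R \) directly, only the observed reward and next state. The sketch below is illustrative, uses the toy MDP from earlier purely as a simulator, and its hyperparameters (`alpha`, `eps`, episode count) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def q_learning(episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular off-policy TD: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0
        for _ in range(20):                                  # cap episode length
            # epsilon-greedy behaviour policy
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            # the simulated environment returns only (r, s') to the agent
            s_next = int(rng.choice(n_states, p=P[s, a]))
            r = R[s, a]
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
            if s == n_states - 1:                            # treat the right end as terminal
                break
    return Q

print(q_learning().argmax(axis=1))                           # learned greedy action per state
```

---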
## **Conclusion**
MDPs form the foundation of **reinforcement learning** by providing a structured way to model decision-making. Understanding MDPs is crucial for designing RL algorithms that learn optimal strategies in uncertain environments.