NumPy-only implementations of RL algorithms.
https://github.com/anoncheg1/rl-learning

* RL-learning
Reading: https://www.datacamp.com/tutorial/reinforcement-learning-python-introduction
* Q-learning
** Theory
Q-table: one row per state, one column per action; each cell holds the estimated value Q(state, action).

| State | up | down | left | right |
|-------+----+------+------+-------|
| 0     | .  | .    | .    | .     |
| 1     | .  | .    | .    | .     |
| 2     | .  | .    | .    | .     |
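
A minimal NumPy sketch of this layout, indexed as Q_table[state, action] (the 25-state size below is an arbitrary illustration, not the repository's configuration):

#+begin_src python :results output :exports both
import numpy as np

# Hypothetical small grid: 25 states x 4 actions (up, down, left, right)
n_states, n_actions = 25, 4
Q_table = np.zeros((n_states, n_actions))  # one row per state, one column per action

state, action = 7, 3              # arbitrary illustrative indices
print(Q_table[state])             # Q-values of all actions in this state
print(np.argmax(Q_table[state]))  # greedy action for this state
Q_table[state, action] = 0.5      # writing a single Q-value
#+end_src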

What the implementation covers:
- Initialization: Initializing the Q-table and defining the environment.
- Epsilon-Greedy Policy: Choosing actions based on an epsilon-greedy strategy.
- Q-Table Update: Updating the Q-values using the Q-learning update rule.
- Training Loop: Iterating through episodes and steps to train the agent.
- Testing: Evaluating the trained agent's performance.

Q-learning update formula:
: Q_next = np.max(Q_table[next_state])
: Q_table[state, action] += alpha * (reward + gamma * Q_next - Q_table[state, action])

Which is equivalent to:
: Q_table[state, action] = (1 - alpha) * Q_table[state, action] + alpha * (reward + gamma * Q_next)
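
A quick numeric check of this equivalence (the values below are arbitrary, just for illustration):

#+begin_src python :results output :exports both
import numpy as np

# Arbitrary illustrative values
q_sa, q_next, reward = 2.0, 3.0, -1.0
alpha, gamma = 0.02, 0.95

form_a = q_sa + alpha * (reward + gamma * q_next - q_sa)
form_b = (1 - alpha) * q_sa + alpha * (reward + gamma * q_next)
print(form_a, form_b, np.isclose(form_a, form_b))
#+end_src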

Temporal difference (TD) error:
- the difference between the current estimate of a value function and a better estimate of that
  value function based on new information.
- calculated as the *difference* between the predicted value at the *current time step* and the
  predicted value at the *next time step*, adjusted by the reward received and the discount factor.
- δ_t = r_{t+1} + γ * V(s_{t+1}) - V(s_t) is the TD error at time step t.
- r_{t+1} is the reward received at time step t+1.
- γ is the discount factor, which determines how much future rewards are valued.
- V(s_t) / V(s_{t+1}) is the current/next estimate of the value function at state s_t / s_{t+1}.
- Temporal Difference (TD) learning is a class of model-free reinforcement learning methods; a
  minimal TD(0) sketch follows this list.
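
A minimal TD(0) prediction sketch built around exactly this TD error, on a tiny 5-state random-walk chain (the chain and the random policy are illustrative assumptions, not part of this repository's grid world):

#+begin_src python :results output :exports both
import numpy as np

# TD(0) prediction: estimate V(s) for a random policy on a 5-state chain.
# State 4 is terminal and pays reward +1 on arrival; all other rewards are 0.
n_states = 5
V = np.zeros(n_states)   # value-function estimates, terminal value stays 0
alpha, gamma = 0.1, 0.95

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        s_next = min(n_states - 1, max(0, s + rng.choice([-1, 1])))  # random walk, clipped at the ends
        r = 1.0 if s_next == n_states - 1 else 0.0
        td_error = r + gamma * V[s_next] - V[s]  # delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
        V[s] += alpha * td_error                 # TD(0) update
        s = s_next

print(np.round(V, 2))
#+end_src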

** Implementation
Here we use a 30x30 grid. epsilon_greedy_policy showed poor performance
compared to boltzmann_exploration.
#+begin_src python :results output :exports both :session s1
import numpy as np
import random

class GridWorldEnv:
    def __init__(self, width=5, height=5):
        self.width = width
        self.height = height
        self.state = None
        self.action_space = np.array([0, 1, 2, 3])  # 0: up, 1: down, 2: left, 3: right
        self.n_actions = len(self.action_space)
        self.n_observations = width * height

    def reset(self):
        # self.state = np.array([0, 0]) # Start at the top-left corner
        self.state = 0  # Start at the top-left corner
        return self.state

    def step(self, action):
        # x, y = self.state
        x, y = divmod(self.state, self.width)

        reward = -1  # Default reward for each step
        done = False

        if action == 0:    # Up
            y = max(0, y - 1)
        elif action == 1:  # Down
            y = min(self.height - 1, y + 1)
        elif action == 2:  # Left
            x = max(0, x - 1)
        elif action == 3:  # Right
            x = min(self.width - 1, x + 1)

        # - Proximity (shaping) reward around the target corner
        target_x, target_y = self.width - 1, self.height - 1
        new_distance = abs(x - target_x) + abs(y - target_y)          # distance after the move
        old_x, old_y = divmod(self.state, self.width)                 # self.state still holds the previous position
        old_distance = abs(old_x - target_x) + abs(old_y - target_y)  # distance before the move
        if old_distance >= new_distance:
            reward += 0.5  # Reward for not moving farther from the target

        # Check if the agent reached the goal (bottom-right corner)
        if x == self.width - 1 and y == self.height - 1:
            reward = 10  # Reward for reaching the goal
            done = True

        self.state = x * self.width + y

        return self.state, reward, done, {}

    def action_space_sample(self):
        return np.random.choice(self.action_space)

# - Environment
env = GridWorldEnv(30, 30)

# print("env.n_observations, env.n_actions", env.n_observations, env.n_actions)
# 900, 4
# - Initialize the Q-table with zeros.
Q_table = np.zeros((env.n_observations, env.n_actions))
alpha = 0.02           # Learning rate
gamma = 0.95           # Discount factor
epsilon = 0.99         # Initial exploration rate (also used as the Boltzmann temperature)
epsilon_min = 0.001    # Minimum exploration rate
epsilon_decay = 0.995  # Exploration rate decay

def epsilon_greedy_policy(Q_table, state, epsilon):
    "Return action."
    if random.uniform(0, 1) < epsilon:
        return env.action_space_sample()
    else:
        return np.argmax(Q_table[state])

def boltzmann_exploration(Q_table, state, temperature):
    # Calculate the exponential values of the Q-values divided by the temperature
    exp_values = np.exp(Q_table[state] / temperature)
    # Calculate the probabilities using the softmax function
    action_probabilities = exp_values / np.sum(exp_values)
    # Choose an action based on the calculated probabilities
    return np.random.choice(len(Q_table[state]), p=action_probabilities)

def update_Q_table(Q_table, state, action, reward, next_state, done):
    Q_next = np.max(Q_table[next_state]) if not done else 0
    Q_table[state, action] = Q_table[state, action] + alpha * (reward + gamma * Q_next - Q_table[state, action])

# - Main Training Loop
episodes = 1000
max_steps = 100
done = False
for episode in range(episodes):
    # Q_table = np.zeros((env.n_observations, env.n_actions))
    state = env.reset()
    rewards = 0.0

    for step in range(max_steps):
        # action = epsilon_greedy_policy(Q_table, state, epsilon)
        action = boltzmann_exploration(Q_table, state, epsilon)

        next_state, reward, done, _ = env.step(action)
        rewards += reward
        # print(divmod(env.state, env.width), action, reward, epsilon)
        update_Q_table(Q_table, state, action, reward, next_state, done)

        state = next_state

        if done:
            break

    # Decay epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Print rewards every 100 episodes (max_steps == 100) and stop training once the goal is reached
    if done or episode % max_steps == 0:
        print(f"{done} Episode {episode+1}, Reward: {rewards}, epsilon: {epsilon}, step:{step}")
        if done:
            break

if done:
    # - Testing
    epsilon = 0.01  # Low temperature for testing (near-greedy action selection)
    test_episodes = 100

    for episode in range(test_episodes):
        state = env.reset()
        done = False
        rewards = 0.0

        for step in range(max_steps):
            # action = epsilon_greedy_policy(Q_table, state, epsilon)
            action = boltzmann_exploration(Q_table, state, epsilon)
            next_state, reward, done, _ = env.step(action)
            rewards += reward

            state = next_state

            if done:
                print("testing success")
                break

        print(f"Test Episode {episode+1}, Reward: {rewards}")
        if done:
            break

# print(Q_table)
#+end_src

#+RESULTS:
#+begin_example
False Episode 1, Reward: -70.0, epsilon: 0.98505, step:99
False Episode 101, Reward: -71.5, epsilon: 0.5967141684651915, step:99
False Episode 201, Reward: -68.5, epsilon: 0.36147180229136094, step:99
False Episode 301, Reward: -66.5, epsilon: 0.21896893145312793, step:99
False Episode 401, Reward: -68.5, epsilon: 0.13264490518426955, step:99
False Episode 501, Reward: -62.5, epsilon: 0.0803523621117462, step:99
False Episode 601, Reward: -64.0, epsilon: 0.0486750854694935, step:99
False Episode 701, Reward: -64.0, epsilon: 0.029485927771078575, step:99
True Episode 709, Reward: -45.0, epsilon: 0.028326925693042973, step:93
testing success
Test Episode 1, Reward: -38.5
#+end_example
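
One practical caveat: np.exp(Q_table[state] / temperature) in boltzmann_exploration can overflow (or underflow to all zeros) once Q-values grow in magnitude or the temperature becomes very small. A possible hardening is sketched below under the same Q_table layout as above; subtracting the row maximum before exponentiating leaves the softmax probabilities unchanged but keeps the exponentials finite.

#+begin_src python :results output :exports both
import numpy as np

def boltzmann_exploration_stable(Q_table, state, temperature):
    # Subtract the max Q-value before exponentiating; the softmax is invariant
    # to this shift, but it prevents overflow at small temperatures.
    q = Q_table[state] / temperature
    q = q - np.max(q)
    exp_values = np.exp(q)
    action_probabilities = exp_values / np.sum(exp_values)
    return np.random.choice(len(Q_table[state]), p=action_probabilities)

# Illustrative check with Q-values that would overflow the naive version
Q_demo = np.array([[900.0, 10.0, -5.0, 0.0]])
print(boltzmann_exploration_stable(Q_demo, 0, temperature=0.01))
#+end_src

For a purely greedy test policy, the learned table can also be read directly with np.argmax(Q_table, axis=1).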