NumPy-only implementations of RL algorithms.
https://github.com/anoncheg1/rl-learning

* RL-learning
Reading: https://www.datacamp.com/tutorial/reinforcement-learning-python-introduction
* Q-learning
** Theory
Q-table: one row per state, one column per action; each cell holds the estimated value Q(state, action).

| State | up | down | left | right |
|-------+----+------+------+-------|
| 0     | .  | .    | .    | .     |
| 1     | .  | .    | .    | .     |
| 2     | .  | .    | .    | .     |
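
A minimal NumPy sketch of this layout, indexed as Q_table[state, action] (the 25-state size below is an arbitrary illustration, not the repository's configuration):

#+begin_src python :results output :exports both
import numpy as np

# Hypothetical small grid: 25 states x 4 actions (up, down, left, right)
n_states, n_actions = 25, 4
Q_table = np.zeros((n_states, n_actions))  # one row per state, one column per action

state, action = 7, 3              # arbitrary illustrative indices
print(Q_table[state])             # Q-values of all actions in this state
print(np.argmax(Q_table[state]))  # greedy action for this state
Q_table[state, action] = 0.5      # writing a single Q-value
#+end_src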

What the implementation covers:
- Initialization: Initializing the Q-table and defining the environment.
- Epsilon-Greedy Policy: Choosing actions based on an epsilon-greedy strategy.
- Q-Table Update: Updating the Q-values using the Q-learning update rule.
- Training Loop: Iterating through episodes and steps to train the agent.
- Testing: Evaluating the trained agent's performance.

Q-learning update formula:
: Q_next = np.max(Q_table[next_state])
: Q_table[state, action] += alpha * (reward + gamma * Q_next - Q_table[state, action])

Which is equivalent to:
: Q_table[state, action] = (1 - alpha) * Q_table[state, action] + alpha * (reward + gamma * Q_next)
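
A quick numeric check of this equivalence (the values below are arbitrary, just for illustration):

#+begin_src python :results output :exports both
import numpy as np

# Arbitrary illustrative values
q_sa, q_next, reward = 2.0, 3.0, -1.0
alpha, gamma = 0.02, 0.95

form_a = q_sa + alpha * (reward + gamma * q_next - q_sa)
form_b = (1 - alpha) * q_sa + alpha * (reward + gamma * q_next)
print(form_a, form_b, np.isclose(form_a, form_b))
#+end_src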

Temporal difference (TD) error:
- the difference between the current estimate of a value function and a better estimate of that
  value function based on new information.
- calculated as the *difference* between the predicted value at the *current time step* and the
  predicted value at the *next time step*, adjusted by the reward received and the discount factor.
- δ_t = r_{t+1} + γ * V(s_{t+1}) - V(s_t) is the TD error at time step t.
- r_{t+1} is the reward received at time step t+1.
- γ is the discount factor, which determines how much future rewards are valued.
- V(s_t) / V(s_{t+1}) is the current/next estimate of the value function at state s_t / s_{t+1}.
- Temporal Difference (TD) learning is a class of model-free reinforcement learning methods; a
  minimal TD(0) sketch follows this list.
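
A minimal TD(0) prediction sketch built around exactly this TD error, on a tiny 5-state random-walk chain (the chain and the random policy are illustrative assumptions, not part of this repository's grid world):

#+begin_src python :results output :exports both
import numpy as np

# TD(0) prediction: estimate V(s) for a random policy on a 5-state chain.
# State 4 is terminal and pays reward +1 on arrival; all other rewards are 0.
n_states = 5
V = np.zeros(n_states)   # value-function estimates, terminal value stays 0
alpha, gamma = 0.1, 0.95

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        s_next = min(n_states - 1, max(0, s + rng.choice([-1, 1])))  # random walk, clipped at the ends
        r = 1.0 if s_next == n_states - 1 else 0.0
        td_error = r + gamma * V[s_next] - V[s]  # delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
        V[s] += alpha * td_error                 # TD(0) update
        s = s_next

print(np.round(V, 2))
#+end_src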

** Implementation
Here we use a 30x30 grid. epsilon_greedy_policy showed poor performance
compared to boltzmann_exploration.
#+begin_src python :results output :exports both :session s1
import numpy as np
import random

class GridWorldEnv:
    def __init__(self, width=5, height=5):
        self.width = width
        self.height = height
        self.state = None
        self.action_space = np.array([0, 1, 2, 3])  # 0: up, 1: down, 2: left, 3: right
        self.n_actions = len(self.action_space)
        self.n_observations = width * height

    def reset(self):
        # self.state = np.array([0, 0]) # Start at the top-left corner
        self.state = 0  # Start at the top-left corner
        return self.state

    def step(self, action):
        # x, y = self.state
        x, y = divmod(self.state, self.width)

        reward = -1  # Default reward for each step
        done = False

        if action == 0:    # Up
            y = max(0, y - 1)
        elif action == 1:  # Down
            y = min(self.height - 1, y + 1)
        elif action == 2:  # Left
            x = max(0, x - 1)
        elif action == 3:  # Right
            x = min(self.width - 1, x + 1)

        # - Proximity (shaping) reward around the target corner
        target_x, target_y = self.width - 1, self.height - 1
        new_distance = abs(x - target_x) + abs(y - target_y)          # distance after the move
        old_x, old_y = divmod(self.state, self.width)                 # self.state still holds the previous position
        old_distance = abs(old_x - target_x) + abs(old_y - target_y)  # distance before the move
        if old_distance >= new_distance:
            reward += 0.5  # Reward for not moving farther from the target

        # Check if the agent reached the goal (bottom-right corner)
        if x == self.width - 1 and y == self.height - 1:
            reward = 10  # Reward for reaching the goal
            done = True

        self.state = x * self.width + y

        return self.state, reward, done, {}

    def action_space_sample(self):
        return np.random.choice(self.action_space)

# - Environment
env = GridWorldEnv(30, 30)

# print("env.n_observations, env.n_actions", env.n_observations, env.n_actions)
# 900, 4
# - Initialize the Q-table with zeros.
Q_table = np.zeros((env.n_observations, env.n_actions))
alpha = 0.02           # Learning rate
gamma = 0.95           # Discount factor
epsilon = 0.99         # Initial exploration rate (also used as the Boltzmann temperature)
epsilon_min = 0.001    # Minimum exploration rate
epsilon_decay = 0.995  # Exploration rate decay

def epsilon_greedy_policy(Q_table, state, epsilon):
    "Return action."
    if random.uniform(0, 1) < epsilon:
        return env.action_space_sample()
    else:
        return np.argmax(Q_table[state])

def boltzmann_exploration(Q_table, state, temperature):
    # Calculate the exponential values of the Q-values divided by the temperature
    exp_values = np.exp(Q_table[state] / temperature)
    # Calculate the probabilities using the softmax function
    action_probabilities = exp_values / np.sum(exp_values)
    # Choose an action based on the calculated probabilities
    return np.random.choice(len(Q_table[state]), p=action_probabilities)

def update_Q_table(Q_table, state, action, reward, next_state, done):
    Q_next = np.max(Q_table[next_state]) if not done else 0
    Q_table[state, action] = Q_table[state, action] + alpha * (reward + gamma * Q_next - Q_table[state, action])

# - Main Training Loop
episodes = 1000
max_steps = 100
done = False
for episode in range(episodes):
    # Q_table = np.zeros((env.n_observations, env.n_actions))
    state = env.reset()
    rewards = 0.0

    for step in range(max_steps):
        # action = epsilon_greedy_policy(Q_table, state, epsilon)
        action = boltzmann_exploration(Q_table, state, epsilon)

        next_state, reward, done, _ = env.step(action)
        rewards += reward
        # print(divmod(env.state, env.width), action, reward, epsilon)
        update_Q_table(Q_table, state, action, reward, next_state, done)

        state = next_state

        if done:
            break

    # Decay epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Print rewards every 100 episodes (max_steps == 100) and stop training once the goal is reached
    if done or episode % max_steps == 0:
        print(f"{done} Episode {episode+1}, Reward: {rewards}, epsilon: {epsilon}, step:{step}")
        if done:
            break

if done:
    # - Testing
    epsilon = 0.01  # Low temperature for testing (near-greedy action selection)
    test_episodes = 100

    for episode in range(test_episodes):
        state = env.reset()
        done = False
        rewards = 0.0

        for step in range(max_steps):
            # action = epsilon_greedy_policy(Q_table, state, epsilon)
            action = boltzmann_exploration(Q_table, state, epsilon)
            next_state, reward, done, _ = env.step(action)
            rewards += reward

            state = next_state

            if done:
                print("testing success")
                break

        print(f"Test Episode {episode+1}, Reward: {rewards}")
        if done:
            break

# print(Q_table)
#+end_src

#+RESULTS:
#+begin_example
False Episode 1, Reward: -70.0, epsilon: 0.98505, step:99
False Episode 101, Reward: -71.5, epsilon: 0.5967141684651915, step:99
False Episode 201, Reward: -68.5, epsilon: 0.36147180229136094, step:99
False Episode 301, Reward: -66.5, epsilon: 0.21896893145312793, step:99
False Episode 401, Reward: -68.5, epsilon: 0.13264490518426955, step:99
False Episode 501, Reward: -62.5, epsilon: 0.0803523621117462, step:99
False Episode 601, Reward: -64.0, epsilon: 0.0486750854694935, step:99
False Episode 701, Reward: -64.0, epsilon: 0.029485927771078575, step:99
True Episode 709, Reward: -45.0, epsilon: 0.028326925693042973, step:93
testing success
Test Episode 1, Reward: -38.5
#+end_example
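
One practical caveat: np.exp(Q_table[state] / temperature) in boltzmann_exploration can overflow (or underflow to all zeros) once Q-values grow in magnitude or the temperature becomes very small. A possible hardening is sketched below under the same Q_table layout as above; subtracting the row maximum before exponentiating leaves the softmax probabilities unchanged but keeps the exponentials finite.

#+begin_src python :results output :exports both
import numpy as np

def boltzmann_exploration_stable(Q_table, state, temperature):
    # Subtract the max Q-value before exponentiating; the softmax is invariant
    # to this shift, but it prevents overflow at small temperatures.
    q = Q_table[state] / temperature
    q = q - np.max(q)
    exp_values = np.exp(q)
    action_probabilities = exp_values / np.sum(exp_values)
    return np.random.choice(len(Q_table[state]), p=action_probabilities)

# Illustrative check with Q-values that would overflow the naive version
Q_demo = np.array([[900.0, 10.0, -5.0, 0.0]])
print(boltzmann_exploration_stable(Q_demo, 0, temperature=0.01))
#+end_src

For a purely greedy test policy, the learned table can also be read directly with np.argmax(Q_table, axis=1).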