https://github.com/bjoern-hempel/js-reinforcement-learning-framework
A reinforcement learning framework
- Host: GitHub
- URL: https://github.com/bjoern-hempel/js-reinforcement-learning-framework
- Owner: bjoern-hempel
- License: mit
- Created: 2018-08-21T22:47:48.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-10-08T22:08:20.000Z (about 7 years ago)
- Last Synced: 2025-05-17T02:06:08.587Z (5 months ago)
- Topics: javascript, mathematics, mit-license, reinforcement-learning, reinforcement-learning-algorithms, statistics
- Language: JavaScript
- Size: 3.41 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# A JavaScript reinforcement learning framework
This is a JavaScript reinforcement learning framework.
## 1. Introduction
In progress..
## 2. Markov decision process (MDP)
### 2.1 Theory
#### 2.1.1 The Bellman equation
#### 2.1.2 The value iteration algorithm
#### 2.1.3 The Q-value iteration algorithm
### 2.2 Usage
#### 2.2.1 Super basic example
Let's look at a state s0 with three actions, each of which leads back to s0. The first action a0 earns a reward of 1. The second action a1 earns a penalty of 1 (a reward of -1). The third action a2 is neutral and brings neither reward nor punishment. Each action has only one possible state change, so its transition probability is 1.0 (100%). The setup is not very meaningful on its own, but it shows the procedure in detail:

Clearly, a0 is the best option and leads to the maximum reward. a1 leads to punishment and is the least favorable choice, while a2 is the neutral option without any reward. Let's calculate that:
##### 2.2.1.1 Code
**The written-out version:**
```javascript
var discountFactor = 0;

var rl = new ReinforcementLearning.mdp();

/* s0 */
var s0 = rl.addState();

/* create a0, a1 and a2 */
var a0 = rl.addAction(s0);
var a1 = rl.addAction(s0);
var a2 = rl.addAction(s0);

/* add the action to state connections (state changes) */
rl.addStateChange(a0, s0, 1.0, 1);
rl.addStateChange(a1, s0, 1.0, -1);
rl.addStateChange(a2, s0, 1.0, 0);

var Q = rl.calculateQ(discountFactor);
console.log(JSON.stringify(Q));
```

**The short version:**
```javascript
var discountFactor = 0;

var rl = new ReinforcementLearning();

/* s0 */
var s0 = rl.addState();

/* s0.a0, s0.a1 and s0.a2 */
rl.addAction(s0, new StateChange(s0, 1.0, 1));
rl.addAction(s0, new StateChange(s0, 1.0, -1));
rl.addAction(s0, new StateChange(s0, 1.0, 0));

var Q = rl.calculateQ(discountFactor);
console.log(JSON.stringify(Q));
```

**It returns:**
```json
[
[1, -1, 0]
]
```

As suspected above, a0 is the winner with the maximum value in Q(s=0): Q(s=0,a=0) = 1. The discountFactor is set to 0 because we only want to consider one iteration step. The discountFactor determines the importance of future rewards: a factor of 0 makes the agent "short-sighted" by considering only immediate rewards, while a factor close to 1 makes it strive for a high long-term reward. Because it is set to 0 here, only the next step matters, which yields the result shown above.
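
For reference, the role of the discount factor can be written out with the textbook Q-value iteration update. This is given as an assumption about what `calculateQ` computes, not as a statement about its implementation; the symbols T (transition probability), R (reward) and γ (discount factor) are notation introduced for this sketch:

$$
Q_{k+1}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
$$

With γ = 0 the second term vanishes and Q(s, a) is just the expected immediate reward. For the example above that gives Q(s0, a0) = 1.0 · 1 = 1, Q(s0, a1) = 1.0 · (-1) = -1 and Q(s0, a2) = 1.0 · 0 = 0.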
The situation doesn't change if we look a bit more far-sightedly and set the discount factor close to 1 (e.g. 0.9):
```javascript
var discountFactor = 0.9;
```

**It returns:**
```json
[
[9.991404955442832, 7.991404955442832, 8.991404955442832]
]
```

Q(s=0,a=0) is still the winner with the maximum of Q(s=0) ≈ 10. The `calculateQ` function repeats the Q-value iteration until the change in Q between two iterations falls below a certain threshold. The default value of this threshold is 0.001.
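
The stopping rule can be illustrated with a small standalone sketch of Q-value iteration. It does not use the framework: the function name `qValueIteration` and the `[nextState, probability, reward]` triple format are assumptions made for this illustration, and the framework's own implementation may differ.

```javascript
// Standalone sketch of Q-value iteration (not the framework's code).
// mdp[s][a] is a list of [nextState, probability, reward] triples.
function qValueIteration(mdp, discountFactor, threshold = 0.001) {
  let Q = mdp.map(actions => actions.map(() => 0));
  let iterations = 0;
  let maxChange;

  do {
    // Value of each state under the current estimate: V(s) = max_a Q(s, a).
    const V = Q.map(row => Math.max(...row));

    // Synchronous update of every Q(s, a) from the previous estimate.
    const nextQ = mdp.map(actions =>
      actions.map(changes =>
        changes.reduce(
          (sum, [nextState, probability, reward]) =>
            sum + probability * (reward + discountFactor * V[nextState]),
          0
        )
      )
    );

    // Largest absolute change of any Q value in this sweep.
    maxChange = 0;
    nextQ.forEach((row, s) =>
      row.forEach((q, a) => {
        maxChange = Math.max(maxChange, Math.abs(q - Q[s][a]));
      })
    );

    Q = nextQ;
    iterations++;
  } while (maxChange >= threshold);

  return { Q, iterations };
}

// The super basic example: one state, three actions that all lead back to s0.
const superBasic = [
  [[[0, 1.0, 1]], [[0, 1.0, -1]], [[0, 1.0, 0]]]
];

const result = qValueIteration(superBasic, 0.9);
console.log(JSON.stringify(result.Q), result.iterations);
```

With a discount factor of 0.9 this sketch stops after 67 sweeps with Q(s=0,a=0) ≈ 9.9914, which agrees with the values shown above. The exact limit would be 10; the small gap comes from stopping at the threshold.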
##### 2.2.1.2 Watch the [demo](demo/rl-super-basic.html)
Discount rate 0.9:
#### 2.2.2 Basic example
Let's look at the next example:

In the short term (discountRate = 0), it is a good idea to keep taking a1 from s0 and stay in state s0, because we always get a reward of 2 and never receive the punishment of -5. From a far-sighted point of view (discountRate = 0.9), it is better to take a0: after the punishment of -5 we receive a reward of 10 in s1, a net reward of 5 over two steps instead of the 4 earned by taking a1 twice. Let's calculate that:
##### 2.2.2.1 Code
```javascript
var discountRate = 0.9;

var rl = new ReinforcementLearning.mdp();

/* s0 and s1 */
var s0 = rl.addState();
var s1 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s1, 1.0, -5));
rl.addAction(s0, new StateChange(s0, 1.0, 2));

/* s1.a0 */
rl.addAction(s1, new StateChange(s0, 1.0, 10));

var Q = rl.calculateQ(discountRate);
console.log(JSON.stringify(Q));
```

**It returns:**
```json
[
[21.044799074176453, 20.93957918978874],
[28.93957918978874]
]
```

As expected, from a far-sighted point of view it is better to choose s0.a0, because Q(s=0,a=0) ≈ 21.04 is larger than Q(s=0,a=1) ≈ 20.94.
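
As a cross-check, the exact fixed point for this example can be worked out by hand, assuming the far-sighted policy takes a0 in both states. The symbols V(s) (value of the best action in s) and γ (discount rate) are notation introduced here:

$$
\begin{aligned}
V(s_0) &= -5 + \gamma V(s_1), \qquad V(s_1) = 10 + \gamma V(s_0) \\
V(s_0) &= \frac{-5 + 10\gamma}{1 - \gamma^2} = \frac{4}{0.19} \approx 21.05 \quad (\gamma = 0.9) \\
V(s_1) &= 10 + 0.9 \cdot 21.05 \approx 28.95 \\
Q(s_0, a_1) &= 2 + 0.9 \cdot 21.05 \approx 20.95
\end{aligned}
$$

Since 21.05 > 20.95, a0 is indeed the better choice in s0. The values returned above are slightly lower than these limits, which is consistent with the iteration stopping once the change falls below the threshold.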
##### 2.2.2.2 Watch the [demo](demo/rl-basic.html)
Discount rate 0.9:
##### 2.2.2.3 Comparison of different discount rates
| discountRate | type | s0 | s1 | s0 (winner) | s1 (winner) |
|--------------|--------------------|------------------|---------------|------------------------|------------------------|
| 0.0 | short-sighted | `[-5, 2]` | `[10]` | a1 | a0 |
| 0.1 | short-sighted | `[-3.98, 2.22]` | `[10.22]` | a1 | a0 |
| 0.5 | half short-sighted | `[1.00, 4.00]` | `[12.00]` | a1 | a0 |
| 0.9 | far-sighted | `[21.04, 20.94]` | `[28.94]` | a0 | a0 |

**Graphic:**
#### 2.2.3 More complex example
Let's look at a somewhat more complex example:

In the short term it is a good idea to keep taking a0 and stay in state s0. But what about the far-sighted view? Is courage rewarded in this case? Let's calculate that:
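
The short-sighted view can be checked without the framework: with a discount rate of 0, the value of an action is just its expected immediate reward. The sketch below uses the transition probabilities and rewards from the code in 2.2.3.1; the helper `expectedReward` and the `[probability, reward]` pair format are assumptions for illustration only.

```javascript
// Expected immediate reward of one action: sum of probability * reward.
// Each action is a list of [probability, reward] pairs (illustrative format).
function expectedReward(stateChanges) {
  return stateChanges.reduce(
    (sum, [probability, reward]) => sum + probability * reward,
    0
  );
}

const immediate = {
  s0: [expectedReward([[1.0, 1]]), expectedReward([[0.5, -2], [0.5, 0]])],
  s1: [expectedReward([[1.0, 0]]), expectedReward([[1.0, -50]])],
  s2: [expectedReward([[0.8, 100], [0.1, 0], [0.1, 0]])]
};

console.log(JSON.stringify(immediate));
```

In s0 the expected immediate reward of a0 (1) exceeds that of a1 (-1), which is why a0 wins when only the next step counts.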
##### 2.2.3.1 Code
```javascript
var discountRate = 0.9;

var rl = new ReinforcementLearning.mdp();

/* s0, s1 and s2 */
var s0 = rl.addState();
var s1 = rl.addState();
var s2 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s0, 1.0, 1));
rl.addAction(s0, new StateChange(s0, 0.5, -2), new StateChange(s1, 0.5, 0));

/* s1.a0 and s1.a1 */
rl.addAction(s1, new StateChange(s1, 1.0, 0));
rl.addAction(s1, new StateChange(s2, 1.0, -50));

/* s2.a0 */
rl.addAction(s2, new StateChange(s0, 0.8, 100), new StateChange(s1, 0.1, 0), new StateChange(s2, 0.1, 0));

var Q = rl.calculateQ(discountRate);
console.log(JSON.stringify(Q));
```

**It returns:**
```json
[
[61.75477734479686, 67.50622243150205],
[76.25766751820726, 84.73165595751362],
[149.70275422340958]
]
```

Looking at the example far-sightedly (`discountRate = 0.9`), it is a good idea to take action a1 in state s0, action a1 in state s1 and action a0 in state s2.
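
Reading the best action per state off the Q table amounts to taking the index of the largest value in each row. A minimal sketch, assuming Q is the nested array printed above (the helper name `greedyPolicy` is hypothetical and not part of the framework):

```javascript
// Pick, for every state, the index of the action with the largest Q value.
function greedyPolicy(Q) {
  return Q.map(row => row.indexOf(Math.max(...row)));
}

// Q values returned above for discountRate = 0.9.
const Q = [
  [61.75477734479686, 67.50622243150205],
  [76.25766751820726, 84.73165595751362],
  [149.70275422340958]
];

console.log(greedyPolicy(Q)); // [ 1, 1, 0 ]
```

The result lists one action index per state: a1 for s0, a1 for s1 and a0 for s2.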
##### 2.2.3.2 Watch the [demo](demo/rl-more-complex.html)
Discount rate 0.9:
##### 2.2.3.3 Comparison of different discount rates
| discountRate | type | s0 | s1 | s2 | s0 (winner) | s1 (winner) | s2 (winner) |
|---|-------------------------------------|----------------|----------------|----------|----------------|----------------|----------|
| 0.0 | short-sighted | `[1, -1]` | `[0, -50]` | `[80]` | a0 | a0 | a0 |
| 0.1 | short-sighted | `[1.11, -0.94]` | `[0, -41.91]` | `[80.90]` | a0 | a0 | a0 |
| 0.5 | half short-sighted | `[2, -0.5]` | `[0, -7.47]` | `[85.05]` | a0 | a0 | a0 |
| 0.9 | far-sighted | `[61.76, 67.51]` | `[76.27, 84.74]` | `[149.71]` | a1 | a1 | a0 |

**Graphic:**
#### 2.2.4 Real example
##### 2.2.4.1 Code
In progress..
##### 2.2.4.2 Watch the [demo](demo/rl-real.html)
In progress..
##### 2.2.4.3 Comparison of different discount rates
In progress..
## 3. Temporal Difference Learning and Q-Learning
### 3.1 Theory
#### 3.1.1 Formula
In progress..
### 3.2 Usage
#### 3.2.1 Real example
##### 3.2.1.1 Code
In progress..
##### 3.2.1.2 Watch the [demo](demo/rl-real-q-learning.html)
In progress..
#### 3.2.2 Simple Grid World
Imagine a person who is currently on the field x=5, y=3. The goal is to find a safe way to the field x=1, y=3. The red fields must be avoided: they are chasms that endanger the person (negative rewards, i.e. punishments). Which way should the person go?
Let's calculate that:
##### 3.2.2.1 Code
```javascript
var discountRate = 0.95;
var width = 5;
var height = 3;
var R = {
    0: {2: 100},
    1: {2: -10},
    2: {2: -10, 1: -10},
    3: {2: -10},
    4: {2: 0, 0: -10}
};

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);
```

**It returns:**
##### 3.2.2.2 Watch the [demo](demo/rl-grid-world.html)
In progress.
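
To make the idea behind the grid-world calculation more concrete, here is a standalone tabular Q-learning sketch for the 5 x 3 grid from 3.2.2.1. It does not call the framework and is not its implementation. Several things are assumptions made for this illustration: that `R` is indexed as `R[x][y]` with zero-based coordinates, that a reward is received on entering a field, that reaching the goal ends the walk, and all helper names (`step`, `cellIndex`, `greedyPath`, the seeded `random`).

```javascript
// Standalone tabular Q-learning sketch (assumptions: R[x][y] indexing,
// reward on entering a field, deterministic moves, terminal goal field).
const width = 5;
const height = 3;
const discountRate = 0.95;
const R = {
  0: {2: 100},
  1: {2: -10},
  2: {2: -10, 1: -10},
  3: {2: -10},
  4: {2: 0, 0: -10}
};
const start = {x: 4, y: 2};
const goal = {x: 0, y: 2};
const moves = [[0, -1], [1, 0], [0, 1], [-1, 0]]; // up, right, down, left

const rewardAt = (x, y) => (R[x] && R[x][y]) || 0;
const cellIndex = (x, y) => y * width + x;
const isGoal = (x, y) => x === goal.x && y === goal.y;

// Moving off the grid leaves the walker where it is.
function step(x, y, action) {
  const nx = Math.min(width - 1, Math.max(0, x + moves[action][0]));
  const ny = Math.min(height - 1, Math.max(0, y + moves[action][1]));
  return {x: nx, y: ny, reward: rewardAt(nx, ny)};
}

// Small seeded generator (linear congruential) so runs are repeatable.
let seed = 42;
function random() {
  seed = (Math.imul(seed, 1664525) + 1013904223) >>> 0;
  return seed / 4294967296;
}

const Q = Array.from({length: width * height}, () => [0, 0, 0, 0]);
const learningRate = 1; // sufficient because every move is deterministic

for (let i = 0; i < 100000; i++) {
  // Sample a random field and action instead of following one long walk.
  const x = Math.floor(random() * width);
  const y = Math.floor(random() * height);
  if (isGoal(x, y)) continue;
  const action = Math.floor(random() * moves.length);

  // Q-learning update: move Q(s, a) towards reward + discounted best next value.
  const next = step(x, y, action);
  const future = isGoal(next.x, next.y) ? 0 : Math.max(...Q[cellIndex(next.x, next.y)]);
  const target = next.reward + discountRate * future;
  const s = cellIndex(x, y);
  Q[s][action] += learningRate * (target - Q[s][action]);
}

// Follow the best action from the start field until the goal is reached.
function greedyPath(maxSteps = 50) {
  const path = [[start.x, start.y]];
  let pos = start;
  while (!isGoal(pos.x, pos.y) && path.length <= maxSteps) {
    const row = Q[cellIndex(pos.x, pos.y)];
    pos = step(pos.x, pos.y, row.indexOf(Math.max(...row)));
    path.push([pos.x, pos.y]);
  }
  return path;
}

console.log(JSON.stringify(greedyPath()));
```

Under these assumptions the greedy path from the start field needs eight moves and avoids every field with a negative reward, because at a discount rate of 0.95 the detour is worth more than the shorter route through the chasms.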
#### 3.2.3 Extended Grid World
This is the same setup as in example 3.2.2, but with a bigger grid world:
```javascript
var width = 10;
var height = 5;
var R = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};
```

That's easy now:
Now imagine that the person is drunk. That means that, with a certain probability, the person goes to the right or left although they wanted to go straight ahead. Depending on how drunk the person is, we choose a probability of 2.5% to go left and a probability of 2.5% to go right (`splitT = 0.025`). What is the safest way now?

**Preliminary consideration:** Moving away from the chasms first and staying away from them might now be better than taking the shortest route.
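
One way to picture the drunk walker is as a probability distribution over the actual direction for each intended direction. The sketch below is an assumption about what `splitT` means, based only on the description above; the direction names and the helper `slipDistribution` are hypothetical and not taken from the framework.

```javascript
// Directions in clockwise order, so the left and right of a heading are neighbours.
const directions = ['up', 'right', 'down', 'left'];

// With probability splitT the walker veers left, with splitT it veers right,
// and with the remaining probability it goes where it intended.
function slipDistribution(intended, splitT) {
  const i = directions.indexOf(intended);
  const left = directions[(i + 3) % 4];
  const right = directions[(i + 1) % 4];
  return {
    [intended]: 1 - 2 * splitT,
    [left]: splitT,
    [right]: splitT
  };
}

console.log(slipDistribution('up', 0.025));
```

With `splitT = 0.025` the intended direction keeps a probability of 0.95, so a path that runs along a chasm carries a small risk of falling in at every step.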
Let's calculate that:
##### 3.2.3.1 Code
```javascript
var discountRate = 0.95;
var width = 10;
var height = 5;
var R = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};
var splitT = 0.025;

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();
rlQLearning.adoptConfig({splitT: splitT});

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);
```

**It returns:**
##### 3.2.3.2 Watch the [demo](demo/rl-grid-world.html)
In progress.
## A. Tools
* All flowcharts were created with [Google Drive](https://www.google.com/drive/)
## B. Authors
* Björn Hempel - _Initial work_ - [https://github.com/bjoern-hempel](https://github.com/bjoern-hempel)
## C. Licence
This tutorial is licensed under the MIT License - see the [LICENSE.md](/LICENSE.md) file for details
## D. Closing words
Have fun! :)