# A JavaScript reinforcement learning framework
This is a JavaScript reinforcement learning framework.
## 1. Introduction
In progress..
## 2. Markov decision process (MDP)
### 2.1 Theory
#### 2.1.1 The Bellman equation
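For reference (this is the general textbook form, not taken from this repository's code), the Bellman optimality equation expresses the optimal value V*(s) of a state in terms of the transition probabilities T(s, a, s'), the rewards R(s, a, s') and the discount factor γ:

```math
V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right]
```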
#### 2.1.2 The value iteration algorithm
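Value iteration turns the Bellman equation into an update rule that is applied repeatedly until the values converge (again the standard textbook form):

```math
V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_k(s') \right]
```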
#### 2.1.3 The Q-value iteration algorithm
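Q-value iteration works the same way but iterates directly over state-action values, which is what the `calculateQ` examples below appear to compute (standard textbook form, not extracted from this repository's code):

```math
Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
```

The optimal policy then simply picks the action with the largest Q-value in each state.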
### 2.2 Usage
#### 2.2.1 Super basic example
Let's look at a state s0 that contains three actions, all of which point back to s0. The first action a0 receives a reward of 1. The second action a1 receives a penalty of 1 (a reward of -1). The third action a2 is neutral and yields neither reward nor punishment. Each action has only one possible state change, so the transition probability is 1.0 (100%). This setup does not make much sense on its own, but it illustrates the procedure in detail:
As one can see, a0 is the best option and leads to the maximum reward, a1 is the most unfavorable choice because it results in punishment, and a2 is the neutral variant without any reward. Let's calculate that:
##### 2.2.1.1 Code
**The written-out version:**
```javascript
var discountFactor = 0;

var rl = new ReinforcementLearning.mdp();

/* s0 */
var s0 = rl.addState();

/* create a0, a1 and a2 */
var a0 = rl.addAction(s0);
var a1 = rl.addAction(s0);
var a2 = rl.addAction(s0);

/* add the action to state connections (state changes) */
rl.addStateChange(a0, s0, 1.0, 1);
rl.addStateChange(a1, s0, 1.0, -1);
rl.addStateChange(a2, s0, 1.0, 0);

var Q = rl.calculateQ(discountFactor);

console.log(JSON.stringify(Q));
```

**The short version:**
```javascript
var discountFactor = 0;

var rl = new ReinforcementLearning();

/* s0 */
var s0 = rl.addState();

/* s0.a0, s0.a1 and s0.a2 */
rl.addAction(s0, new StateChange(s0, 1.0, 1));
rl.addAction(s0, new StateChange(s0, 1.0, -1));
rl.addAction(s0, new StateChange(s0, 1.0, 0));

var Q = rl.calculateQ(discountFactor);

console.log(JSON.stringify(Q));
```

**It returns:**
```json
[
[1, -1, 0]
]
```

As we suspected above, a0 is the winner with the maximum value of Q(s=0): Q(s=0,a=0) = 1. The discountFactor is set to 0 because we only want to consider a single iteration step. The discountFactor determines the importance of future rewards: a factor of 0 makes the agent "short-sighted" by considering only the current rewards, while a factor close to 1 makes it strive for high long-term reward. Because it is set to 0, only the next step matters, which yields the result shown above.
The situation doesn't change if we look a little more far-sightedly and set the discount factor close to 1 (e.g. 0.9):
```javascript
var discountFactor = 0.9;
```

**It returns:**
```json
[
[9.991404955442832, 7.991404955442832, 8.991404955442832]
]
```

Q(s=0,a=0) is still the winner with the maximum of Q(s=0) ≈ 10. The `calculateQ` function iterates the Q-value formula shown above until the change in Q between two iterations falls below a certain threshold; the default value of this threshold is 0.001.
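This result can also be checked by hand (a quick sanity check, not part of the original README): taking a0 forever yields the geometric sum 1 + 0.9 + 0.9² + …, and one more application of the Q-value formula gives the three Q-values:

```math
V^*(s_0) = \frac{1}{1 - 0.9} = 10, \qquad Q(s_0, a_0) = 1 + 0.9 \cdot 10 = 10, \quad Q(s_0, a_1) = -1 + 9 = 8, \quad Q(s_0, a_2) = 0 + 9 = 9
```

The computed values 9.99, 7.99 and 8.99 lie slightly below these limits because the iteration stops as soon as the change falls below the threshold.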
##### 2.2.1.2 Watch the [demo](demo/rl-super-basic.html)
Discount rate 0.9:
#### 2.2.2 Basic example
Let's look at the next following example:
If we look at this example in the short term (discountRate = 0), it is a good idea to take a1 in s0 again and again and stay in state s0: we always get a reward of 2 and never receive the punishment of -5. From a far-sighted point of view (discountRate = 0.9), it is better to take a0, because in the future we will receive a reward of 10 in addition to the punishment of -5 (a sum of 5 instead of only 2). Let's calculate that:
##### 2.2.2.1 Code
```javascript
var discountRate = 0.9;

var rl = new ReinforcementLearning.mdp();

/* s0 and s1 */
var s0 = rl.addState();
var s1 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s1, 1.0, -5));
rl.addAction(s0, new StateChange(s0, 1.0, 2));

/* s1.a0 */
rl.addAction(s1, new StateChange(s0, 1.0, 10));

var Q = rl.calculateQ(discountRate);

console.log(JSON.stringify(Q));
```

**It returns:**
```json
[
[21.044799074176453, 20.93957918978874],
[28.93957918978874]
]
```

As we expected, from a far-sighted point of view it is better to choose s0.a0, which has the maximum value Q(s=0,a=0).
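As a quick sanity check (not part of the original README), the far-sighted fixed point can be solved by hand: with the policy "take a0 in s0 and a0 in s1" we get

```math
V(s_0) = -5 + 0.9 \, V(s_1), \qquad V(s_1) = 10 + 0.9 \, V(s_0) \;\;\Rightarrow\;\; V(s_0) = \frac{4}{1 - 0.81} \approx 21.05, \quad V(s_1) \approx 28.95
```

which matches the returned 21.04 and 28.94 up to the iteration threshold.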
##### 2.2.2.2 Watch the [demo](demo/rl-basic.html)
Discount rate 0.9:
##### 2.2.2.3 Comparison of different discount rates
| discountRate | type | s0 | s1 | s0 (winner) | s1 (winner) |
|--------------|--------------------|------------------|---------------|------------------------|------------------------|
| 0.0 | short-sighted | `[-5, 2]` | `[10]` | a1 | a0 |
| 0.1 | short-sighted | `[-3.98, 2.22]` | `[10.22]` | a1 | a0 |
| 0.5 | half short-sighted | `[1.00, 4.00]` | `[12.00]` | a1 | a0 |
| 0.9          | far-sighted        | `[21.04, 20.94]` | `[28.94]`     | a0                     | a0                     |

**Graphic:**
#### 2.2.3 More complex example
Let's look at a somewhat more complex example:
Short-sightedly it is a good idea to take a0 again and again and stay in state s0. But what about the far-sighted view? Is courage rewarded in this case? Let's calculate that:
##### 2.2.3.1 Code
```javascript
var discountRate = 0.9;

var rl = new ReinforcementLearning.mdp();

/* s0, s1 and s2 */
var s0 = rl.addState();
var s1 = rl.addState();
var s2 = rl.addState();

/* s0.a0 and s0.a1 */
rl.addAction(s0, new StateChange(s0, 1.0, 1));
rl.addAction(s0, new StateChange(s0, 0.5, -2), new StateChange(s1, 0.5, 0));

/* s1.a0 and s1.a1 */
rl.addAction(s1, new StateChange(s1, 1.0, 0));
rl.addAction(s1, new StateChange(s2, 1.0, -50));

/* s2.a0 */
rl.addAction(s2, new StateChange(s0, 0.8, 100), new StateChange(s1, 0.1, 0), new StateChange(s2, 0.1, 0));

var Q = rl.calculateQ(discountRate);

console.log(JSON.stringify(Q));
```

**It returns:**
```json
[
[61.75477734479686, 67.50622243150205],
[76.25766751820726, 84.73165595751362],
[149.70275422340958]
]
```

Looking at the example far-sightedly (`discountRate = 0.9`), it is a good idea to take action a1 in state s0, action a1 in state s1 and action a0 in state s2.
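As a quick plausibility check (not part of the original README), the value of s2.a0 can be reproduced from the returned values V(s0) ≈ 67.51, V(s1) ≈ 84.73 and V(s2) ≈ 149.70:

```math
Q(s_2, a_0) = 0.8 \, (100 + 0.9 \cdot 67.51) + 0.1 \cdot 0.9 \cdot 84.73 + 0.1 \cdot 0.9 \cdot 149.70 \approx 149.7
```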
##### 2.2.3.2 Watch the [demo](demo/rl-more-complex.html)
Discount rate 0.9:
##### 2.2.3.3 Comparison of different discount rates
| discountRate | type | s0 | s1 | s2 | s0 (winner) | s1 (winner) | s2 (winner) |
|---|-------------------------------------|----------------|----------------|----------|----------------|----------------|----------|
| 0.0 | short-sighted | `[1, -1]` | `[0, -50]` | `[80]` | a0 | a0 | a0 |
| 0.1 | short-sighted | `[1.11, -0.94]` | `[0, -41.91]` | `[80.90]` | a0 | a0 | a0 |
| 0.5 | half short-sighted | `[2, -0.5]` | `[0, -7.47]` | `[85.05]` | a0 | a0 | a0 |
| 0.9 | far-sighted | `[61.76, 67.51]` | `[76.27, 84.74]` | `[149.71]` | a1 | a1 | a0 |

**Graphic:**
#### 2.2.4 Real example
##### 2.2.4.1 Code
In progress..
##### 2.2.4.2 Watch the [demo](demo/rl-real.html)
In progress..
##### 2.2.4.3 Comparison of different discount rates
In progress..
## 3. Temporal Difference Learning and Q-Learning
### 3.1 Theory
#### 3.1.1 Formula
In progress..
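Until this section is written, the standard tabular Q-learning update rule may serve as a reference (the general textbook form, not extracted from this repository's code):

```math
Q(s, a) \;\leftarrow\; (1 - \alpha) \, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
```

Here α is the learning rate, γ the discount rate and (s, a, r, s') the observed transition.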
### 3.2 Usage
#### 3.2.1 Real example
##### 3.2.1.1 Code
In progress..
##### 3.2.1.2 Watch the [demo](demo/rl-real-q-learning.html)
In progress..
#### 3.2.2 Simple Grid World
Imagine a person who is currently on the field x=5, y=3. The goal is to reach the field x=1, y=3 safely. The red fields must be avoided: they are chasms that endanger the person (negative rewards, i.e. punishments). Which way should the person go?
Let's calculate that:
##### 3.2.2.1 Code
```javascript
var discountRate = 0.95;
var width = 5;
var height = 3;
var R = {
    0: {2: 100},
    1: {2: -10},
    2: {2: -10, 1: -10},
    3: {2: -10},
    4: {2: 0, 0: -10}
};

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);
```

**It returns:**
##### 3.2.2.2 Watch the [demo](demo/rl-grid-world.html)
In progress.
#### 3.2.3 Extended Grid World
As in example 3.2.2, but with a bigger grid world:
```javascript
var width = 10;
var height = 5;
var R = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};
```

That's easy now:
Now imagine that the person is drunk. That means that with a certain probability the person steps to the left or to the right even though they wanted to go straight ahead. Depending on how drunk the person is, we choose a probability of 2.5% of going left and a probability of 2.5% of going right (`splitT = 0.025`). What is the safest way now? **Preliminary consideration:** moving away from the chasms first and staying away from them might now be better than taking the shortest route.
Let's calculate that:
##### 3.2.3.1 Code
```javascript
var discountRate = 0.95;
var width = 10;
var height = 5;
var R = {
    0: {4: 100},
    2: {4: -10},
    3: {4: -10, 3: -10},
    4: {4: -10},
    5: {4: 0, 0: -10}
};
var splitT = 0.025;

/* create the q-learning instance */
var rlQLearning = new ReinforcementLearning.qLearning();
rlQLearning.adoptConfig({splitT: splitT});

/* build the grid world */
rlQLearning.buildGridWorld(width, height, R);

/* calculate Q */
var Q = rlQLearning.calculateQ(discountRate, {
    iterations: 100000,
    useSeededRandom: true,
    useOptimizedRandom: true
});

/* print result */
rlQLearning.printTableGridWorld(Q, width, R);
```

**It returns:**
##### 3.2.3.2 Watch the [demo](demo/rl-grid-world.html)
In progress.
## A. Tools
* All flowcharts were created with [Google Drive](https://www.google.com/drive/)
## B. Authors
* Björn Hempel - _Initial work_ - [https://github.com/bjoern-hempel](https://github.com/bjoern-hempel)
## C. Licence
This tutorial is licensed under the MIT License - see the [LICENSE.md](/LICENSE.md) file for details.
## D. Closing words
Have fun! :)