https://github.com/po-hsun-su/dprl
Deep reinforcement learning package for torch7
- Host: GitHub
- URL: https://github.com/po-hsun-su/dprl
- Owner: Po-Hsun-Su
- License: other
- Created: 2016-02-03T07:25:23.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-09-17T14:53:31.000Z (over 8 years ago)
- Last Synced: 2025-03-20T23:33:50.979Z (3 months ago)
- Topics: advantage-actor-critic, deep-reinforcement-learning, dqn, torch7
- Language: Lua
- Size: 95.7 KB
- Stars: 16
- Watchers: 4
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## dprl
Deep reinforcement learning package for torch7.

Algorithms:
* Deep Q-learning [[1]](#references)
* Double DQN [[2]](#references)
* Bootstrapped DQN (broken) [[3]](#references)
* Asynchronous advantage actor-critic [[4]](#references)

## Installation
```
git clone https://github.com/PoHsunSu/dprl.git
cd dprl
luarocks make dprl-scm-1.rockspec
```

## Example
#### Play Catch using double deep Q-learning
[Script](https://github.com/PoHsunSu/dprl/blob/master/example/test-dql-catch.lua)
#### Play Catch using asynchronous advantage actor-critic

## Library
The library provides implementations of deep reinforcement learning algorithms.

### Deep Q-learning (dprl.dql)
This class contains learning and testing procedures for **d**eep **Q**-**l**earning [[1]](#references).

#### dprl.dql(dqn, env, config[, statePreprop[, actPreprop]])
This is the constructor of `dql`. Its arguments are:
* `dqn`: a deep Q-network ([dprl.dqn](#dqn)) or double DQN ([dprl.ddqn](#ddqn))
* `env`: an environment with interfaces defined in [rlenvs](https://github.com/Kaixhin/rlenvs#api)
* `config`: a table containing configurations of `dql`
* `step`: number of steps before an episode terminates
* `updatePeriod`: number of steps between successive updates of the target Q-network
* `statePreprop`: a function which receives an observation from `env` as argument and returns a state for `dqn`. See [test-dql-catch.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-dql-catch.lua) for an example
* `actPreprop`: a function which receives the output of `dqn` and returns an action for `env`. See [test-dql-catch.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-dql-catch.lua) for an example
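For illustration, a minimal construction sketch (the `require 'dprl'` entry point, the `dqn` and `env` variables, and all values are assumptions of the sketch; see [test-dql-catch.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-dql-catch.lua) for the complete working setup):
```
local dprl = require 'dprl'  -- assumed entry point of the package

-- 'env' is an environment implementing the rlenvs API (e.g. Catch in the linked example)
-- 'dqn' is a dprl.dqn or dprl.ddqn instance (see the sections below)
local dqlConfig = {
  step = 1000,        -- steps before an episode terminates
  updatePeriod = 100  -- steps between target Q-network updates
}

-- Illustrative preprocessing: flatten the observation, pass the chosen action through unchanged
local statePreprop = function(observation) return observation:view(-1) end
local actPreprop = function(action) return action end

local dql = dprl.dql(dqn, env, dqlConfig, statePreprop, actPreprop)
```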
#### dql:learn(episode[, report])
This method implements the learning procedure of `dql`. Its arguments are:
* `episode`: the number of episodes `dql` learns for
* `report`: a function called at each step to report the status of learning. Its arguments are the transition, the current step number, and the current episode number. A transition contains the following keys:
* `s`: current state
* `a`: current action
* `r`: reward given action `a` at state `s`
* `ns`: next state given action `a` at state `s`
* `t`: a boolean value telling whether `ns` is a terminal state

You can use `report` to compute the total reward of an episode or to print the Q value estimated by `dqn`. See [test-dql-catch.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-dql-catch.lua) for an example.
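For illustration, a sketch of a `report` callback that accumulates the reward of each episode, using the `dql` instance from the sketch above (the episode count and message are arbitrary):
```
local totalReward = 0
local function report(transition, step, episode)
  totalReward = totalReward + transition.r
  if transition.t then
    print(string.format('episode %d ended at step %d with total reward %g',
                        episode, step, totalReward))
    totalReward = 0
  end
end

dql:learn(1000, report)  -- learn for 1000 episodes
```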
#### dql:test(episode[, report])
This method implements the test procedure of `dql`. Its arguments are:
* `episode`: the number of episodes `dql` tests for
* `report`: see `report` in [dql:learn](#dql:learn)

### Deep Q-network (dprl.dqn)

"dqn" means **d**eep **Q**-**n**etwork [[1]](#references). It is the back-end of `dql`. It implements interfaces to train the underlying neural network model. It also implements experience replay.

#### dprl.dqn(qnet, config, optim, optimConfig)
This is the constructor of `dqn`. Its arguments are:
* `qnet`: a neural network model built with the [nn](https://github.com/torch/nn) package. Its input is always a **mini-batch of states** whose dimensions are defined by `statePreprop` (see [dprl.dql](#dprl.dql)). Its output is the estimated Q values of all possible actions.
* `config`: a table containing the following configurations of `dqn`
* `replaySize`: size of replay memory
* `batchSize`: size of mini-batch of training cases sampled on each replay
* `discount`: discount factor of reward
* `epsilon`: the ε of ε-greedy exploration
* `optim`: an optimization method from the [optim](https://github.com/torch/optim) package for training `qnet`.
* `optimConfig`: configuration of `optim`

### Double Deep Q-network (dprl.ddqn)

"ddqn" means **d**ouble **d**eep **Q**-**n**etwork [[2]](#references). It inherits from `dprl.dqn`. Double deep Q-learning is obtained by giving [`ddqn`](#ddqn), instead of [`dqn`](#dqn), to [`dql`](#dql). The only difference between `dprl.ddqn` and `dprl.dqn` is how it computes the target Q-value. `dprl.ddqn` is recommended because it alleviates the over-estimation problem of `dprl.dqn` [[2]](#references).
#### dprl.ddqn(qnet, config, optim, optimConfig)
This is the constructor of `dprl.ddqn`. Its arguments are identical to [dprl.dqn](#dprl.dqn).
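For illustration, a construction sketch covering `dprl.dqn` and `dprl.ddqn` (the network architecture, hyperparameter values, and the `require 'dprl'` entry point are assumptions of the sketch, not prescriptions):
```
require 'nn'
require 'optim'
local dprl = require 'dprl'  -- assumed entry point of the package

-- Hypothetical Q-network: flattens a 24x24 observation and outputs Q-values for 3 actions
local qnet = nn.Sequential()
  :add(nn.View(24 * 24))
  :add(nn.Linear(24 * 24, 128))
  :add(nn.ReLU())
  :add(nn.Linear(128, 3))

local dqnConfig = {
  replaySize = 10000,  -- replay memory size
  batchSize = 32,      -- mini-batch size per replay
  discount = 0.99,     -- reward discount factor
  epsilon = 0.1        -- ε of ε-greedy exploration
}

local ddqn = dprl.ddqn(qnet, dqnConfig, optim.rmsprop, {learningRate = 1e-3})
-- dprl.dqn takes identical arguments:
-- local dqn = dprl.dqn(qnet, dqnConfig, optim.rmsprop, {learningRate = 1e-3})
```
Passing `ddqn` (instead of `dqn`) as the first argument of `dprl.dql` is all that is needed to switch to double deep Q-learning.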
### Bootstrapped deep Q-learning (dprl.bdql) (Not working now)

`dprl.bdql` implements the learning procedure in Bootstrapped DQN. Except for initialization, its usage is identical to `dprl.dql`.

1. Initialize a bootstrapped deep Q-learning agent.
```
local bdql = dprl.bdql(bdqn, env, config, statePreprop, actPreprop)
```
Except for the first argument `bdqn`, which is an instance of [`dprl.bdqn`](#bdqn), the arguments are defined the same as in [`dprl.dql`](#dql).

### Bootstrapped deep Q-network (dprl.bdqn)
`dprl.bdqn` inherits from [`dprl.dqn`](#dqn). It is customized for the bootstrapped deep Q-network.

1. Initialize `dprl.bdqn`
```
local bdqn = dprl.bdqn(bqnet, config, optim, optimConfig)
```
arguments:
* `bqnet`: a bootstrapped neural network with module [`Bootstrap`](#Bootstrap).
* `config`: a table containing the following configurations for `bdqn`
* `replaySize`, `batchSize`, `discount`, and `epsilon`: see `config` in [`dprl.dqn`](#dqn).
* `headNum`: the number of heads in bootstrapped neural network `bqnet`.
* `optim`: see `optim` in [`dprl.dqn`](#dqn).
* `optimConfig`: see `optimConfig` in [`dprl.dqn`](#dqn).

### Asynchronous learning (dprl.asyncl)
`dprl.asyncl` is the framework for asynchronous learning [[4]](#references). It manages the multi-threaded procedure in asynchronous learning. The asynchronous advantage actor-critic (a3c) algorithm is realized by providing the advantage actor-critic agent ([dprl.aac](#aac)) to asynchronous learning (`dprl.asyncl`). See [test-a3c-atari.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-a3c-atari.lua) for an example.
#### dprl.asyncl(asynclAgent, env, config[, statePreprop[, actPreprop[, rewardPreprop]]])
This is the constructor of `dprl.asyncl`. Its arguments are:
* `asynclAgent`: a learning agent such as the advantage actor-critic agent ([dprl.aac](#aac)).
* `env`: an environment with interfaces defined in [rlenvs](https://github.com/Kaixhin/rlenvs#api)
* `config`: a table containing configurations of `asyncl`
* `nthread`: number of actor-learner threads
* `maxSteps`: maximum number of steps per episode during testing
* `loadPackage`: an optional function called before constructing an actor-learner thread, used to load packages
* `loadEnv`: an optional function to load the environment (`env`) in an actor-learner thread when the environment is not serializable. Environments written in Lua are serializable, but those written in C, like the Atari emulator, are not. See [test-a3c-atari.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-a3c-atari.lua) for an example.
* `statePreprop`: a function which receives an observation from `env` as argument and returns a state for `asynclAgent`. See [test-a3c-catch.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-a3c-catch.lua) for an example.
* `actPreprop`: a function which receives the output of `aac` (or another learning agent) and returns an action for `env`. See [test-a3c-catch.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-a3c-catch.lua) for an example.
* `rewardPreprop`: a function which receives a reward from `env` as argument and returns a processed reward for `asynclAgent`. See the reward clipping in [test-a3c-atari.lua](https://github.com/PoHsunSu/dprl/blob/master/example/test-a3c-atari.lua) for an example.
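For illustration, a construction sketch (the configuration values are arbitrary; `aac` is an advantage actor-critic agent as described below, and `env`, `statePreprop`, and `actPreprop` follow the same pattern as in `dprl.dql`):
```
local asynclConfig = {
  nthread = 4,       -- number of actor-learner threads
  maxSteps = 10000   -- step limit per test episode
  -- loadPackage/loadEnv are only needed when the environment is not serializable
}

-- Illustrative reward clipping to [-1, 1]
local rewardPreprop = function(reward)
  return math.max(-1, math.min(1, reward))
end

local asyncl = dprl.asyncl(aac, env, asynclConfig, statePreprop, actPreprop, rewardPreprop)
```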
#### dprl.asyncl:learn(Tmax[, report])
This method is the learning procedure of `dprl.asyncl`. Its arguments are:
* `Tmax`: the limit of total (global) learning steps of all actor-learner threads
* `report`: a function called at each step to report the status of learning. Its arguments are the transition, the learning-step count within the thread, and the global learning-step count. A transition contains the following keys:
* `s`: current state
* `a`: current action
* `r`: reward given action `a` at state `s`
* `ns`: next state given action `a` at state `s`
* `t`: a boolean value telling whether `ns` is a terminal state
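For illustration, a usage sketch with the `asyncl` instance from the sketch above (the step budget and logging interval are arbitrary):
```
asyncl:learn(5e6, function(transition, threadStep, globalStep)
  if globalStep % 10000 == 0 then
    print(string.format('global step %d (thread step %d), reward %g',
                        globalStep, threadStep, transition.r))
  end
end)
```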
### Advantage actor-critic (dprl.aac)

This class implements the routines called by the asynchronous framework `dprl.asyncl` to realize the asynchronous advantage actor-critic (a3c) algorithm [[4]](#references). It trains the actor and critic neural network models that are provided at construction.
#### dprl.aac(anet, cnet, config, optim, optimConfig)
This is the constructor of `dprl.aac`. Its arguments are:
* `anet`: the actor network. Its input and output must conform to the requirement of environment `env` in `dprl.asyncl`. Note that you can use `statePreprop` and `actPreprop` in `dprl.asyncl`.
* `cnet`: the critic network. Its input is the same as `anet` and its output must be a single value tensor.
* `config`: a table containing configurations of `aac`
* `tmax`: number of steps between performing asynchronous update of global parameters
* `discount`: discount factor for computing total reward
* `criticGradScale`: a scale factor multiplying the gradients of the critic network `cnet`, allowing different effective learning rates for the actor and the critic.

To share parameters between the actor network and the critic network, make sure the gradient parameters (i.e. `'gradWeight'` and `'gradBias'`) are shared as well. For example,
```
local cnet = anet:clone('weight','bias','gradWeight','gradBias')
```
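For illustration, one possible arrangement sketched with a shared trunk (all layer sizes, hyperparameter values, and the `require 'dprl'` entry point are assumptions of the sketch):
```
require 'nn'
require 'optim'
local dprl = require 'dprl'  -- assumed entry point of the package

-- Shared feature trunk (illustrative sizes)
local trunk = nn.Sequential()
  :add(nn.View(24 * 24))
  :add(nn.Linear(24 * 24, 128))
  :add(nn.ReLU())

-- Actor: outputs action probabilities
local anet = nn.Sequential()
  :add(trunk)
  :add(nn.Linear(128, 3))
  :add(nn.SoftMax())

-- Critic: single state-value output, sharing the trunk parameters and gradients
local cnet = nn.Sequential()
  :add(trunk:clone('weight', 'bias', 'gradWeight', 'gradBias'))
  :add(nn.Linear(128, 1))

local aacConfig = {
  tmax = 5,               -- steps between asynchronous updates
  discount = 0.99,
  criticGradScale = 0.5   -- scale applied to the critic's gradients
}

local aac = dprl.aac(anet, cnet, aacConfig, optim.rmsprop, {learningRate = 7e-4})
```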
#### Bootstrap (`nn.Bootstrap`)

This module is for constructing a bootstrapped network [[3]](#references). Let the shared network be `shareNet` and the head network be `headNet`. A bootstrapped network `bqnet` for `dprl.bdqn` can be constructed as follows:
```
require 'Bootstrap'

-- Definition of 'shareNet' and head 'headNet' goes here
-- Decorate headNet with nn.Bootstrap
local bootstrappedHeadNet = nn.Bootstrap(headNet, headNum, param_init)

-- Connect shareNet and bootstrappedHeadNet
local bqnet = nn.Sequential():add(shareNet):add(bootstrappedHeadNet)
```
* `headNum`: the number of heads of the bootstrapped network
* `param_init`: a scalar value controlling the range or variance of parameter initialization in `headNet`. It is passed to the method `headNet:reset(param_init)` after constructing the clones of `headNet`.

## References
[1] Volodymyr Mnih et al., “Human-Level Control through Deep Reinforcement Learning,” Nature 518, no. 7540 (February 26, 2015): 529–33, doi:10.1038/nature14236.
[2] Hado van Hasselt, Arthur Guez, and David Silver, “Deep Reinforcement Learning with Double Q-Learning,” arXiv:1509.06461, September 22, 2015, http://arxiv.org/abs/1509.06461.
[3] Ian Osband et al., “Deep Exploration via Bootstrapped DQN,” arXiv:1602.04621, February 15, 2016, http://arxiv.org/abs/1602.04621.
[4] Volodymyr Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning,” arXiv:1602.01783, February 4, 2016, http://arxiv.org/abs/1602.01783.