An open API service indexing awesome lists of open source software.

https://github.com/williamFalcon/DeepRLHacks

Hacks for training RL systems from John Schulman's lecture at Deep RL Bootcamp (Aug 2017)
https://github.com/williamFalcon/DeepRLHacks

Last synced: 11 months ago
JSON representation

Hacks for training RL systems from John Schulman's lecture at Deep RL Bootcamp (Aug 2017)

Awesome Lists containing this project

README

          

# DeepRLHacks
From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017)
These are tricks written down while attending summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).

**Update**: RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures).

## Tips to debug new algorithm
1. Simplify the problem by using a low dimensional state space environment.
- John suggested to use the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because the problem has a 2-D state space (angle of pendulum and velocity).
- Easy to visualize what the value function looks like and what state the algorithm should be in and how they evolve over time.
- Easy to visually spot why something isn't working (aka, is the value function smooth enough and so on).

2. To test if your algorithm is reasonable, construct a problem you know it should work on.
- Ex: For hierarchical reinforcement learning you'd construct a problem with an OBVIOUS hierarchy it should learn.
- Can easily see if it's doing the right thing.
- WARNING: Don't over fit method to your toy problem (realize it's a toy problem).

3. Familiarize yourself with certain environments you know well.
- Over time, you'll learn how long the training should take.
- Know how rewards evolve, etc...
- Allows you to set a benchmark to see how well you're doing against your past trials.
- John uses the hopper robot where he knows how fast learning should take, and he can easily spot odd behaviors.

## Tips to debug a new task
1. Simplify the task
- Start simple until you see signs of life.
- Approach 1: Simplify the feature space:
- For example, if you're learning from images (huge dimensional space), then maybe hand engineer features first. Example: If you think your function is trying to approximate a location of something, use the x,y location as features as step 1.
- Once it starts working, make the problem harder until you solve the full problem.
- Approach 2: simplify the reward function.
- Formulate so it can give you FAST feedback to know whether you're doing the right thing or not.
- Ex: Have reward for robot when it hits the target (+1). Hard to learn because maybe too much happens in between starting and reward. Reformulate as distance to target instead which will increase learning and allow you to iterate faster.

## Tips to frame a problem in RL
Maybe it's unclear what the features are and what the reward should be, or if it's feasible at all.

1. First step: Visualize a random policy acting on this problem.
- See where it takes you.
- If random policy on occasion does the right thing, then high chance RL will do the right thing.
- Policy gradient will find this behavior and make it more likely.
- If random policy never does the right thing, RL will likely also not.

2. Make sure observations usable:
- See if YOU could control the system by using the same observations you give the agent.
- Example: Look at preprocessed images yourself to make sure you don't remove necessary details or hinder the algorithm in a certain way.

3. Make sure everything is reasonably scaled.
- Rule of thumb:
- Observations: Make everything mean 0, standard deviation 1.
- Reward: If you control it, then scale it to a reasonable value.
- Do it across ALL your data so far.
- Look at all observations and rewards and make sure there aren't crazy outliers.

4. Have good baseline whenever you see a new problem.
- It's unclear which algorithm will work, so have a set of baselines (from other methods)
- Cross entropy method
- Policy gradient methods
- Some kind of Q-learning method (checkout [OpenAI Baselines](https://github.com/openai/baselines) as a starter or [RLLab](https://github.com/rll/rllab))

## Reproducing papers
Sometimes (often), it's hard to reproduce results from papers. Some tricks to do that:

1. Use more samples than needed.
2. Policy right... but not exactly
- Try to make it work a little bit.
- Then tweak hyper parameters to get up to the public performance.
- If want to get it to work at ALL, use bigger batch sizes.
- If batch size is too small, noisy will overpower signal.
- Example: TRPO, John was using too tiny of a batch size and had to use 100k time steps.
- For DQN, best hyperparams: 10k time steps, 1mm frames in replay buffer.

## Guidelines on-going training process
Sanity check that your training is going well.

1. Look at sensitivity of EVERY hyper parameter
- If algo is too sensitive, then NOT robust and should NOT be happy with it.
- Sometimes it happens that a method works one way because of funny dynamics but NOT in general.

2. Look for indicators that the optimization process is healthy.
- Varies
- Look at whether value function is accurate.
- Is it predicting well?
- Is it predicting returns well?
- How big are the updates?
- Standard diagnostics from deep networks

3. Have a system for continuously benchmarking code.
- Needs DISCIPLINE.
- Look at performance across ALL previous problems you tried.
- Sometimes it'll start working on one problem but mess up performance in others.
- Easy to over fit on a single problem.
- Have a battery of benchmarks you run occasionally.

4. Think your algorithm is working but you're actually seeing random noise.
- Example: Graph of 7 tasks with 3 algorithms and looks like 1 algorithm might be doing best on all problems, but turns out they're all the same algorithm with DIFFERENT random seeds.

5. Try different random seeds!!
- Run multiple times and average.
- Run multiple tasks on multiple seeds.
- If not, you're likely to over fit.

6. Additional algorithm modifications might be unnecessary.
- Most tricks are ACTUALLY normalizing something in some way or improving your optimization.
- A lot of tricks also have the same effect... So you can remove some of them and SIMPLIFY your algorithm (VERY KEY).

7. Simplify your algorithm
- Will generalize better

8. Automate your experiments
- Don't spend your whole day watching your code spit out numbers.
- Launch experiments on cloud services and analyze results.
- Frameworks to track experiments and results:
- Mostly use iPython notebooks.
- DBs seem unnecessary to store results.

## General training strategies
1. Whiten and standardize data (for ALL seen data since the beginning).
- Observations:
- Do it by computing a running mean and standard deviation. Then z-transform everything.
- Over ALL data seen (not just the recent data).
- At least it'll scale down over time how fast it's changing.
- Might trip up the optimizer if you keep changing the objective.
- Rescaling (by using recent data) means your optimizer probably didn't know about that and performance will collapse.

- Rewards:
- Scale and DON'T shift.
- Affects agent's will to live.
- Will change the problem (aka, how long you want it to survive).

- Standardize targets:
- Same way as rewards.

- PCA Whitening?
- Could help.
- Starting to see if it actually helps with neural nets.
- Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow.

2. Parameters that inform discount factors.
- Determines how far you're giving credit assignment.
- Ex: if factor is 0.99, then you're ignoring what happened 100 steps ago... Means you're shortsighted.
- Better to look at how that corresponds to real time
- Intuition, in RL we're usually discretizing time.
- aka: are those 100 steps 3 seconds of actual time?
- what happens during that time?
- If TD methods for policy gradient of Value fx estimation, gamma can be close to 1 (like 0.999)
- Algo becomes very stable.

3. Look to see that problem can actually be solved in the discretized level.
- Example: In game if you're doing frame skip.
- As a human, can you control it or is it impossible?
- Look at what random exploration looks like
- Discretization determines how far your Brownian motion goes.
- If do many actions in a row, then tend to explore further.
- Choose your time discretization in a way that works.

4. Look at episode returns closely.
- Not just mean, look at min and max.
- The max return is something your policy can hone in pretty well.
- Is your policy ever doing the right thing??
- Look at episode length (sometimes more informative than episode reward).
- if on game you might be losing every time so you might never win, but... episode length can tell you if you're losing SLOWER.
- Might see an episode length improvement in the beginning but maybe not reward.

## Policy gradient diagnostics
1. Look at entropy really carefully
- Entropy in ACTION space
- Care more about entropy in state space, but don't have good methods for calculating that.
- If going down too fast, then policy becoming deterministic and will not explore.
- If NOT going down, then policy won't be good because it is really random.
- Can fix by:
- KL penalty
- Keep entropy from decreasing too quickly.
- Add entropy bonus.
- How to measure entropy.
- For most policies can compute entropy analytically.
- If continuous, it's usually a Gaussian, so can compute differential entropy.

2. Look at KL divergence
- Look at size of updates in terms of KL divergence.
- example:
- If KL is .01 then very small.
- If 10 then too much.

3. Baseline explained variance.
- See if value function is actually a good predictor or a reward.
- if negative it might be overfitting or noisy.
- Likely need to tune hyper parameters

4. Initialize policy
- Very important (more so than in supervised learning).
- Zero or tiny final layer to maximize entropy
- Maximize random exploration in the beginning

## Q-Learning Strategies
1. Be careful about replay buffer memory usage.
- You might need a huge buffer, so adapt code accordingly.

2. Play with learning rate schedule.

3. If converges slowly or has slow warm-up period in the beginning
- Be patient... DQN converges VERY slowly.

## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/):
1. A good feature can be to take the difference between two frames.
- This delta vector can highlight slight state changes otherwise difficult to distinguish.