{"id":13935395,"url":"https://github.com/williamFalcon/DeepRLHacks","last_synced_at":"2025-07-19T20:32:32.684Z","repository":{"id":75973129,"uuid":"101640844","full_name":"williamFalcon/DeepRLHacks","owner":"williamFalcon","description":"Hacks for training RL systems from John Schulman's lecture at Deep RL Bootcamp  (Aug 2017)","archived":false,"fork":false,"pushed_at":"2017-10-13T12:45:17.000Z","size":9,"stargazers_count":1094,"open_issues_count":1,"forks_count":123,"subscribers_count":50,"default_branch":"master","last_synced_at":"2024-10-15T11:25:53.387Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/williamFalcon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-08-28T12:27:10.000Z","updated_at":"2024-08-16T18:15:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"081ae6b9-d6ec-4313-8a2b-845f4d392875","html_url":"https://github.com/williamFalcon/DeepRLHacks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williamFalcon%2FDeepRLHacks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williamFalcon%2FDeepRLHacks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williamFalcon%2FDeepRLHacks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/williamFalcon%2FDeepRLHacks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/williamFalcon","download_url":"https://codeload.github.com/williamFa
lcon/DeepRLHacks/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226676966,"owners_count":17665998,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-07T23:01:41.421Z","updated_at":"2024-11-27T03:30:34.850Z","avatar_url":"https://github.com/williamFalcon.png","language":null,"funding_links":[],"categories":["Others"],"sub_categories":[],"readme":"# DeepRLHacks  \nFrom a talk given by [John Schulman](http://joschu.net/) titled \"The Nuts and Bolts of Deep RL Research\" (Aug 2017)   \nThese are tricks written down while attending summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).   \n\n**Update**: RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ\u0026feature=youtu.be) and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures). \n\n## Tips to debug new algorithm   \n1. Simplify the problem by using a low dimensional state space environment.      \n    - John suggested to use the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because the problem has a 2-D state space (angle of pendulum and velocity).    \n    - Easy to visualize what the value function looks like and what state the algorithm should be in and how they evolve over time.  \n    - Easy to visually spot why something isn't working (aka, is the value function smooth enough and so on).\n\n2. To test if your algorithm is reasonable, construct a problem you know it should work on.   
\n    - Ex: For hierarchical reinforcement learning you'd construct a problem with an OBVIOUS hierarchy it should learn. \n    - Can easily see if it's doing the right thing.   \n    - WARNING: Don't overfit your method to your toy problem (realize it's a toy problem).   \n\n3. Familiarize yourself with certain environments you know well.\n    - Over time, you'll learn how long the training should take.   \n    - Know how rewards evolve, etc... \n    - Allows you to set a benchmark to see how well you're doing against your past trials.    \n    - John uses the hopper robot where he knows how fast learning should be, and he can easily spot odd behaviors.    \n\n## Tips to debug a new task   \n1. Simplify the task\n    - Start simple until you see signs of life.   \n    - Approach 1: Simplify the feature space: \n      - For example, if you're learning from images (high-dimensional space), then maybe hand-engineer features first. Example: If you think your function is trying to approximate a location of something, use the x,y location as features as step 1. \n      - Once it starts working, make the problem harder until you solve the full problem.   \n    - Approach 2: Simplify the reward function.\n      - Formulate so it can give you FAST feedback to know whether you're doing the right thing or not.   \n      - Ex: Have reward for robot when it hits the target (+1). Hard to learn because maybe too much happens in between starting and reward. Reformulate as distance to target instead, which will speed up learning and allow you to iterate faster.    \n\n## Tips to frame a problem in RL   \nMaybe it's unclear what the features are and what the reward should be, or if it's feasible at all.    \n\n1. First step: Visualize a random policy acting on this problem.   \n    - See where it takes you.    \n    - If random policy on occasion does the right thing, then there's a high chance RL will do the right thing.   
\n      - Policy gradient will find this behavior and make it more likely.  \n    - If random policy never does the right thing, RL likely won't either.   \n\n2. Make sure observations are usable:\n    - See if YOU could control the system by using the same observations you give the agent.   \n      - Example: Look at preprocessed images yourself to make sure you don't remove necessary details or hinder the algorithm in a certain way.\n\n3. Make sure everything is reasonably scaled.   \n    - Rule of thumb: \n      - Observations: Make everything mean 0, standard deviation 1.\n      - Reward: If you control it, then scale it to a reasonable value.\n        - Do it across ALL your data so far.   \n    - Look at all observations and rewards and make sure there aren't crazy outliers.    \n\n4. Have a good baseline whenever you see a new problem.   \n    - It's unclear which algorithm will work, so have a set of baselines (from other methods)\n      - Cross-entropy method   \n      - Policy gradient methods \n      - Some kind of Q-learning method (check out [OpenAI Baselines](https://github.com/openai/baselines) as a starter or [RLLab](https://github.com/rll/rllab)) \n\n## Reproducing papers    \nSometimes (often), it's hard to reproduce results from papers. Some tricks to do that:   \n\n1. Use more samples than needed.    \n2. Policy right... but not exactly\n     - Try to make it work a little bit.   \n     - Then tweak hyperparameters to get up to the published performance.   \n     - If you want to get it to work at ALL, use bigger batch sizes. \n       - If batch size is too small, noise will overpower the signal.  \n       - Example: TRPO, John was using too small a batch size and had to use 100k time steps. \n       - For DQN, best hyperparams: 10k time steps, 1M frames in replay buffer.\n\n\n## Guidelines for the ongoing training process   \nSanity check that your training is going well.    \n\n
1. Look at sensitivity of EVERY hyperparameter\n    - If algo is too sensitive, then it's NOT robust and you should NOT be happy with it.   \n    - Sometimes it happens that a method works one way because of funny dynamics but NOT in general.\n\n2. Look for indicators that the optimization process is healthy.  \n    - Varies \n    - Look at whether value function is accurate.\n      - Is it predicting well?    \n      - Is it predicting returns well?\n      - How big are the updates?   \n    - Standard diagnostics from deep networks   \n\n3. Have a system for continuously benchmarking code.    \n    - Needs DISCIPLINE.   \n    - Look at performance across ALL previous problems you tried.   \n      - Sometimes it'll start working on one problem but mess up performance in others.   \n      - Easy to overfit on a single problem.\n    - Have a battery of benchmarks you run occasionally.   \n\n4. You might think your algorithm is working when you're actually seeing random noise.   \n    - Example: Graph of 7 tasks with 3 algorithms and looks like 1 algorithm might be doing best on all problems, but turns out they're all the same algorithm with DIFFERENT random seeds.   \n\n5. Try different random seeds!!\n    - Run multiple times and average.   \n    - Run multiple tasks on multiple seeds. \n      - If not, you're likely to overfit.   \n\n6. Additional algorithm modifications might be unnecessary.      \n    - Most tricks are ACTUALLY normalizing something in some way or improving your optimization.  \n    - A lot of tricks also have the same effect... So you can remove some of them and SIMPLIFY your algorithm (VERY KEY).   \n\n7. Simplify your algorithm   \n    - Will generalize better\n\n8. Automate your experiments   \n    - Don't spend your whole day watching your code spit out numbers.   \n    - Launch experiments on cloud services and analyze results.   
\n    - Frameworks to track experiments and results:\n      - Mostly use IPython notebooks.\n      - DBs seem unnecessary to store results.   \n\n\n## General training strategies\n1. Whiten and standardize data (for ALL seen data since the beginning).   \n    - Observations:\n      - Do it by computing a running mean and standard deviation. Then z-transform everything.   \n      - Over ALL data seen (not just the recent data).\n        - At least it'll scale down over time how fast it's changing.\n        - Might trip up the optimizer if you keep changing the objective. \n        - Rescaling (by using recent data) means your optimizer probably didn't know about that and performance will collapse.\n  \n    - Rewards:\n      - Scale and DON'T shift. \n        - Shifting affects the agent's will to live.\n        - Will change the problem (aka, how long you want it to survive).\n\n    - Standardize targets:\n      - Same way as rewards.\n  \n    - PCA Whitening?\n      - Could help.\n      - Starting to see if it actually helps with neural nets.\n      - Extreme scales, e.g. (-1000, 1000) or (-0.001, 0.001), certainly make learning slow.   \n\n2. Parameters that inform discount factors.\n    - Determines how far you're giving credit assignment.   \n    - Ex: if factor is 0.99, then you're ignoring what happens more than about 100 steps away (effective horizon 1/(1 - 0.99) = 100)... Means you're shortsighted. \n      - Better to look at how that corresponds to real time \n        - Intuition: in RL we're usually discretizing time.  \n        - aka: are those 100 steps 3 seconds of actual time? \n        - What happens during that time?\n    - If using TD methods for policy gradient or value function estimation, gamma can be close to 1 (like 0.999)\n      - Algo becomes very stable.   \n\n3. Look to see that the problem can actually be solved at the discretized level.  
\n    - Example: In a game, if you're doing frame skip.\n      - As a human, can you control it or is it impossible?\n      - Look at what random exploration looks like \n        - Discretization determines how far your Brownian motion goes. \n        - If you do many actions in a row, you tend to explore further.   \n        - Choose your time discretization in a way that works.\n\n4. Look at episode returns closely.   \n    - Not just mean, look at min and max.\n      - The max return is something your policy can home in on pretty well.\n      - Is your policy ever doing the right thing??\n    - Look at episode length (sometimes more informative than episode reward).\n      - In a game you might be losing every time, so you never see a win, but... episode length can tell you if you're losing SLOWER.\n      - Might see an episode length improvement in the beginning but maybe not reward.\n\n\n## Policy gradient diagnostics   \n1. Look at entropy really carefully   \n    - Entropy in ACTION space\n      - Care more about entropy in state space, but don't have good methods for calculating that.\n    - If going down too fast, then policy is becoming deterministic and will not explore.   \n    - If NOT going down, then policy won't be good because it is really random.   \n    - Can fix by:\n      - KL penalty\n        - Keep entropy from decreasing too quickly.    \n      - Add entropy bonus.\n    - How to measure entropy.   \n      - For most policies can compute entropy analytically. \n        - If continuous, it's usually a Gaussian, so can compute differential entropy.  \n    \n2. Look at KL divergence\n    - Look at size of updates in terms of KL divergence.   \n    - Example:\n      - If KL is .01 then very small.\n      - If 10 then too much.\n  \n3. Baseline explained variance.   \n    - See if value function is actually a good predictor of the returns.   \n      - If negative, it might be overfitting or noisy.\n        - Likely need to tune hyperparameters\n\n
4. Initialize policy   \n    - Very important (more so than in supervised learning).   \n    - Zero or tiny final layer to maximize entropy\n      - Maximize random exploration in the beginning   \n\n## Q-Learning Strategies \n1. Be careful about replay buffer memory usage.  \n    - You might need a huge buffer, so adapt code accordingly.   \n\n2. Play with the learning rate schedule.   \n\n3. If it converges slowly or has a slow warm-up period in the beginning\n    - Be patient... DQN converges VERY slowly.   \n\n\n## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/):   \n1. A good feature can be to take the difference between two frames.   \n    - This delta vector can highlight slight state changes otherwise difficult to distinguish.   \n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FwilliamFalcon%2FDeepRLHacks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FwilliamFalcon%2FDeepRLHacks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FwilliamFalcon%2FDeepRLHacks/lists"}