{"id":20602462,"url":"https://github.com/augustunderground/gaceir","last_synced_at":"2026-06-18T16:32:27.636Z","repository":{"id":112836087,"uuid":"456946941","full_name":"AugustUnderground/gaceir","owner":"AugustUnderground","description":"RL Agents to solve GACE","archived":false,"fork":false,"pushed_at":"2022-03-15T17:00:23.000Z","size":732,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-09T12:53:54.537Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AugustUnderground.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-08T13:36:34.000Z","updated_at":"2022-02-17T17:26:32.000Z","dependencies_parsed_at":"2023-05-31T10:15:09.608Z","dependency_job_id":null,"html_url":"https://github.com/AugustUnderground/gaceir","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AugustUnderground/gaceir","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AugustUnderground%2Fgaceir","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AugustUnderground%2Fgaceir/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AugustUnderground%2Fgaceir/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AugustUnderground%2Fgaceir/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AugustUnderground",
"download_url":"https://codeload.github.com/AugustUnderground/gaceir/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AugustUnderground%2Fgaceir/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34499405,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-18T02:00:06.871Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T09:13:55.965Z","updated_at":"2026-06-18T16:32:27.624Z","avatar_url":"https://github.com/AugustUnderground.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GAC²EIR\n\nClean and Self Contained implementations of Reinforcement Learning Agents for\nsolving [GAC²E](https://github.com/augustunderground/gace).\n\n- [X] PPO (Probabilistic)\n- [X] TD3 (Deterministic)\n- [X] SAC (Probabilistic)\n\n## Usage\n\nCleanup:\n\n```\n$ make clean wipe kill\n```\n\nWill clean model directory, wipe run directory and kill all spectre\ninstances.\n\n# Notes on Algorithms\n\nMy personal notes on implementing the algorithms.\n\n## Proximal Policy Optimization (PPO)\n\n[Proximal Policy Optimization Algorithm](https://arxiv.org/abs/1707.06347)\n\n- Keep track of small, fixed length batch of trajectories (s,a,r,d,v,l)\n- Multiple epochs for each batch\n- batch sized chunks of memories\n- 
Critic **only** criticises states\n- Actor outputs probabilities for taking an action (probabilistic)\n\n### Hyper Parameters\n\n- Memory Size\n- Batch size\n- Number of Epochs\n\n### Network Updates\n\n#### Actor\n\nConservative Policy Iteration (CPI):\n\n$$ L^{CPI} (\\theta) = E_{t} ( \\frac{\\pi_{\\theta} (a_{t} | s_{t})}{\\pi_{\\theta,old} (a_{t} | s_{t})} \\cdot A_{t} ) = E_{t} (r_{t}(\\theta) \\cdot A_{t}) $$\n\nWhere\n- A: Advantage\n- E: Expectation\n- π: Actor network returning the probability of an action a for a given state s at\n  a given time t\n- θ: Current network parameters\n\n$$ L^{CLIP} = E_{t} ( min(r_{t}(\\theta) \\cdot A_{t}, clip(r_{t}(\\theta), 1 - \\epsilon, 1 + \\epsilon) \\cdot A_{t} ) ) $$\n\nWhere\n- ε ≈ 0.2\n\n**Pessimistic lower bound of loss**\n\n##### Advantage\n\nGives benefit of new state over previous state\n\n$$ A_{t} = \\delta_{t} + (\\gamma \\lambda) \\cdot \\delta_{t + 1} + ... + (\\gamma \\lambda)^{T - (t + 1)} \\cdot \\delta_{T - 1} $$\n\nwith\n\n$$ \\delta_{t} = r_{t} + \\gamma \\cdot V(s_{t + 1}) - V(s_{t}) $$\n\nWhere\n- V(s_t): Critic output, aka Estimated Value (stored in memory)\n- γ ≈ 0.95\n\n#### Critic\n\nreturn = advantage + value\n\nWhere value is critic output stored in memory\n\n$$ L^{VF} = MSE(return - value) $$\n\n#### Total Loss\n\n$$ L^{CLIP + VF + S}_{t} (\\theta) = E_{t} [ L^{CLIP}_{t} (\\theta) - c_{1} \\cdot L^{VF}_{t} (\\theta) + c_{2} \\cdot S[\\pi_{\\theta}](s_{t}) ] $$\n\nGradient Ascent, **not** Descent!\n\n- S: only used for shared AC Network\n- c1 = 0.5\n\n## Twin Delayed Deep Deterministic Policy Gradient (TD3)\n\n[Addressing Function Approximation Error in Actor-Critic Methods](https://arxiv.org/abs/1802.09477)\n\n### Hyper Parameters\n\n- Update Interval\n- Number of Epochs\n- Number of Samples\n\n### Loss\n\n$$ \\nabla_{\\phi} J(\\phi) = \\frac{1}{N} \\sum \\nabla_{a} Q_{\\theta 1} (s,a) |_{a = \\pi_{\\phi} (s)} \\cdot \\nabla_{\\phi} \\pi_{\\phi} (s) $$\n\nWhere\n- π: Policy Network with parameters 
φ\n- Gradient of first critic w.r.t. actions chosen by the actor\n- Gradient of policy network w.r.t. its own parameters\n\nChain rule applied to loss function\n\n### Network Updates\n\nInitialize Target Networks with parameters from online networks.\n\n$$ \\theta_{i}' \\leftarrow \\tau \\cdot \\theta_{i} + (1 - \\tau) \\cdot \\theta_{i}' $$\n$$ \\phi' \\leftarrow \\tau \\cdot \\phi + (1 - \\tau) \\cdot \\phi' $$\n\nWhere\n- τ ≈ 0.005\n\n_Soft_ update with heavy weight on current target parameters vs. heavily\ndiscounted parameters of online network.\n\nNot every step, only after actor update.\n\n#### Actor\n\n- Randomly sample trajectories from replay buffer (s,a,r,s')\n- Use actor to determine actions for sampled states (don't use actions from memory)\n- Use sampled states and newly found actions to get values from critic\n    + Only the first critic, never the second!\n- Take gradient w.r.t. actor network parameters\n- Every nth step (hyper parameter of algorithm)\n\n#### Critic\n\n- Randomly sample trajectories from replay buffer (s,a,r,s')\n- Run new states through π'(s'), where π' is the target actor\n- Add noise and clip\n\n$$ \\tilde{a} \\leftarrow \\pi_{\\phi'} (s') + \\epsilon $$\n\nwith\n\n$$ \\epsilon \\sim clip(N(0, \\sigma), -c, c) $$\n\nWhere\n- σ ≈ 0.2, noise standard deviation\n- c ≈ 0.5, noise clipping\n- γ ≈ 0.99, discount factor\n\n$$ y \\leftarrow r + \\gamma \\cdot min( Q'_{\\theta 1}(s', \\tilde{a}), Q'_{\\theta 2}(s', \\tilde{a})) $$\n\n$$ \\theta_{i} \\leftarrow argmin_{\\theta i} ( N^{-1} \\cdot \\sum ( y - Q_{\\theta i} (s,a))^{2} ) $$\n\n## Soft Actor Critic (SAC)\n\n[Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https://arxiv.org/abs/1801.01290)\n\n**Note**: Entropy in this case means something like _Randomness of actions_,\nand is modeled by reward scaling.\n\n$$ log( \\pi (a|s) ) = log (\\mu (a|s)) - \\sum^{D}_{i=1} log ( 1 - tanh^{2} (a_{i}) ) $$\n\nWhere\n- μ: Sample of a distribution (**NOT MEAN**)\n- π: 
Probability of selecting this particular action a given state s\n\n### Hyper Parameters\n\n- Target smoothing coefficient\n- target update interval\n- replay buffer size\n- gradient steps\n\n### Actor Update\n\n$$ J = N^{-1} \\cdot \\sum ( log(\\pi (a_{t} | s_{t})) - Q_{min}(s_{t}, a_{t}) ) $$\n\nWhere\n- s_t is sampled from replay buffer / memory\n- a_t is generated with actor network given sampled states\n- Qmin is minimum of 2 critics\n\n### Value Update\n\n$$ J = N^{-1} \\cdot \\sum \\frac{1}{2} \\cdot ( V(s_{t}) - ( Q_{min} (s_{t}, a_{t}) - log(\\pi (a_{t} | s_{t})) ) )^{2} $$\n\nWhere\n- V(s_t): sampled values from memory\n- s_t: sampled states from memory\n- a_t: newly computed actions\n\n### Critic\n\n$$ J_{1} = N^{-1} \\sum \\frac{1}{2} \\cdot ( Q_{1}(s_{t}, a_{t}) - Q'_{1}(s_{t}, a_{t}))^{2} $$\n\n$$ J_{2} = N^{-1} \\sum \\frac{1}{2} \\cdot ( Q_{2}(s_{t}, a_{t}) - Q'_{2}(s_{t}, a_{t}))^{2} $$\n\n$$ Q' = r_{scaled} + \\gamma \\cdot V'(s_{t + 1}) $$\n\nWhere\n- **Both** critics get updated\n- _Both_ actions and states are sampled from memory\n\n### Network Updates\n\n$$ \\psi' \\leftarrow \\tau \\cdot \\psi + (1 - \\tau) \\cdot \\psi' $$\n\nWhere\n- τ ≈ 0.005\n\n## Things TODO\n\n- [X] Implement replay buffer / memory as algebraic data type\n- [X] Include step count in reward\n- [ ] Try Discrete action spaces\n- [ ] Normalize and/or reduce observation space\n- [ ] consider previous reward\n- [X] return trained models instead of loss\n- [ ] handling of `done` for parallel envs\n- [ ] higher reward for finishing earlier\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faugustunderground%2Fgaceir","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faugustunderground%2Fgaceir","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faugustunderground%2Fgaceir/lists"}