{"id":28231807,"url":"https://github.com/mohammadzainabbas/reinforcement-learning-cs","last_synced_at":"2026-03-05T14:03:18.851Z","repository":{"id":64822661,"uuid":"578456117","full_name":"mohammadzainabbas/Reinforcement-Learning-CS","owner":"mohammadzainabbas","description":"💡 Grasp - Pick-and-place with a robotic hand 👨🏻‍💻","archived":false,"fork":false,"pushed_at":"2023-12-15T11:49:25.000Z","size":19981,"stargazers_count":12,"open_issues_count":1,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-15T13:25:02.435Z","etag":null,"topics":["brax","gym-environment","mamba","model-free-rl","physics-engine","ppo","ppo-agent","ppo-algorithm","python","pytorch","reinforcement-learning","sac"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mohammadzainabbas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-12-15T05:08:53.000Z","updated_at":"2025-05-23T12:33:43.000Z","dependencies_parsed_at":"2023-12-15T12:51:29.473Z","dependency_job_id":null,"html_url":"https://github.com/mohammadzainabbas/Reinforcement-Learning-CS","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/mohammadzainabbas/Reinforcement-Learning-CS","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohammadzainabbas%2FReinforcement-Learning-CS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohammadzainabbas%2FReinforcement-Learning-CS/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohammadzainabbas%2FReinforcement-Learning-CS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohammadzainabbas%2FReinforcement-Learning-CS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mohammadzainabbas","download_url":"https://codeload.github.com/mohammadzainabbas/Reinforcement-Learning-CS/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohammadzainabbas%2FReinforcement-Learning-CS/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30130031,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T12:40:50.676Z","status":"ssl_error","status_checked_at":"2026-03-05T12:39:32.209Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["brax","gym-environment","mamba","model-free-rl","physics-engine","ppo","ppo-agent","ppo-algorithm","python","pytorch","reinforcement-learning","sac"],"created_at":"2025-05-18T19:10:52.834Z","updated_at":"2026-03-05T14:03:18.834Z","avatar_url":"https://github.com/mohammadzainabbas.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## 💡 Grasp - Pick-and-place with a robotic hand 👨🏻‍💻\n\nYou can see the live 
demo [here](http://mohammadzainabbas.tech/Reinforcement-Learning-CS/).\n\n#\n\n### Table of contents\n\n- [🚀 Quickstart 💻](#quickstart)\n- [💻 Introduction 👨🏻‍💻](#introduction)\n- [🌊 Physics Simulation Engines 🦿](#physics-simulation-engines)\n- [🌪 Environment 🦾](#environment)\n\t* [🔭 `Observations` 🔍](#observations)\n\t* [🏄‍♂️ `Actions` 🤸‍♂️](#actions)\n\t* [🏆 `Reward` 🥇](#reward)\n- [🔬 Algorithms 💻](#algorithms)\n\t* [💡 `Proximal policy optimization (PPO)` 👨🏻‍💻](#ppo)\n\t* [💡 `Evolution Strategy (ES)` 👨🏻‍💻](#es)\n\t* [💡 `Augmented Random Search (ARS)` 👨🏻‍💻](#ars)\n\t* [💡 `Soft Actor-Critic (SAC)` 👨🏻‍💻](#sac)\n- [🚀 Run locally 🖲️](#run-locally)\n\n#\n\n\u003ca id=\"quickstart\" /\u003e\n\n### 1. 🚀 Quickstart 💻\n\nExplore the project quickly through the following _Colab_ notebooks:\n\n- [`Grasp: Pick-and-place with a robotic hand`](https://colab.research.google.com/github/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/notebooks/demo.ipynb) - this demo notebook compares the first three [algorithms](#algorithms) by training agents on the `Grasp` environment from `Brax`. At the end, it also shows the trained `PPO agent` interacting with the environment.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/docs/assets/compare_algorithms.jpeg?raw=true\" width=\"auto\" height=\"225\"\u003e\n\u003c/p\u003e\n\n- [`Step-by-step training with PPO`](https://colab.research.google.com/github/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/notebooks/demo_ppo_train.ipynb) - this notebook walks through training a `PPO agent` on the `Grasp` environment from `Brax`, step by step.\n\n#\n\n\u003ca id=\"introduction\" /\u003e\n\n### 2. 💻 Introduction 👨🏻‍💻\n\nThe field of robotics has seen incredible advancements in recent years, with the development of increasingly sophisticated machines capable of performing a wide range of tasks. 
One area of particular interest is grasping: the ability of robots to manipulate objects in their environment. In this project, we focus on a specific grasping task: training a robotic hand to pick up a moving ball and place it at a specific target location using the [`Brax` physics simulation engine](https://arxiv.org/pdf/2106.13281.pdf).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/docs/assets/figure_1.jpeg?raw=true\" width=\"500\" height=\"300\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eGrasp: a robotic hand that picks up a moving ball and moves it to a specific target\u003c/p\u003e\n\nWe chose this project for two reasons. First, the ability of robots to grasp and manipulate objects is a fundamental skill for many real-world applications, such as manufacturing, logistics, and service industries. Second, a physics simulation engine lets us train our robotic hand in a realistic, controlled environment, without the cost and safety concerns of real hardware.\n\nReinforcement learning is a powerful tool for training robots to perform complex tasks, as it allows the robot to learn through trial and error. In this project, we use reinforcement learning to train our robotic hand, and we aim to demonstrate the effectiveness of this approach on the grasping task.\n\n#\n\n\u003ca id=\"physics-simulation-engines\" /\u003e\n\n### 3. 🌊 Physics Simulation Engines 🦿\n\nA physics simulation engine is essential for training a robotic hand to perform the grasping task, as it allows us to simulate the real-world physical interactions between the robot and the ball. 
Without a physics simulation engine, it would be difficult to accurately model the dynamics of the task, including the forces and torques required for the robotic hand to pick up the ball and move it to the target location.\n\nIn this project, we explored several physics simulation engines, including:\n\n- [x] [`MuJoCo`](https://mujoco.org/) ([`dm_control`](https://github.com/deepmind/dm_control/), [`Gym`](https://www.gymlibrary.dev/) and [`Gymnasium`](https://gymnasium.farama.org/))\n- [x] [`TinyDiffSim`](https://github.com/erwincoumans/tiny-differentiable-simulator)\n- [x] [`DiffTaichi`](https://github.com/taichi-dev/difftaichi)\n- [x] [`Nimble`](https://github.com/keenon/nimblephysics)\n- [x] [`PyBullet`](https://github.com/bulletphysics/bullet3)\n- [x] [`Brax`](https://github.com/google/brax/)\n\nEach of these engines has its own strengths and weaknesses, and we weighed the trade-offs between them before making a final decision.\n\nUltimately, we chose [`Brax`](https://github.com/google/brax/) because of [_its highly scalable and parallelizable architecture_](https://ai.googleblog.com/2021/07/speeding-up-reinforcement-learning-with.html), which makes it well suited to hardware accelerators (XLA backends such as `GPUs` and `TPUs`). This lets us simulate the grasping task with realistic rigid-body physics while taking advantage of the computational power of modern hardware to speed up training.\n\n#\n\n\u003ca id=\"environment\" /\u003e\n\n### 4. 🌪 Environment 🦾\n\nThe [grasping environment provided by `Brax`](https://github.com/google/brax/blob/198dee3ac4/brax/envs/grasp.py#L25-L1297) is a simple pick-and-place task in which a 4-fingered claw hand must pick up a ball and move it to a target location. 
The environment simulates the physical interactions between the robotic hand and the ball, including the forces and torques required for the hand to grasp the ball and move it to the target location.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/docs/assets/figure_2.jpeg?raw=true\" width=\"500\" height=\"300\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eThe hand picks up the ball and carries it to a series of red targets. Once the ball gets close to the red target, the target is respawned at a different random location.\u003c/p\u003e\n\nIn the environment, the robotic hand is represented by a 4-fingered claw that can open and close to grasp the ball. The ball is placed at a random location at the beginning of each episode, and the target location is also chosen at random. The goal of the robotic hand is to move the ball to the target location as quickly and efficiently as possible. For more details, see Section [_4.2.2_](https://arxiv.org/pdf/2106.13281.pdf) of the `Brax` paper.\n\n#\n\n\u003ca id=\"observations\" /\u003e\n\n#### 4.1. 🔭 Observations 🔍\n\nThe observations cover _three_ main bodies: the `Hand`, the `Object`, and the `Target`. The agent uses these observations to learn how to control the robotic hand and move the object to the target location.\n\n1. The `Hand` observation includes information about the state of the robotic hand, such as the position and orientation of the fingers, the joint angles, and the forces and torques applied to the hand. The agent uses this information to control the hand and pick up the object.\n\n2. The `Object` observation includes information about the state of the object, such as its position, velocity, and orientation. The agent uses this information to track the object and move it to the target location.\n\n3. 
The `Target` observation includes information about the target location, such as its position and orientation. The agent uses this information to guide the hand and the object to the target location.\n\nWhen the object reaches the target location, the agent is rewarded. The agent is also penalized if the object falls or if the hand collides with an obstacle. The agent's goal is to maximize the reward, which means reaching the target location as quickly and efficiently as possible.\n\nOverall, the observations provided by the [`Grasp environment`](https://github.com/google/brax/blob/198dee3ac4/brax/envs/grasp.py#L25-L1297) are designed to give the agent the information it needs to learn how to control the robotic hand and move the object to the target location. Together, the `Hand`, `Object`, and `Target` observations allow the agent to learn from the environment and improve its performance over time.\n\n#\n\n\u003ca id=\"actions\" /\u003e\n\n#### 4.2. 🏄‍♂️ Actions 🤸‍♂️\n\nThe action space is continuous and `19`-dimensional: it encodes the hand’s position and the joints’ angles, with every component normalized to `[-1, 1]`.\n\n#\n\n\u003ca id=\"reward\" /\u003e\n\n#### 4.3. 🏆 Reward 🥇\n\nThe [reward function](https://github.com/google/brax/blob/198dee3ac4/brax/envs/grasp.py#L90-L121) is calculated using the following equation:\n\n```math\n\\text{reward} = \\text{moving to object} + \\text{close to object} + \\text{touching object} + 5 \\cdot \\text{target hit} + \\text{moving to target}\n```\n\nwhere:\n\n```math\n\\text{moving to object} : \\text{small reward for moving towards the object.}\n```\n\n```math\n\\text{close to object} : \\text{small reward for being close to the object.}\n```\n\n```math\n\\text{touching object} : \\text{small reward for touching the object.}\n```\n\n```math\n\\text{target hit} : \\text{high reward for hitting the target (max. 
reward).}\n```\n\n```math\n\\text{moving to target} : \\text{high reward for moving towards the target.}\n```\n\n\u003e Each small step toward completing the task is rewarded, while $\\text{target hit}$ earns the largest reward.\n\n#\n\n\u003ca id=\"algorithms\" /\u003e\n\n### 5. 🔬 Algorithms 💻\n\nWe use the optimized implementations of four algorithms provided by `Brax`: `PPO`, `ES`, `ARS`, and `SAC`.\n\n\u003ca id=\"ppo\" /\u003e\n\n#### 5.1. 💡 Proximal policy optimization (PPO) 👨🏻‍💻\n\n[`Proximal Policy Optimization (PPO)`](https://arxiv.org/abs/1707.06347) is a model-free, on-policy policy gradient reinforcement learning algorithm developed at OpenAI in 2017. `PPO` strikes a balance between ease of implementation, sample complexity, and ease of tuning: at each step it computes an update that minimizes the cost function while keeping the deviation from the previous policy relatively small. Broadly speaking, it is a variant of the [`A2C`](https://huggingface.co/blog/deep-rl-a2c) algorithm with a clipped surrogate objective.\n\n\u003ca id=\"es\" /\u003e\n\n#### 5.2. 💡 Evolution Strategy (ES) 👨🏻‍💻\n\n[`Evolution Strategy (ES)`](https://arxiv.org/abs/1703.03864) is a powerful black-box optimization technique inspired by natural evolution. A population of random noise perturbations of the network parameters is evaluated, and the highest-scoring parameter vectors are used to evolve the network. Unlike gradient-based training, it does not backpropagate a loss function through the network. `ES` can be parallelized on XLA backends (`CPU`/`GPU`/`TPU`) to speed up training.\n\n\u003ca id=\"ars\" /\u003e\n\n#### 5.3. 💡 Augmented Random Search (ARS) 👨🏻‍💻\n\n[`Augmented Random Search (ARS)`](https://arxiv.org/abs/1803.07055) is a random search method for training linear policies for continuous control problems. 
It operates directly on the policy weights: in each epoch, the agent perturbs its current policy `N` times and collects `2N` rollouts using the modified policies (one for each positive and one for each negative perturbation). 
The rewards from these rollouts are used to update the current policy weights, and the process repeats until training is complete. The algorithm is known to have high variance: not all random seeds obtain high rewards, but, to our knowledge, ARS in many ways represents the state of the art on the benchmarks reported in its paper.\n\n\u003ca id=\"sac\" /\u003e\n\n#### 5.4. 💡 Soft Actor-Critic (SAC) 👨🏻‍💻\n\n[`Soft Actor-Critic (SAC)`](https://arxiv.org/abs/1801.01290) is an off-policy, model-free reinforcement learning algorithm. The actor aims to maximize expected reward while also maximizing entropy, that is, to succeed at the task while acting as randomly as possible; this entropy term is why it is called _soft_. Because it is off-policy and can reuse past experience, `SAC` typically has better sample efficiency than `PPO`.\n\n#\n\n\u003ca id=\"run-locally\" /\u003e\n\n### 6. 🚀 Run locally 🖲️\n\nThese instructions will get you a copy of the project up and running on your local machine for development and testing purposes.\n\n1. Clone the repository\n\n```bash\ngit clone https://github.com/mohammadzainabbas/Reinforcement-Learning-CS.git\ncd Reinforcement-Learning-CS/\n```\n\n2. Create a new environment and install all dependencies\n\nFirst, [install `mamba`](https://mamba.readthedocs.io/en/latest/installation.html), a fast and efficient package manager for `conda`.\n\n```bash\nconda install mamba -n base -c conda-forge\n```\n\nThen, create a new environment with all dependencies and activate it:\n\n```bash\nmamba env create -n reinforcement_learning -f docs/config/reinforcement_learning_env.yaml\nconda activate reinforcement_learning\n```\n\n3. 
Run the code\n\n[`train_ppo.py`](https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/src/train_ppo.py) - train the reinforcement learning agent using the `PPO` algorithm:\n\n```bash\npython src/train_ppo.py\n```\n\nYou will get the following output files:\n\n* `ppo_training.png` - Training progress plot\n* `result_with_ppo.html` - Simulation of the trained agent (in HTML format)\n* `ppo_params` - Trained parameters of the agent\n\n[`train_sac.py`](https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/src/train_sac.py) - train the reinforcement learning agent using the `SAC` algorithm:\n\n```bash\npython src/train_sac.py\n```\n\n\u003e You will get the same set of output files as with the `PPO` algorithm.\n\n[`generate_results.py`](https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/src/generate_results.py) - generate the results of the trained `PPO` agent:\n\n```bash\npython src/generate_results.py\n```\n\nYou can see the live output [here](http://mohammadzainabbas.tech/Reinforcement-Learning-CS/).\n\n[`ppo_with_pytorch.py`](https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/src/ppo_with_pytorch.py) - an implementation of the `PPO` algorithm in `PyTorch`.\n\n#\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohammadzainabbas%2Freinforcement-learning-cs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmohammadzainabbas%2Freinforcement-learning-cs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohammadzainabbas%2Freinforcement-learning-cs/lists"}