## Grasp - Pick-and-place with a robotic hand
You can see the live demo [here](http://mohammadzainabbas.tech/Reinforcement-Learning-CS/).
#
### Table of contents
- [Quickstart](#quickstart)
- [Introduction](#introduction)
- [Physics Simulation Engines](#physics-simulation-engines)
- [Environment](#environment)
  * [`Observations`](#observations)
  * [`Actions`](#actions)
  * [`Reward`](#reward)
- [Algorithms](#algorithms)
  * [`Proximal Policy Optimization (PPO)`](#ppo)
  * [`Evolution Strategy (ES)`](#es)
  * [`Augmented Random Search (ARS)`](#ars)
  * [`Soft Actor-Critic (SAC)`](#sac)
- [Run locally](#run-locally)
#
### 1. Quickstart
Explore the project easily and quickly through the following _colab_ notebooks:
- [`Grasp: Pick-and-place with a robotic hand`](https://colab.research.google.com/github/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/notebooks/demo.ipynb) - this demo notebook compares the first three [algorithms](#algorithms) and trains agents on the `Grasp` environment from `Brax`. At the end, it also shows the trained `PPO` agent interacting with the environment.
- [`Step-by-step training with PPO`](https://colab.research.google.com/github/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/notebooks/demo_ppo_train.ipynb) - this notebook shows step-by-step training of a `PPO` agent on the `Grasp` environment from `Brax`.
#
### 2. Introduction
The field of robotics has seen incredible advancements in recent years, with the development of increasingly sophisticated machines capable of performing a wide range of tasks. One area of particular interest is the ability for robots to manipulate objects in their environment, known as grasping. In this project, we have chosen to focus on a specific grasping task - training a robotic hand to pick up a moving ball object and place it in a specific target location using the [`Brax` physics simulation engine](https://arxiv.org/pdf/2106.13281.pdf).
_Grasp: a robotic hand picks up a moving ball and moves it to a specific target._
The reason for choosing this project is twofold. Firstly, the ability for robots to grasp and manipulate objects is a fundamental skill that is crucial for many real-world applications, such as manufacturing, logistics, and service industries. Secondly, the use of a physics simulation engine allows us to train our robotic hand in a realistic and controlled environment, without the need for expensive hardware and the associated costs and safety concerns.
Reinforcement learning is a powerful tool for training robots to perform complex tasks, as it allows the robot to learn through trial and error. In this project, we will be using reinforcement learning techniques to train our robotic hand, and we hope to demonstrate the effectiveness of this approach in solving the grasping task.
#
### 3. Physics Simulation Engines
The use of a physics simulation engine is essential for training a robotic hand to perform the grasping task, as it allows us to simulate the real-world physical interactions between the robot and the ball. Without a physics simulation engine, it would be difficult to accurately model the dynamics of the task, including the forces and torques required for the robotic hand to pick up the ball and move it to the target location.
In this project, we explored several different physics simulation engines, including:
- [x] [`MuJoCo`](https://mujoco.org/) ([`dm_control`](https://github.com/deepmind/dm_control/), [`Gym`](https://www.gymlibrary.dev/) and [`Gymnasium`](https://gymnasium.farama.org/))
- [x] [`TinyDiffSim`](https://github.com/erwincoumans/tiny-differentiable-simulator)
- [x] [`DiffTaichi`](https://github.com/taichi-dev/difftaichi)
- [x] [`Nimble`](https://github.com/keenon/nimblephysics)
- [x] [`PyBullet`](https://github.com/bulletphysics/bullet3)
- [x] [`Brax`](https://github.com/google/brax/)

Each of these engines has its own strengths and weaknesses, and we carefully considered the trade-offs between them before making a final decision.
Ultimately, we chose to use [`Brax`](https://github.com/google/brax/) due to [_its highly scalable and parallelizable architecture_](https://ai.googleblog.com/2021/07/speeding-up-reinforcement-learning-with.html), which makes it well-suited for accelerated hardware (XLA backends such as `GPUs` and `TPUs`). This allows us to simulate the grasping task at a high level of realism and detail, while also taking advantage of the increased computational power of modern hardware to speed up the training process.
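To give a feel for what this parallelism looks like in practice, here is a minimal sketch of batched simulation with `jax.vmap`. It assumes the legacy `Brax` v1-era API in which the `grasp` task is registered under `brax.envs`; exact names and defaults may differ between `Brax` versions.

```python
# Minimal sketch (assumes the Brax v1-era API; names may differ across versions):
# create the Grasp environment and run many simulations in parallel on an XLA backend.
import jax
from brax import envs

env = envs.create(env_name="grasp")      # legacy Grasp task, as linked above
reset_fn = jax.jit(jax.vmap(env.reset))  # vectorise reset over a batch of RNG keys
step_fn = jax.jit(jax.vmap(env.step))    # vectorise step over batched states/actions

num_envs = 128
keys = jax.random.split(jax.random.PRNGKey(0), num_envs)
states = reset_fn(keys)                  # batch of simulation states

actions = jax.random.uniform(
    jax.random.PRNGKey(1), (num_envs, env.action_size), minval=-1.0, maxval=1.0
)
states = step_fn(states, actions)        # one parallel physics step on CPU/GPU/TPU
```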
#
### 4. Environment
The [grasping environment provided by `Brax`](https://github.com/google/brax/blob/198dee3ac4/brax/envs/grasp.py#L25-L1297) is a simple pick-and-place task, where a 4-fingered claw hand must pick up and move a ball to a target location. The environment is designed to simulate the physical interactions between the robotic hand and the ball, including the forces and torques required for the hand to grasp the ball and move it to the target location.
_The hand picks up the ball and carries it to a series of red targets; once the ball gets close to a red target, the target respawns at a different random location._
In the environment, the robotic hand is represented by a 4-fingered claw, which is capable of opening and closing to grasp the ball. The ball is placed in a random location at the beginning of each episode, and the target location is also randomly chosen. The goal of the robotic hand is to move the ball to the target location as quickly and efficiently as possible. For more details, see section _4.2.2_ of the [`Brax` paper](https://arxiv.org/pdf/2106.13281.pdf).
#
#### 4.1. Observations
The environment observes _three_ main bodies: the `Hand`, the `Object`, and the `Target`. The agent uses these observations to learn how to control the robotic hand and move the object to the target location.
1. The `Hand` observation includes information about the state of the robotic hand, such as the position and orientation of the fingers, the joint angles, and the forces and torques applied to the hand. This information is used by the agent to control the hand and pick up the object.
2. The `Object` observation includes information about the state of the object, such as its position, velocity, and orientation. This information is used by the agent to track the object and move it to the target location.
3. The `Target` observation includes information about the target location, such as its position and orientation. This information is used by the agent to navigate the hand and the object to the target location.
When the object reaches the target location, the agent is rewarded. The agent is also given a penalty if the object falls or if the hand collides with any obstacle. The agent's goal is to maximize the reward, which means reaching the target location as quickly and efficiently as possible.
Overall, the observations provided by the [`Grasp environment`](https://github.com/google/brax/blob/198dee3ac4/brax/envs/grasp.py#L25-L1297) are designed to give the agent the information it needs to learn how to control the robotic hand and move the object to the target location. The combination of the Hand, Object, and Target observations allows the agent to learn from the environment and improve its performance over time.
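As a quick way to see these observations concretely, the hedged sketch below resets the environment and inspects the flat observation vector (again assuming the `Brax` v1-era API; the exact layout of the vector is defined in the linked `grasp.py` source):

```python
# Hedged sketch: inspect the concatenated observation vector of the Grasp task.
import jax
from brax import envs

env = envs.create(env_name="grasp")
state = env.reset(rng=jax.random.PRNGKey(0))

# state.obs packs the Hand, Object and Target information into one flat vector;
# see the linked grasp.py for the exact ordering of the components.
print("observation size:", env.observation_size)
print("observation shape:", state.obs.shape)
```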
#
#### 4.2. Actions
The action has `19` dimensions: the hand's position and the joints' angles, normalized to `[-1, 1]` as _continuous_ values.
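For illustration, here is a hedged sketch of sampling a random action in that range and applying a single environment step (same `Brax` v1-era API assumption as above):

```python
# Hedged sketch: sample a random 19-dimensional continuous action and step once.
import jax
from brax import envs

env = envs.create(env_name="grasp")
state = env.reset(rng=jax.random.PRNGKey(0))

action = jax.random.uniform(
    jax.random.PRNGKey(1), (env.action_size,), minval=-1.0, maxval=1.0
)
state = env.step(state, action)
print("action size:", env.action_size)   # expected to be 19 for Grasp
print("reward after one step:", state.reward)
```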
#
#### 4.3. Reward
The [reward function](https://github.com/google/brax/blob/198dee3ac4/brax/envs/grasp.py#L90-L121) is calculated using the following equation:
```math
\text{reward} = \text{moving to object} + \text{close to object} + \text{touching object} + 5 \times \text{target hit} + \text{moving to target}
```
where,
- $\text{moving to object}$ : small reward for moving towards the object.
- $\text{close to object}$ : small reward for being close to the object.
- $\text{touching object}$ : small reward for touching the object.
- $\text{target hit}$ : high reward for hitting the target (max. reward).
- $\text{moving to target}$ : high reward for moving towards the target.

> each minor step towards completing the task is rewarded, while $\text{target hit}$ gains the biggest reward.
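To make the weighting explicit, here is a purely schematic Python version of how the terms combine; the actual term computations live in the linked `grasp.py` and are not reproduced here.

```python
# Schematic only: each term is assumed to be a pre-computed scalar, mirroring the
# shaping terms of the Grasp reward; the real implementation is in brax/envs/grasp.py.
def shaped_reward(moving_to_object: float,
                  close_to_object: float,
                  touching_object: float,
                  target_hit: float,
                  moving_to_target: float) -> float:
    # every small step towards completing the task is rewarded,
    # while hitting the target carries the largest weight (factor 5)
    return (moving_to_object
            + close_to_object
            + touching_object
            + 5.0 * target_hit
            + moving_to_target)
```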
#
### 5. Algorithms
We use `Brax`'s optimized implementations of the following algorithms: `PPO`, `ES`, `ARS` and `SAC`.
#### 5.1. Proximal Policy Optimization (PPO)
[`Proximal Policy Optimization (PPO)`](https://arxiv.org/abs/1707.06347) is a model-free online policy gradient reinforcement learning algorithm, developed at OpenAI in 2017. `PPO` strikes a balance between ease of implementation, sample complexity, and ease of tuning, trying to compute an update at each step that minimizes the cost function while ensuring the deviation from the previous policy is relatively small. Generally speaking, it is a clipper version [`A2C`](https://huggingface.co/blog/deep-rl-a2c) algorithm.
#### 5.2. Evolution Strategy (ES)
[`Evolution Strategy (ES)`](https://arxiv.org/abs/1703.03864) is a powerful black-box optimization technique inspired by natural evolution. A population of random noise vectors is added to the network parameters, each perturbed parameter vector is evaluated, and the highest-scoring directions are used to evolve the network. This differs from gradient-based methods that backpropagate a loss through the network. `ES` can be parallelized on XLA backends (`CPU`/`GPU`/`TPU`) to speed up training.
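A hedged NumPy sketch of a single ES update in this spirit (illustrative only; it assumes a flat parameter vector, and `fitness_fn` and the hyperparameters are placeholders, not values from this repository):

```python
# Hedged sketch of one Evolution Strategies update (assumes a 1-D parameter vector
# and a user-supplied fitness_fn that returns an episode return).
import numpy as np

def es_update(theta: np.ndarray, fitness_fn, sigma: float = 0.1,
              lr: float = 0.01, population: int = 50, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((population, theta.size))          # one perturbation per member
    returns = np.array([fitness_fn(theta + sigma * eps) for eps in noise])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalise fitness scores
    grad_estimate = noise.T @ returns / (population * sigma)       # score-weighted directions
    return theta + lr * grad_estimate
```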
#### 5.3. Augmented Random Search (ARS)
[`Augmented Random Search (ARS)`](https://arxiv.org/abs/1803.07055) is a random search method for training linear policies for continuous control problems. It operates directly on the policy weights: in each epoch the agent perturbs its current policy `N` times and collects `2N` rollouts using the modified policies (one rollout per perturbation direction and sign). The rewards from these rollouts are used to update the current policy weights, and the process repeats until completion. The algorithm is known to have high variance (not all seeds obtain high rewards), but according to its authors it represents, in many ways, the state of the art on these benchmarks.
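A hedged NumPy sketch of one ARS-style update (basic V1 variant, without top-direction filtering; `rollout_fn` and the hyperparameters are placeholders):

```python
# Hedged sketch of a basic ARS (V1) update: perturb the weights in both directions,
# roll out each perturbed policy, and move along the reward-weighted perturbations.
import numpy as np

def ars_update(weights: np.ndarray, rollout_fn, step_size: float = 0.02,
               noise_std: float = 0.03, n_directions: int = 8, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    deltas = rng.standard_normal((n_directions, *weights.shape))
    r_plus = np.array([rollout_fn(weights + noise_std * d) for d in deltas])
    r_minus = np.array([rollout_fn(weights - noise_std * d) for d in deltas])
    sigma_r = np.concatenate([r_plus, r_minus]).std() + 1e-8       # reward std for scaling
    update = np.tensordot(r_plus - r_minus, deltas, axes=1) / n_directions
    return weights + (step_size / sigma_r) * update
```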
#### 5.4. Soft Actor-Critic (SAC)
[`Soft Actor-Critic (SAC)`](https://arxiv.org/abs/1801.01290) is an off-policy, model-free reinforcement learning framework. The actor aims to maximize expected reward while also maximizing entropy, i.e. to succeed at the task while acting as randomly as possible, which is why it is called _soft_. `SAC` typically has better sample efficiency than `PPO`.
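Concretely, the entropy-regularized objective that `SAC` maximizes can be written as follows, where $\alpha$ is the temperature that trades reward against entropy (formula from the SAC paper, reproduced here for reference):

```math
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```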
#
### 6. Run locally
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
1. Clone the repository
```bash
git clone https://github.com/mohammadzainabbas/Reinforcement-Learning-CS.git
cd Reinforcement-Learning-CS/
```
2. Create a new environment and install all dependencies
First, [install `mamba`](https://mamba.readthedocs.io/en/latest/installation.html), a fast and efficient package manager for `conda`.
```bash
conda install mamba -n base -c conda-forge
```
Then, create a new environment, install all dependencies, and activate it.
```bash
mamba env create -n reinforcement_learning -f docs/config/reinforcement_learning_env.yaml
conda activate reinforcement_learning
```
3. Run the code
[`train_ppo.py`](https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/src/train_ppo.py) - train the reinforcement learning agent using the `PPO` algorithm:
```bash
python src/train_ppo.py
```
You will get the following output files:
* `ppo_training.png` - Training progress plot
* `result_with_ppo.html` - Simulation of the trained agent (in HTML format)
* `ppo_params` - Trained parameters of the agent

[`train_sac.py`](https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/src/train_sac.py) - train the reinforcement learning agent using the `SAC` algorithm:
```bash
python src/train_sac.py
```
> you will get the same output files as with the `PPO` algorithm.
[`generate_results.py`](https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/src/generate_results.py) - generate the results of the trained `PPO` agent:
```bash
python src/generate_results.py
```
You can see the live output [here](http://mohammadzainabbas.tech/Reinforcement-Learning-CS/).
[`ppo_with_pytorch.py`](https://github.com/mohammadzainabbas/Reinforcement-Learning-CS/blob/main/src/ppo_with_pytorch.py) - implementation of the `PPO` algorithm with `PyTorch`.
#