https://github.com/opendrivelab/univla
[RSS 2025] Learning to Act Anywhere with Task-centric Latent Actions
- Host: GitHub
- URL: https://github.com/opendrivelab/univla
- Owner: OpenDriveLab
- License: apache-2.0
- Created: 2025-04-23T06:48:48.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-05-31T14:46:53.000Z (8 months ago)
- Topics: robot-learning, vla
- Language: Python
- Homepage: https://arxiv.org/abs/2505.06111
- Size: 2.94 MB
- Stars: 381
- Watchers: 6
- Forks: 16
- Open Issues: 6
Metadata Files:
- Readme: README.md
# :earth_asia: UniVLA
> #### :page_facing_up: [Paper](https://arxiv.org/pdf/2505.06111) | :rocket: Demo Page (Coming Soon)
> :black_nib: Qingwen Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, H. Li \
> :e-mail: Primary Contact: Qingwen Bu (buqingwen@opendrivelab.com)
### :fire: Highlights
- A recipe towards generalist policy by planning in a unified, embodiment-agnostic action space.
- A novel approach for extracting task-centric latent actions from cross-embodiment videos.
- A VLA that achieves state-of-the-art results on multiple benchmarks with compute-efficient training.
## Table of Contents
- [:movie_camera: Demo](#movie_camera-demo)
- [:loudspeaker: News](#loudspeaker-news)
- [:pushpin: TODO List](#pushpin-todo-list)
- [🤗 Model Zoo](#ckpts)
- [:video_game: Getting Started](#installation)
- [:fire: Training Recipe](#fire-training-recipe)
- [Task-centric Latent Action Learning](#one-task-centric-latent-action-learning)
- [Pretraining of Generalist Policy](#two-pretraining-of-generalist-policy)
- [Post-training for Deployment & Evaluations](#three-post-training-for-deployment--evaluations)
- [Real-world Experiment](#mechanical_arm-real-world-experiment)
- [LIBERO Benchmark](#1-libero)
- [:rocket: UniVLA's Performance](#rocket-univlas-performance)
- [:pencil: Citation](#pencil-citation)
## :movie_camera: Demo
Real-world robot experiments (demo videos, 1x speed):
- Store the screwdriver
- Clean the cutting board
- Fold towel twice
- Stack the Tower of Hanoi
## :loudspeaker: News
- **[2025/05]** The code of UniVLA v1.0 is released. Please check it out!
## :pushpin: TODO list
#### 1. 🤗 Checkpoints Release
- [x] 1) Latent action model
- [x] 2) Pre-trained Models
- [x] *Full (Manip. + Navi. + Human)*
- [x] *BridgeV2-Only*
- [x] *Human-Only*
- [x] 3) Downstream Fine-tuned Models
- [x] *LIBERO*
- [ ] *Room2Room*
- [ ] *CALVIN*
- [ ] *SimplerEnv*
#### 2. 💪 Training and Evaluation Codes on Simulation Benchmarks
- [x] **1) LIBERO**
- [ ] **2) Room2Room**
- [ ] **3) CALVIN**
- [ ] **4) SimplerEnv**
#### 3. :dizzy: Codes and Guidelines for Real-world Deployment
- [x] Codes and Docs
#### 4. :information_desk_person: Scripts for Pre-processing Human Dataset
- [ ] Codes for converting Ego4D into RLDS format
## 🤗 Model Zoo <a id="ckpts"></a>

| Model Name | Backbone | HF Path | Note |
|---|---|---|---|
| lam-stage-1 | - | univla-latent-action-model | The stage-1 latent action model trained on OpenX and Ego4D. |
| lam-stage-2 | - | univla-latent-action-model | The stage-2 latent action model trained on OpenX and Ego4D. (Generates task-centric latent actions.) |
| univla-7b | TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b | univla-7b | UniVLA pretrained on our full data collection (Manip. + Navi. + Human). |
| univla-7b-bridge-pt | TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b | univla-7b-bridge-pt | UniVLA pretrained only on BridgeV2 data. |
| univla-7b-human-pt | TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b | univla-7b-human-pt | UniVLA pretrained only on Ego4D human videos. |
| univla-7b-224-sft-libero | univla-7b | univla-7b-224-sft-libero | Finetuned on the LIBERO dataset. |
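To fetch any of these checkpoints programmatically, something like the following sketch with ```huggingface_hub``` should work; note that ```<hf-org>/univla-7b``` is a placeholder, so substitute the actual HF path from the table above.

```python
# Minimal sketch: download a checkpoint from the Hugging Face Hub.
# "<hf-org>/univla-7b" is a placeholder -- replace it with the actual HF path
# listed in the Model Zoo table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<hf-org>/univla-7b",           # placeholder repo id
    local_dir="./checkpoints/univla-7b",    # where the files will be stored
)
print(f"Checkpoint downloaded to: {local_dir}")
```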
## :video_game: Getting Started
1. (Optional) We use conda to manage the environment.
```bash
conda create -n univla python=3.10 -y
conda activate univla
```
2. Install dependencies.
```bash
# Install pytorch
# See https://pytorch.org/get-started/previous-versions/ for the command matching your CUDA version
# Our experiments were conducted with torch 2.2.0 + CUDA 12.1
pip install torch torchvision
# Clone our repo and pip install it to pull in the dependencies
git clone git@github.com:OpenDriveLab/UniVLA.git
cd UniVLA
pip install -e .
# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install packaging ninja
ninja --version; echo $? # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
```
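Optionally, a quick sanity check like the sketch below (assuming the install steps above succeeded) can confirm that PyTorch sees your GPU and that FlashAttention imports correctly:

```python
# Optional environment sanity check after installation.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed -- training may fail or fall back, "
          "depending on the attention implementation selected in the config.")
```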
## :fire: Training Recipe
### :one: Task-centric Latent Action Learning
> We highly recommend directly using our pre-trained latent action model checkpoints to save time and compute.
> [!NOTE]
> Our latent action model is trained on a comprehensive data collection, encompassing multiple robotic manipulation and navigation datasets from Open X-Embodiment, along with a curated subset of the Ego4D dataset (detailed data construction procedures are provided in the appendix of our [paper](https://www.roboticsproceedings.org/rss21/p014.pdf)).
>
> To adapt the model to additional datasets or custom data sources, users may refer to ```./prismatic/vla/datasets/rlds/oxe/mixtures.py``` to either utilize predefined data mixtures or define new ones. Subsequently, the ```data_mix``` parameter in the [configuration file](https://github.com/OpenDriveLab/UniVLA/blob/aab94fdf98221a19c0c9a114c921f069ed449265/latent_action_model/config/lam-stage-1.yaml#L27) should be updated accordingly.
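As a rough, hypothetical illustration (the authoritative schema lives in ```./prismatic/vla/datasets/rlds/oxe/mixtures.py``` and may differ from this sketch), a data mixture is essentially a named list of (dataset_name, sampling_weight) pairs:

```python
# Hypothetical sketch of a custom data mixture, modeled on the
# name -> [(dataset, sampling_weight), ...] registries used by OXE-style codebases.
# Check ./prismatic/vla/datasets/rlds/oxe/mixtures.py for the actual schema.
MY_CUSTOM_MIXTURE = [
    ("bridge_orig", 1.0),        # BridgeV2 manipulation data
    ("my_robot_dataset", 2.0),   # your own RLDS-converted dataset, up-weighted 2x
]

# After registering the mixture, point `data_mix` in
# latent_action_model/config/lam-stage-1.yaml at its name.
```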
The latent action model is implemented based on [VQ-VAE](https://arxiv.org/abs/1711.00937).
We train the latent action model on a data collection comprising robot manipulation, navigation, and human videos. In stage-1 training, we use an overall batch size of 512 and 100k optimization steps to construct the task-irrelevant latent actions:
```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
--config config/lam-stage-1.yaml \
2>&1 | tee lam-stage-1.log
```
Stage-2 training then learns task-centric latent actions on top of the stage-1 results. Please set ```stage_one_ckpt``` in ```latent_action_model/config/lam-stage-2.yaml``` to the local path of your stage-1 checkpoint, then launch training with:
```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
--config config/lam-stage-2.yaml \
2>&1 | tee lam-stage-2.log
```
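For intuition about the VQ-VAE machinery mentioned above: the latent action model encodes the change between two observations into continuous latents that are snapped to a small discrete codebook. The following is a simplified, hypothetical PyTorch sketch of the quantization step only; the actual implementation lives in ```./latent_action_model/``` and differs in architecture and losses.

```python
# Simplified, hypothetical sketch of VQ-style latent action quantization.
import torch
import torch.nn as nn


class LatentActionQuantizer(nn.Module):
    """Vector-quantization step of a VQ-VAE-style latent action model."""

    def __init__(self, codebook_size: int = 16, code_dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_latent_actions, code_dim) continuous latents produced
        # by an encoder from an observation pair (o_t, o_{t+k}).
        # Squared L2 distance from each latent to every codebook entry.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, N, K)
        indices = dists.argmin(dim=-1)      # discrete latent-action ids, (B, N)
        z_q = self.codebook(indices)        # quantized latents, (B, N, code_dim)
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, indices


# The discrete `indices` are what the policy is later trained to predict as
# latent-action tokens (see the pre-training section below).
quantizer = LatentActionQuantizer()
z_q, idx = quantizer(torch.randn(2, 4, 128))
print(idx.shape)  # torch.Size([2, 4])
```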
### :two: Pretraining of Generalist Policy
- **Latent Action Pseudo-Labeling for Policy Optimization:** The trained latent action model is employed to generate pseudo-labels for policy optimization via a next-token prediction objective. Specifically, the indices of inferred latent actions in the VQ-VAE codebook are mapped to dedicated tokens in the LLaMA tokenizer, denoted as ```{ACT_0, ACT_1, ..., ACT_C}``` (an illustrative sketch of this mapping follows the launch script below).
- **Cost-effective Pre-Training:** The full-scale pre-training procedure, incorporating both OpenX and Ego4D datasets, was performed using a 32-GPU A100 cluster over 20,000 optimization steps. This training regimen required approximately 960 A100 GPU-hours, representing just 5% of the computational resources utilized by OpenVLA. Furthermore, experiments conducted on the 'Bridge' and 'Human' subsets demanded only 200 GPU-hours, demonstrating substantially reduced computational requirements compared to previous vision-language-action models.
- To initiate pre-training, please refer to the following script or simply run ```bash ./vla-scripts/train.sh```:
> [!NOTE]
> For pretraining UniVLA only on BridgeV2 or Human (Ego4D) data, please modify ```vla.type``` to ```prism-dinosiglip-224px+mx-bridge(human)``` correspondingly. Detailed setups can be found in ```./prismatic/conf/vla.py```.
```bash
### Experiment on a 32-GPU cluster
GPUS_PER_NODE=8
NNODES=4
MASTER_PORT=${MASTER_PORT:-28596}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
RANK=${RANK:-0}
# Run your training script with torchrun
torchrun --nproc_per_node ${GPUS_PER_NODE} --nnodes ${NNODES} --node_rank ${RANK} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} train.py \
--vla.type prism-dinosiglip-224px+mx-oxe-magic-soup-plus \
--run_root_dir "vla_log" \
```
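To make the latent-action pseudo-labeling above concrete, here is a hedged, illustrative sketch (not the repository's actual code) of how VQ codebook indices could be mapped to dedicated ```ACT_*``` tokens in a LLaMA tokenizer, so the policy can be trained with next-token prediction:

```python
# Hypothetical sketch: mapping VQ codebook indices to dedicated action tokens.
# The actual token handling in UniVLA may differ -- see ./prismatic/ for details.
from transformers import AutoTokenizer

CODEBOOK_SIZE = 16  # placeholder; use the size of your trained latent action codebook

# Gated repo; any LLaMA-family tokenizer works for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
action_tokens = [f"<ACT_{i}>" for i in range(CODEBOOK_SIZE)]
tokenizer.add_special_tokens({"additional_special_tokens": action_tokens})
# Remember to resize the LLM's embedding matrix after adding tokens:
# model.resize_token_embeddings(len(tokenizer))


def latent_indices_to_text(indices):
    """Turn a sequence of codebook indices (e.g. from the latent action model)
    into the token string used as the next-token-prediction target."""
    return "".join(f"<ACT_{i}>" for i in indices)


print(latent_indices_to_text([3, 0, 7]))  # "<ACT_3><ACT_0><ACT_7>"
```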
### :three: Post-training for Deployment & Evaluations
- With the generalist policy pretrained to plan over an embodiment-agnostic action space, we then add embodiment-specific action decoder heads for downstream deployment.
- Our action decoder is extremely lightweight, with only around 12M parameters. Using parameter-efficient fine-tuning with LoRA rank 32, the total number of trainable parameters is around 123M.
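As an illustration of this post-training parameterization (module names, dimensions, and the action-chunk size below are placeholders, not the repository's exact values), one could combine a rank-32 LoRA adapter from ```peft``` with a small MLP decoder head:

```python
# Hypothetical sketch of the post-training setup: LoRA (rank 32) on the VLA
# backbone plus a small embodiment-specific MLP action decoder.
import torch.nn as nn
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical LLaMA proj layers
    lora_dropout=0.05,
)
# vla_backbone = ...  # the pretrained UniVLA policy (loading omitted here)
# vla_backbone = get_peft_model(vla_backbone, lora_cfg)


class ActionDecoderHead(nn.Module):
    """Lightweight head mapping latent-action embeddings to robot actions."""

    def __init__(self, hidden_dim: int = 4096, action_dim: int = 7, chunk: int = 12):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.GELU(),
            nn.Linear(512, action_dim * chunk),  # predicts an action chunk
        )

    def forward(self, latent_action_embeddings):
        # latent_action_embeddings: (batch, hidden_dim), pooled from the VLA's
        # latent-action token hidden states.
        return self.mlp(latent_action_embeddings)
```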
#### :mechanical_arm: Real-world Experiment
> Our guidelines are based on real-device testing conducted on the AgiLex platform. If you have code deployed on other platforms or in different data formats, we welcome pull requests!
We provide a simple [guideline](https://github.com/OpenDriveLab/UniVLA/blob/3daa7e9a8f4ca92fdee960f8d6be73508344e81d/docs/real-world-deployment.md) to deploy UniVLA on your customized setups.
#### 1) LIBERO
> Please first download the [LIBERO datasets](https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main) that we used in our experiments.
Start training with ```torchrun```:
1) You should first set the pretrained UniVLA and latent action model path in ```vla_path``` and ```lam_path``` of the [training config](https://github.com/OpenDriveLab/UniVLA/blob/b502b3eddc05fef9984d34932a41c96e5a9f21a3/vla-scripts/finetune_libero.py#L107).
2) Set your local LIBERO dataset path in [```data_root_dir```](https://github.com/OpenDriveLab/UniVLA/blob/b502b3eddc05fef9984d34932a41c96e5a9f21a3/vla-scripts/finetune_libero.py#L110).
3) You can choose ```dataset_name``` from ```libero_spatial_no_noops```, ```libero_object_no_noops```, ```libero_goal_no_noops```, and ```libero_10_no_noops```.
> We trained on *'Spatial'*, *'Object'* and *'Goal'* for 30k steps and *'Long'* for 40k steps. Please modify ```max_steps``` in the training config accordingly to reproduce our results.
```bash
# Start training on LIBERO-10(long) with 8 GPUs
torchrun --standalone --nnodes 1 --nproc-per-node 8 finetune_libero.py \
--dataset_name "libero_10_no_noops" \
--run_root_dir "libero_log" \
```
Once you have finished training and obtained the action decoder and VLA backbone, you can start evaluation with:
```bash
# Start evaluation on LIBERO-10
# [Optional] Install LIBERO dependencies
pip install -r experiments/robot/libero/libero_requirements.txt

# By default, we test 50 rollouts per task, totalling 500 independent trials.
# --task_suite_name: choose from [libero_spatial, libero_object, libero_goal, libero_10]
# --save_video: whether to save rollout videos
python experiments/robot/libero/run_libero_eval.py \
--task_suite_name libero_10 \
--action_decoder_path /path/to/your/action_decoder_path.pt \
--pretrained_checkpoint /path/to/your/libero_10_finetuned_univla \
--save_video False \
--num_trials_per_task 50 \
--seed 7
```
> To be updated.
## :rocket: UniVLA's Performance
> [!NOTE]
> LIBERO Simulation Benchmark Results. Each cell reports the success rate, SR (↑), with the method's rank (↓) in parentheses.

| Model | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Average |
|---|---|---|---|---|---|
| Diffusion Policy | 78.3 ± 1.1% (5) | 92.5 ± 0.7% (2) | 68.3 ± 1.2% (5) | 50.5 ± 1.3% (5) | 72.4 ± 0.7% (5) |
| Octo | 78.9 ± 1.0% (4) | 85.7 ± 0.9% (4) | 84.6 ± 0.9% (2) | 51.1 ± 1.3% (4) | 75.1 ± 0.6% (3) |
| OpenVLA | 84.7 ± 0.9% (2) | 88.4 ± 0.8% (3) | 79.2 ± 1.0% (3) | 53.7 ± 1.3% (3) | 76.5 ± 0.6% (2) |
| TraceVLA | 84.6 ± 0.2% (3) | 85.2 ± 0.4% (5) | 75.1 ± 0.3% (4) | 54.1 ± 1.0% (2) | 74.8 ± 0.5% (4) |
| UniVLA (Ours) | 96.5 ± 0.5% (1) | 96.8 ± 0.5% (1) | 95.6 ± 0.4% (1) | 92.0 ± 1.0% (1) | 95.2 ± 0.3% (1) |
> [!NOTE]
> LIBERO Results with Limited Data. (Models are trained with 10%, 20%, 50%, and the full dataset)
| Model | LIBERO-Goal (10%) | LIBERO-Goal (20%) | LIBERO-Goal (50%) | LIBERO-Goal (100%) | LIBERO-Long (10%) | LIBERO-Long (20%) | LIBERO-Long (50%) | LIBERO-Long (100%) |
|---|---|---|---|---|---|---|---|---|
| ATM | 64.3% | 77.1% | - | - | 36.5% | 39.1% | - | - |
| OpenVLA | 61.4% | 66.0% | 77.0% | 79.2% | 11.6% | 22.4% | 36.6% | 53.7% |
| OpenVLA-OFT | 76.8% | 88.2% | 91.1% | 96.2% | 43.0% | 62.2% | 77.8% | 90.7% |
| UniVLA (Ours) | 86.3% | 90.4% | 93.1% | 95.6% | 62.4% | 71.4% | 87.0% | 92.0% |
> [!NOTE]
> Real-world Experiments. (Result figures omitted here; see the paper for detailed real-world results.)
## :pencil: Citation
If you find our code or models useful in your work, please cite [our paper](https://arxiv.org/pdf/2505.06111):
```bibtex
@article{bu2025univla,
title={UniVLA: Learning to Act Anywhere with Task-centric Latent Actions},
author={Qingwen Bu and Yanting Yang and Jisong Cai and Shenyuan Gao and Guanghui Ren and Maoqing Yao and Ping Luo and Hongyang Li},
journal={arXiv preprint arXiv:2505.06111},
year={2025}
}
```
## Acknowledgements
We thank [OpenVLA](https://github.com/openvla/openvla) for their open-sourced work!