# Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding

[![HF Models](https://img.shields.io/badge/%F0%9F%A4%97-Models-yellow)](https://huggingface.co/oier-mees/FuSe)
[![HF Dataset](https://img.shields.io/badge/%F0%9F%A4%97-Dataset-yellow)](https://huggingface.co/datasets/oier-mees/FuSe)
[![Python](https://img.shields.io/badge/python-3.10-blue)](https://www.python.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Static Badge](https://img.shields.io/badge/Project-Page-a)](https://fuse-model.github.io/)

[Joshua Jones](https://www.linkedin.com/in/joshua-w-jones/), [Oier Mees](https://www.oiermees.com/), [Carmelo Sferrazza](https://sferrazza.cc/), [Kyle Stachowicz](https://kylesta.ch/), [Pieter Abbeel](https://people.eecs.berkeley.edu/~pabbeel/), [Sergey Levine](https://people.eecs.berkeley.edu/~svlevine/)


This repo contains code to **Fu**se heterogeneous **Se**nsory (FuSe) data, such as touch sensing or audio, into generalist robot policies via language grounding. We release both a dataset of 26,866 robot trajectories collected with heterogeneous sensory modalities and checkpoints for our two main models: Octo, a large diffusion-based transformer model, and a 3B VLA based on PaliGemma.
Our code is built on top of the [Octo](https://github.com/octo-models/octo) and [PaliVLA](https://github.com/kylestach/bigvision-palivla) codebases.

![FuSe model](media/teaser.jpg)

## Get Started
Install PaliVLA:
```bash
cd palivla_digit
uv venv
source .venv/bin/activate
uv sync --extra gpu  # or: uv sync --extra tpu
uv pip install -e ../octo_digit --no-deps
uv pip install -e ../bridge_with_digit/widowx_envs
uv pip install -e .
```

Install Octo:
```bash
cd octo_digit
uv venv
source .venv/bin/activate
uv sync --extra gpu  # or: uv sync --extra tpu
uv pip install -e ../bridge_with_digit/widowx_envs
uv pip install -e .
```

## Dataset Download
We provide a dataset containing 26,866 trajectories collected on a WidowX robot at the RAIL lab @ UC Berkeley, USA. It contains visual, tactile, audio, and action data collected across several environments, annotated with natural language.
You can download the dataset from the following [HuggingFace dataset](https://huggingface.co/datasets/oier-mees/FuSe).
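As a minimal sketch, individual files in the dataset repo can be fetched directly over HTTPS via Hugging Face's `resolve` URL scheme (the `README.md` path below is just an illustrative example; browse the repo for the actual file names):

```python
from urllib.parse import quote

HF_BASE = "https://huggingface.co/datasets"
REPO_ID = "oier-mees/FuSe"

def hf_dataset_url(path_in_repo: str, revision: str = "main") -> str:
    # Hugging Face serves raw files at <base>/<repo>/resolve/<revision>/<path>.
    return f"{HF_BASE}/{REPO_ID}/resolve/{quote(revision)}/{quote(path_in_repo)}"

print(hf_dataset_url("README.md"))
# https://huggingface.co/datasets/oier-mees/FuSe/resolve/main/README.md
```

For a full local copy, `huggingface_hub.snapshot_download(repo_id="oier-mees/FuSe", repo_type="dataset")` is the usual route.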

## Model Training
For Octo:
```bash
python octo_digit/scripts/finetune_fuse.py --config=octo_digit/scripts/configs/fuse_config.py
```
For PaliVLA:
```bash
python palivla_digit/palivla/train_fuse.py --config=palivla_digit/palivla/configs/fuse_config.py
```

## Inference with Pretrained Models
Install `bridge_with_digit` on the robot controller and start the action server.

Download the pretrained models from the [HuggingFace model hub](https://huggingface.co/oier-mees/FuSe).

For Octo:
```bash
python octo_digit/eval/fuse_eval.py --checkpoint_weights_path=ckpt.pth
```
For PaliVLA:
```bash
python palivla_digit/eval_palivla.py --checkpoint_dir=ckpt.pth
```

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. PaliVLA is licensed under the Apache 2.0 License - see the [LICENSE](palivla_digit/LICENSE) file for details.

## Citation

```bibtex
@article{jones2025fuse,
title={Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding},
author={Jones, Joshua and Mees, Oier and Sferrazza, Carmelo and Stachowicz, Kyle and Abbeel, Pieter and Levine, Sergey},
journal={arXiv preprint arXiv:2501.04693},
year={2025}
}
```