https://github.com/cambrian-mllm/cambrian-s

Cambrian-S: Towards Spatial Supersensing in Video
https://github.com/cambrian-mllm/cambrian-s

computer-vision llm multimodal-large-language-models spatial-understanding vision-language-model


Host: GitHub
URL: https://github.com/cambrian-mllm/cambrian-s
Owner: cambrian-mllm
License: apache-2.0
Created: 2025-10-13T21:46:59.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-12-22T05:57:48.000Z (6 months ago)
Last Synced: 2025-12-23T17:08:14.187Z (6 months ago)
Topics: computer-vision, llm, multimodal-large-language-models, spatial-understanding, vision-language-model
Language: Python
Homepage: https://cambrian-mllm.github.io/
Size: 4.2 MB
Stars: 436
Watchers: 5
Forks: 14
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

# *Cambrian-S*: Towards Spatial Supersensing in Video

Shusheng Yang*, Jihan Yang*, Pinzhi Huang†, Ellis Brown†, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie

*Equal Contribution    †Core Contributor





## Release

- [Dec 21, 2025] 🚀 [Cambrian-S-3M](https://huggingface.co/datasets/nyu-visionx/Cambrian-S-3M) (our collection of 3M open-sourced video instruction tuning data) is now available! Please check it out!

- [Nov 6, 2025] 🔥 We release Cambrian-S model weights, training code, and evaluation suite.

- [Nov 6, 2025] 🔥 We release VSI-SUPER, a benchmark designed for spatial supersensing.

- [Nov 6, 2025] 🔥 We release VSI-590K, a dataset curated for spatial sensing.

## Contents

- [ *Cambrian-S*: Towards Spatial Supersensing in Video](#-cambrian-s-towards-spatial-supersensing-in-video)

  - [Release](#release)

  - [Contents](#contents)

  - [Cambrian-S Weights](#cambrian-s-weights)

    - [General Model Performance](#general-model-performance)

    - [VSI-SUPER Performance](#vsi-super-performance)

    - [Model Card](#model-card)

      - [Model Trained with Predictive Sensing](#model-trained-with-predictive-sensing)

      - [Standard MLLM Models](#standard-mllm-models)

  - [VSI-590K Dataset](#vsi-590k-dataset)

  - [Train](#train)

  - [Evaluation](#evaluation)

  - [Citation](#citation)

  - [Related Projects](#related-projects)

## Cambrian-S Weights

Here are our Cambrian-S checkpoints along with instructions on how to use the weights. Our models excel at spatial reasoning in video understanding, demonstrating significant improvements over previous state-of-the-art methods on spatial understanding benchmarks while maintaining competitive performance on general video understanding tasks.

### General Model Performance

Comparison of Cambrian-S with other leading MLLMs on general video understanding benchmarks.



    



**Results**: Cambrian-S maintains competitive performance on standard video benchmarks (Perception Test and EgoSchema) while excelling at spatial reasoning tasks.

### VSI-SUPER Performance

VSI-SUPER results are reported for **Cambrian-S-7B-LFP**.



    

    



### Model Card

#### Model Trained with Predictive Sensing

| Model | Base LLM | Vision Encoder | Hugging Face |
|-------|----------|----------------|--------------|
| Cambrian-S-7B-LFP | `Qwen2.5-7B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-7B-LFP](https://huggingface.co/nyu-visionx/Cambrian-S-7B-LFP) |

#### Standard MLLM Models

| Model | Base LLM | Vision Encoder | Hugging Face |
|-------|----------|----------------|--------------|
| Cambrian-S-7B | `Qwen2.5-7B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-7B](https://huggingface.co/nyu-visionx/Cambrian-S-7B) |
| Cambrian-S-3B | `Qwen2.5-3B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-3B](https://huggingface.co/nyu-visionx/cambrian-s-3b) |
| Cambrian-S-1.5B | `Qwen2.5-1.5B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-1.5B](https://huggingface.co/nyu-visionx/cambrian-s-1.5b) |
| Cambrian-S-0.5B | `Qwen2.5-0.5B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-0.5B](https://huggingface.co/nyu-visionx/cambrian-s-0.5b) |
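
As a rough illustration, the checkpoints listed above could be fetched with the `huggingface_hub` client. This is a minimal sketch and not part of the repository: `CHECKPOINTS`, `download_checkpoint`, and the `checkpoints/` directory are hypothetical names, and the repo IDs are copied from the links in the tables. Running inference on the downloaded weights still requires the model code in this repository.

```python
from pathlib import Path

# Repo IDs copied from the model card links above.
CHECKPOINTS = {
    "Cambrian-S-7B-LFP": "nyu-visionx/Cambrian-S-7B-LFP",
    "Cambrian-S-7B": "nyu-visionx/Cambrian-S-7B",
    "Cambrian-S-3B": "nyu-visionx/cambrian-s-3b",
    "Cambrian-S-1.5B": "nyu-visionx/cambrian-s-1.5b",
    "Cambrian-S-0.5B": "nyu-visionx/cambrian-s-0.5b",
}


def download_checkpoint(name: str, root: str = "checkpoints") -> Path:
    """Download one checkpoint snapshot and return its local directory.

    Hypothetical helper; `root` is an assumed local layout.
    """
    if name not in CHECKPOINTS:
        raise KeyError(f"unknown model {name!r}; choose from {sorted(CHECKPOINTS)}")
    # Imported lazily so the table above is usable without the dependency.
    from huggingface_hub import snapshot_download

    local_dir = Path(root) / name
    snapshot_download(repo_id=CHECKPOINTS[name], local_dir=str(local_dir))
    return local_dir
```

For example, `download_checkpoint("Cambrian-S-0.5B")` would place the smallest model under `checkpoints/Cambrian-S-0.5B`.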

## VSI-590K Dataset

VSI-590K is a video instruction-tuning dataset focusing on spatial understanding. 



    



**VSI-590K dataset statistics.** QAs are grouped by question type (left) and task group (right).

**Hugging Face**: [nyu-visionx/VSI-590K](https://huggingface.co/datasets/nyu-visionx/vsi-590k)

## Train

### Environment Preparation

Training is currently supported on TPUs via TorchXLA. Install TorchXLA 2.6.0 with the following commands:

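```bash
# Pinned PyTorch, torchvision, and TorchXLA releases
pip install torch==2.6.0 torchvision==0.21.0 torch_xla==2.6.0
# TPU runtime extras (libtpu)
pip install 'torch_xla[tpu]' -f https://storage.googleapis.com/libtpu-releases/index.html
# Pallas extras, resolved against the JAX nightly indexes
pip install 'torch_xla[pallas]' -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
# Upgrade pip, then install this repository in editable mode with its TPU extras
pip install --upgrade pip
pip install -e '.[tpu]'
```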
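
After installation, it can be useful to confirm that the pinned versions are the ones actually resolved. The following is a minimal sketch, not a script shipped with the repository: `PINS`, `find_mismatches`, and `installed_versions` are hypothetical names, and the pins are copied from the install commands above.

```python
from importlib import metadata

# Version pins taken from the install commands above.
PINS = {"torch": "2.6.0", "torchvision": "0.21.0", "torch_xla": "2.6.0"}


def find_mismatches(installed: dict) -> dict:
    """Return {package: (expected, found)} for every package off its pin.

    Local build suffixes such as "+cpu" are ignored in the comparison.
    """
    bad = {}
    for pkg, expected in PINS.items():
        found = installed.get(pkg)
        if found is None or found.split("+")[0] != expected:
            bad[pkg] = (expected, found)
    return bad


def installed_versions() -> dict:
    """Look up the installed version of each pinned package (None if missing)."""
    versions = {}
    for pkg in PINS:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions


if __name__ == "__main__":
    problems = find_mismatches(installed_versions())
    print("environment OK" if not problems else f"version mismatches: {problems}")
```

The comparison logic is kept separate from the lookup so it can be exercised without any of the packages installed.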

### Data Preparation

Cambrian-S models are trained on the [`Cambrian-Alignment`](https://huggingface.co/datasets/nyu-visionx/Cambrian-Alignment), [`Cambrian-7M`](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M), [`Cambrian-S-3M`](https://huggingface.co/datasets/nyu-visionx/Cambrian-S-3M), and [`VSI-590K`](https://huggingface.co/datasets/nyu-visionx/VSI-590K) datasets. Please prepare each dataset following its own guidelines.
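
One possible way to fetch the raw dataset repositories is through `huggingface_hub`. The sketch below is an assumption-laden illustration: `DATASETS`, `download_plan`, `download_all`, and the `data/` layout are hypothetical, the repo IDs are copied from the links above, and the directory structure the training scripts expect is defined by each dataset's guidelines rather than by this snippet. The datasets are large, so check available disk space before running `download_all`.

```python
# Dataset repo IDs copied from the links above. Cambrian-7M is linked
# under the Cambrian-10M repository.
DATASETS = {
    "Cambrian-Alignment": "nyu-visionx/Cambrian-Alignment",
    "Cambrian-7M": "nyu-visionx/Cambrian-10M",
    "Cambrian-S-3M": "nyu-visionx/Cambrian-S-3M",
    "VSI-590K": "nyu-visionx/VSI-590K",
}


def download_plan(root: str = "data") -> list:
    """Return (repo_id, local_dir) pairs, one per dataset, in training order."""
    return [(repo_id, f"{root}/{name}") for name, repo_id in DATASETS.items()]


def download_all(root: str = "data") -> None:
    """Download every dataset repository in the plan (hypothetical helper)."""
    # Imported lazily so the plan can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(root):
        snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
```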

### Training Scripts

As described in our paper, Cambrian-S models are trained in four stages: vision-language alignment, general image instruction tuning, general video instruction tuning, and finally spatial video tuning. For the Cambrian-S-LFP model, we modify the fourth stage by adding a latent frame prediction objective. We provide sample training scripts below:

* [cambrian/scripts/cambrians_7b_s1.sh](cambrian/scripts/cambrians_7b_s1.sh)

* [cambrian/scripts/cambrians_7b_s2.sh](cambrian/scripts/cambrians_7b_s2.sh)

* [cambrian/scripts/cambrians_7b_s3.sh](cambrian/scripts/cambrians_7b_s3.sh)

* [cambrian/scripts/cambrians_7b_s4.sh](cambrian/scripts/cambrians_7b_s4.sh)

* [cambrian/scripts/cambrians_7b_lfp_s4.sh](cambrian/scripts/cambrians_7b_lfp_s4.sh)
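
The stage ordering implied by the script names can be sketched as a small driver. This is an illustration only: `stage_scripts` and `run_stages` are hypothetical helpers, and the sketch assumes each script can be launched with `bash` from the repository root without arguments and that checkpoint paths between stages are configured inside the scripts themselves, neither of which is stated here.

```python
import subprocess

SCRIPT_DIR = "cambrian/scripts"


def stage_scripts(lfp: bool = False) -> list:
    """Ordered script paths for the four training stages.

    With lfp=True, the fourth stage uses the latent frame prediction script.
    """
    scripts = [f"{SCRIPT_DIR}/cambrians_7b_s{i}.sh" for i in (1, 2, 3)]
    last = "cambrians_7b_lfp_s4.sh" if lfp else "cambrians_7b_s4.sh"
    scripts.append(f"{SCRIPT_DIR}/{last}")
    return scripts


def run_stages(lfp: bool = False, dry_run: bool = True) -> None:
    """Print each stage command and, unless dry_run is set, execute it."""
    for script in stage_scripts(lfp):
        print("bash", script)
        if not dry_run:
            # check=True stops the pipeline as soon as a stage fails.
            subprocess.run(["bash", script], check=True)
```

Calling `run_stages(lfp=True)` with the default `dry_run=True` only prints the commands, which is a cheap way to confirm the order before launching a TPU job.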

## Evaluation

We have released our evaluation code in the [`lmms-eval/`](lmms-eval/) subfolder. Please see the README there for more details.

For detailed benchmark results, please refer to the [General Model Performance](#general-model-performance) and [VSI-SUPER Performance](#vsi-super-performance) sections above.

## Citation

If you find our work useful for your research, please consider citing:

```bibtex
@article{yang2025cambrians,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and Yang, Zihao and Yu, Yue and Tong, Shengbang and Zheng, Zihan and Xu, Yifan and Wang, Muhan and Lu, Daohan and Fergus, Rob and LeCun, Yann and Fei-Fei, Li and Xie, Saining},
  journal={arXiv preprint arXiv:2511.04670},
  year={2025}
}

@article{brown2025shortcuts,
  title={Benchmark Designers Should ``Train on the Test Set'' to Expose Exploitable Non-Visual Shortcuts},
  author={Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
  journal={arXiv preprint arXiv:2511.04655},
  year={2025}
}

@article{brown2025simsv,
  title={{SIMS-V}: Simulated Instruction-Tuning for Spatial Video Understanding},
  author={Brown, Ellis and Ray, Arijit and Krishna, Ranjay and Girshick, Ross and Fergus, Rob and Xie, Saining},
  journal={arXiv preprint arXiv:2511.04668},
  year={2025}
}

@article{yang2024think,
  title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
  author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali W. and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
  journal={arXiv preprint arXiv:2412.14171},
  year={2024}
}

@article{tong2024cambrian,
  title={{Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}},
  author={Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and Wang, Austin and Fergus, Rob and LeCun, Yann and Xie, Saining},
  journal={arXiv preprint arXiv:2406.16860},
  year={2024}
}
```

## Related Projects

- [Cambrian-1](https://github.com/cambrian-mllm/cambrian): A Fully Open, Vision-Centric Exploration of Multimodal LLMs

- [Thinking in Space](https://vision-x-nyu.github.io/thinking-in-space.github.io/): How Multimodal Large Language Models See, Remember and Recall Spaces - Introduces VSI-Bench for evaluating visual-spatial intelligence

- [SIMS-V](https://ellisbrown.github.io/sims-v): Simulated Instruction-Tuning for Spatial Video Understanding

- [Test-Set Stress-Test](https://vision-x-nyu.github.io/test-set-training): Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
