https://github.com/cambrian-mllm/cambrian-s
Cambrian-S: Towards Spatial Supersensing in Video
computer-vision llm multimodal-large-language-models spatial-understanding vision-language-model
- Host: GitHub
- URL: https://github.com/cambrian-mllm/cambrian-s
- Owner: cambrian-mllm
- License: apache-2.0
- Created: 2025-10-13T21:46:59.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-12-22T05:57:48.000Z (6 months ago)
- Last Synced: 2025-12-23T17:08:14.187Z (6 months ago)
- Topics: computer-vision, llm, multimodal-large-language-models, spatial-understanding, vision-language-model
- Language: Python
- Homepage: https://cambrian-mllm.github.io/
- Size: 4.2 MB
- Stars: 436
- Watchers: 5
- Forks: 14
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# *Cambrian-S*: Towards Spatial Supersensing in Video
Shusheng Yang*,
Jihan Yang*,
Pinzhi Huang†,
Ellis Brown†,
Zihao Yang,
Yue Yu,
Shengbang Tong,
Zihan Zheng,
Yifan Xu,
Muhan Wang,
Daohan Lu,
Rob Fergus,
Yann LeCun,
Li Fei-Fei,
Saining Xie
*Equal Contribution †Core Contributor
## Release
- [Dec 21, 2025] 🚀 [Cambrian-S-3M](https://huggingface.co/datasets/nyu-visionx/Cambrian-S-3M), our collection of 3M open-source video instruction-tuning samples, is now available. Please check it out!
- [Nov 6, 2025] 🔥 We release Cambrian-S model weights, training code, and evaluation suite.
- [Nov 6, 2025] 🔥 We release VSI-SUPER, a benchmark designed for spatial supersensing.
- [Nov 6, 2025] 🔥 We release VSI-590K, a dataset curated for spatial sensing.
## Contents
- [*Cambrian-S*: Towards Spatial Supersensing in Video](#-cambrian-s-towards-spatial-supersensing-in-video)
  - [Release](#release)
  - [Contents](#contents)
  - [Cambrian-S Weights](#cambrian-s-weights)
    - [General Model Performance](#general-model-performance)
    - [VSI-SUPER Performance](#vsi-super-performance)
    - [Model Card](#model-card)
      - [Model Trained with Predictive Sensing](#model-trained-with-predictive-sensing)
      - [Standard MLLM Models](#standard-mllm-models)
  - [VSI-590K Dataset](#vsi-590k-dataset)
  - [Train](#train)
  - [Evaluation](#evaluation)
  - [Citation](#citation)
  - [Related Projects](#related-projects)
## Cambrian-S Weights
Here are our Cambrian-S checkpoints along with instructions on how to use the weights. Our models excel at spatial reasoning in video understanding, demonstrating significant improvements over previous state-of-the-art methods on spatial understanding benchmarks while maintaining competitive performance on general video understanding tasks.
### General Model Performance
Comparison of Cambrian-S with other leading MLLMs on general video understanding benchmarks.
**Results**: Cambrian-S maintains competitive performance on standard video benchmarks (Perception Test and EgoSchema) while excelling at spatial reasoning tasks.
### VSI-SUPER Performance
VSI-SUPER performance is evaluated on **Cambrian-S-7B-LFP**.
### Model Card
#### Model Trained with Predictive Sensing
| Model | Base-LLM | Vision Encoder | Hugging Face |
|-----------------|------------|----------------|------------------------------------------------------------------|
| Cambrian-S-7B-LFP | `Qwen2.5-7B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-7B-LFP](https://huggingface.co/nyu-visionx/Cambrian-S-7B-LFP) |
#### Standard MLLM Models
| Model | Base-LLM | Vision Encoder | Hugging Face |
|-----------------|------------|----------------|------------------------------------------------------------------|
| Cambrian-S-7B | `Qwen2.5-7B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-7B](https://huggingface.co/nyu-visionx/Cambrian-S-7B) |
| Cambrian-S-3B | `Qwen2.5-3B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-3B](https://huggingface.co/nyu-visionx/cambrian-s-3b) |
| Cambrian-S-1.5B | `Qwen2.5-1.5B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-1.5B](https://huggingface.co/nyu-visionx/cambrian-s-1.5b) |
| Cambrian-S-0.5B | `Qwen2.5-0.5B-Instruct` | `siglip2-so400m-patch14-384` | [nyu-visionx/Cambrian-S-0.5B](https://huggingface.co/nyu-visionx/cambrian-s-0.5b) |
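Since the checkpoints above are hosted on Hugging Face, one way to fetch them locally is `snapshot_download` from the `huggingface_hub` package. The following is a minimal sketch, assuming `huggingface_hub` is installed. The `CHECKPOINTS` mapping and the `repo_id_for` / `fetch_checkpoint` helpers are illustrative names, not part of this repository, and loading the downloaded weights for inference still goes through this repo's own code.

```python
"""Sketch: resolve a Cambrian-S model name to its Hugging Face repo and download it."""
from typing import Optional

# Repo IDs copied from the link targets in the model card tables above.
CHECKPOINTS = {
    "Cambrian-S-7B-LFP": "nyu-visionx/Cambrian-S-7B-LFP",
    "Cambrian-S-7B": "nyu-visionx/Cambrian-S-7B",
    "Cambrian-S-3B": "nyu-visionx/cambrian-s-3b",
    "Cambrian-S-1.5B": "nyu-visionx/cambrian-s-1.5b",
    "Cambrian-S-0.5B": "nyu-visionx/cambrian-s-0.5b",
}


def repo_id_for(model_name: str) -> str:
    """Return the Hugging Face repo ID for a model name listed in the tables."""
    try:
        return CHECKPOINTS[model_name]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"unknown model {model_name!r}; expected one of: {known}") from None


def fetch_checkpoint(model_name: str, local_dir: Optional[str] = None) -> str:
    """Download every file of the checkpoint and return the local directory path."""
    # Imported lazily so the name lookup above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_for(model_name), local_dir=local_dir)
```

For example, `fetch_checkpoint("Cambrian-S-7B", local_dir="checkpoints/cambrian-s-7b")` would place the files under that (hypothetical) directory; omitting `local_dir` uses the default Hugging Face cache.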
## VSI-590K Dataset
VSI-590K is a video instruction-tuning dataset focusing on spatial understanding.
**VSI-590K dataset statistics.**
QAs are grouped by: question types (left) and task groups (right).
**Hugging Face**: [nyu-visionx/VSI-590K](https://huggingface.co/datasets/nyu-visionx/vsi-590k)
## Train
### Environment Preparation
Currently, we support training on TPU using TorchXLA. Install `TorchXLA 2.6.0` with the following commands:
```bash
pip install torch==2.6.0 torchvision==0.21.0 torch_xla==2.6.0
pip install 'torch_xla[tpu]' -f https://storage.googleapis.com/libtpu-releases/index.html
pip install 'torch_xla[pallas]' -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
pip install --upgrade pip
pip install -e '.[tpu]'
```
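After installation, it can help to confirm that the pinned versions from the commands above are what actually ended up in the environment. The sketch below uses only the standard library's `importlib.metadata`; the `check_versions` helper is a hypothetical name introduced here, and it only compares version strings rather than verifying that a TPU is reachable.

```python
"""Sketch: compare installed package versions against the pins used above."""
from importlib import metadata

# Version pins taken from the first `pip install` line above.
EXPECTED = {"torch": "2.6.0", "torchvision": "0.21.0", "torch_xla": "2.6.0"}


def check_versions(expected=EXPECTED, lookup=metadata.version):
    """Return {package: problem description} for every package that does not match."""
    problems = {}
    for package, wanted in expected.items():
        try:
            found = lookup(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        # Builds may carry a local suffix such as "+cpu"; compare the public part only.
        if found.split("+")[0] != wanted:
            problems[package] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_versions()
    if issues:
        for package, problem in issues.items():
            print(f"{package}: {problem}")
    else:
        print("All pinned packages match.")
```

An empty result means the three pins match; anything else names the package to reinstall.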
### Data Preparation
Cambrian-S models are trained on top of [`Cambrian-Alignment`](https://huggingface.co/datasets/nyu-visionx/Cambrian-Alignment), [`Cambrian-7M`](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M), [`Cambrian-S-3M`](https://huggingface.co/datasets/nyu-visionx/Cambrian-S-3M), and [`VSI-590K`](https://huggingface.co/datasets/nyu-visionx/VSI-590K) datasets. Please prepare these datasets following their corresponding guidelines.
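As a starting point for the downloads, the four dataset repositories linked above can be fetched with `snapshot_download` using `repo_type="dataset"`. This is a rough sketch, assuming `huggingface_hub` is installed; the `DATASETS` mapping, the helper names, and the one-folder-per-dataset layout are illustrative choices, not the layout the training scripts require. Each dataset's own guidelines still govern any extraction or preprocessing.

```python
"""Sketch: plan and run downloads for the datasets listed above."""
from pathlib import Path

# Repo IDs copied from the links above; Cambrian-7M is linked to the Cambrian-10M repo.
DATASETS = {
    "Cambrian-Alignment": "nyu-visionx/Cambrian-Alignment",
    "Cambrian-7M": "nyu-visionx/Cambrian-10M",
    "Cambrian-S-3M": "nyu-visionx/Cambrian-S-3M",
    "VSI-590K": "nyu-visionx/VSI-590K",
}


def download_plan(root):
    """Return (name, repo_id, target_dir) tuples, one per dataset, under `root`."""
    base = Path(root)
    return [(name, repo_id, base / name) for name, repo_id in DATASETS.items()]


def download_all(root, allow_patterns=None):
    """Download each dataset repo into its own folder under `root`."""
    # Imported lazily so the plan can be inspected without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for name, repo_id, target in download_plan(root):
        print(f"Downloading {name} from {repo_id} into {target}")
        snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            local_dir=str(target),
            allow_patterns=allow_patterns,  # e.g. ["*.json"] to fetch annotations first
        )
```

These repositories are large, so printing `download_plan(...)` first and passing `allow_patterns` to restrict the initial download is a cautious way to begin.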
### Training Scripts
As described in our paper, Cambrian-S models are trained in four stages: vision-language alignment, general image instruction tuning, general video instruction tuning, and finally spatial video tuning. For the Cambrian-S-LFP model, we modify the fourth stage by adding a latent frame prediction objective. We provide sample training scripts below:
* [cambrian/scripts/cambrians_7b_s1.sh](cambrian/scripts/cambrians_7b_s1.sh)
* [cambrian/scripts/cambrians_7b_s2.sh](cambrian/scripts/cambrians_7b_s2.sh)
* [cambrian/scripts/cambrians_7b_s3.sh](cambrian/scripts/cambrians_7b_s3.sh)
* [cambrian/scripts/cambrians_7b_s4.sh](cambrian/scripts/cambrians_7b_s4.sh)
* [cambrian/scripts/cambrians_7b_lfp_s4.sh](cambrian/scripts/cambrians_7b_lfp_s4.sh)
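The stage ordering described above can be sketched as a small driver that picks the right stage-4 script for the standard or LFP variant. This is only an illustration: it assumes the scripts are launched with `bash` from the repository root without extra arguments, and that checkpoint paths passed between stages are configured inside the scripts themselves. The `stage_scripts` and `run_stages` helpers are hypothetical names, not part of the repository.

```python
"""Sketch: list (and optionally run) the four training stages in order."""
import subprocess
from pathlib import Path

SCRIPT_DIR = Path("cambrian/scripts")


def stage_scripts(lfp: bool = False):
    """Return the four stage scripts in order; only stage 4 differs for LFP."""
    names = ["cambrians_7b_s1.sh", "cambrians_7b_s2.sh", "cambrians_7b_s3.sh"]
    names.append("cambrians_7b_lfp_s4.sh" if lfp else "cambrians_7b_s4.sh")
    return [SCRIPT_DIR / name for name in names]


def run_stages(lfp: bool = False, dry_run: bool = True):
    """Print each stage command; execute it as well when dry_run is False."""
    for stage, script in enumerate(stage_scripts(lfp), start=1):
        print(f"stage {stage}: bash {script}")
        if not dry_run:
            # check=True stops the sequence as soon as one stage fails.
            subprocess.run(["bash", str(script)], check=True)


if __name__ == "__main__":
    run_stages(lfp=False, dry_run=True)
```

Keeping `dry_run=True` as the default makes it easy to review the order before committing TPU time to a run.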
## Evaluation
We have released our evaluation code in the [`lmms-eval/`](lmms-eval/) subfolder. Please see the README there for more details.
For detailed benchmark results, please refer to the [General Model Performance](#general-model-performance) and [VSI-SUPER Performance](#vsi-super-performance) sections above.
## Citation
If you find our work useful for your research, please consider citing it:
```bibtex
@article{yang2025cambrians,
title={Cambrian-S: Towards Spatial Supersensing in Video},
author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and Yang, Zihao and Yu, Yue and Tong, Shengbang and Zheng, Zihan and Xu, Yifan and Wang, Muhan and Lu, Daohan and Fergus, Rob and LeCun, Yann and Fei-Fei, Li and Xie, Saining},
journal={arXiv preprint arXiv:2511.04670},
year={2025}
}
@article{brown2025shortcuts,
author = {Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
title = {Benchmark Designers Should ``Train on the Test Set'' to Expose Exploitable Non-Visual Shortcuts},
journal = {arXiv preprint arXiv:2511.04655},
year = {2025}
}
@article{brown2025simsv,
title = { {SIMS-V}: Simulated Instruction-Tuning for Spatial Video Understanding },
author = { Brown, Ellis and Ray, Arijit and Krishna, Ranjay and Girshick, Ross and Fergus, Rob and Xie, Saining },
journal = { arXiv preprint arXiv:2511.04668 },
year = { 2025 }
}
@article{yang2024think,
title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali W. and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
year={2024},
journal={arXiv preprint arXiv:2412.14171},
}
@article{tong2024cambrian,
title={{Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}},
author={Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and Wang, Austin and Fergus, Rob and LeCun, Yann and Xie, Saining},
journal={arXiv preprint arXiv:2406.16860},
year={2024}
}
```
## Related Projects
- [Cambrian-1](https://github.com/cambrian-mllm/cambrian): A Fully Open, Vision-Centric Exploration of Multimodal LLMs
- [Thinking in Space](https://vision-x-nyu.github.io/thinking-in-space.github.io/): How Multimodal Large Language Models See, Remember and Recall Spaces - Introduces VSI-Bench for evaluating visual-spatial intelligence
- [SIMS-V](https://ellisbrown.github.io/sims-v): Simulated Instruction-Tuning for Spatial Video Understanding
- [Test-Set Stress-Test](https://vision-x-nyu.github.io/test-set-training): Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts