https://github.com/baaivision/ursa
Uniform Discrete Diffusion with Metric Path for Video Generation
- Host: GitHub
- URL: https://github.com/baaivision/ursa
- Owner: baaivision
- License: apache-2.0
- Created: 2025-10-17T00:31:47.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-01-15T06:26:52.000Z (5 months ago)
- Last Synced: 2026-01-15T12:46:13.060Z (5 months ago)
- Topics: diffusion-forcing, discrete-diffusion, image-generation, video-generation
- Language: Python
- Homepage:
- Size: 9.64 MB
- Stars: 89
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# URSA: Uniform Discrete Diffusion with Metric Path for Video Generation
[Haoge Deng](https://scholar.google.com/citations?user=S2sbvjgAAAAJ&hl)1,4*, [Ting Pan](https://scholar.google.com/citations?&user=qQv6YbsAAAAJ)2,4*, [Fan Zhang](https://scholar.google.com/citations?user=VsJ39HMAAAAJ)4*, [Yang Liu](https://scholar.google.com/citations?user=9JcQ2hwAAAAJ&hl)3,4*, [Zhuoyan Luo](https://scholar.google.com/citations?user=mKQhEsIAAAAJ&hl)4, [Yufeng Cui](https://scholar.google.com/citations?user=5Ydha2EAAAAJ&hl)4, [Wenxuan Wang](https://scholar.google.com/citations?user=75OyC-oAAAAJ&hl)4
[Chunhua Shen](https://scholar.google.com/citations?user=Ljk2BvIAAAAJ&hl)3, [Shiguang Shan](https://scholar.google.com/citations?user=Vkzd7MIAAAAJ&hl)2, [Zhaoxiang Zhang](https://scholar.google.com/citations?user=qxWfV6cAAAAJ&hl)1†, [Xinlong Wang](https://scholar.google.com/citations?user=DPz0DjYAAAAJ&hl)4†
[CASIA](http://english.ia.cas.cn)1, [CASICT](http://english.ict.cas.cn)2, [ZJU](https://www.zju.edu.cn/english)3, [BAAI](https://www.baai.ac.cn/en)4
* Equal Contribution, † Corresponding Author
We present **URSA** (**U**niform disc**R**ete diffu**S**ion with metric p**A**th), a simple yet powerful framework that narrows the gap between discrete and continuous approaches. **URSA** formulates video generation as an iterative global refinement of discrete spatiotemporal tokens, scales efficiently to long videos, and requires fewer inference steps. With an asynchronous timestep scheduling strategy, a single unified model handles multiple video generation tasks.
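One way to picture the asynchronous timestep scheduling mentioned above is to give every frame its own timestep, so that conditioning frames stay nearly clean while the remaining frames are denoised together. The sketch below is illustrative only and is not URSA's scheduler: `frame_timesteps` is a hypothetical helper, the convention that `t = 1.0` means clean and `t = 0.0` means fully noised is an assumption, and so is the reading of `cond_noise_scale` as a small amount of noise applied to the conditioning frames.

```python
def frame_timesteps(num_frames, num_cond_frames, t, cond_noise_scale=0.0):
    """Return one timestep per frame (hypothetical helper, not part of URSA).

    Assumed convention: 1.0 = clean frame, 0.0 = fully noised frame.
    """
    if not 0 <= num_cond_frames <= num_frames:
        raise ValueError("num_cond_frames must lie in [0, num_frames]")
    cond_t = 1.0 - cond_noise_scale  # conditioning frames stay almost clean
    return [cond_t] * num_cond_frames + [t] * (num_frames - num_cond_frames)


# The same function covers three tasks by changing only the
# number of leading frames that are treated as conditioning.
text_to_video = frame_timesteps(49, 0, t=0.3)
image_to_video = frame_timesteps(49, 1, t=0.3)
video_to_video = frame_timesteps(49, 13, t=0.3, cond_noise_scale=0.1)
```

Under these assumptions, text-to-video, image-to-video, and video extension differ only in how many leading frames are held as conditioning.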
## News
- ```[Jan 2026]``` Released [Training Guide](./docs/training.md).
- ```[Oct 2025]``` URSA is part of [Emu3.5](https://github.com/baaivision/Emu3.5) as DiDA (Discrete Diffusion Adaptation)!
- ```[Oct 2025]``` Released TI2V Demo.
- ```[Oct 2025]``` Released [Paper](https://arxiv.org/abs/2510.24717) & [Project Page](http://bitterdhg.github.io/URSA_page) & [Evaluation Guide](./docs/evaluation.md).
## Highlights
- **Novel Approach**: Uniform Discrete Diffusion with Metric Path.
- **SOTA Performance**: High efficiency with state-of-the-art T2I/T2V/I2V results.
- **Unified Modeling**: Multi-task capabilities in a single unified model.
## Models
### Text to Image
| Model | Resolution | Data | Weight | GenEval | DPGBench |
|:-----:|:----------:|:----:|:------:|:-------:|:--------:|
| URSA-0.6B-IBQ1024 | 1024x1024 | 30M | [HF](https://huggingface.co/BAAI/URSA-0.6B-IBQ1024) \| [ModelScope](https://www.modelscope.cn/models/BAAI/URSA-0.6B-IBQ1024) | 0.79 | 85.6 |
| URSA-1.7B-IBQ1024 | 1024x1024 | 30M | [HF](https://huggingface.co/BAAI/URSA-1.7B-IBQ1024) \| [ModelScope](https://www.modelscope.cn/models/BAAI/URSA-1.7B-IBQ1024) | 0.80 | 86.0 |
### Text to Video
| Model | Resolution | Data | Weight | VBench-T2V | VBench-I2V |
|:-----:|:----------:|:----:|:------:|:----------:|:----------:|
| URSA-0.6B-FSQ320 | 49x512x320 | 24M | [HF](https://huggingface.co/BAAI/URSA-0.6B-FSQ320) \| [ModelScope](https://www.modelscope.cn/models/BAAI/URSA-0.6B-FSQ320) | 81.4 | 86.0 |
| URSA-1.7B-FSQ320 | 49x512x320 | 24M | [HF](https://huggingface.co/BAAI/URSA-1.7B-FSQ320) \| [ModelScope](https://www.modelscope.cn/models/BAAI/URSA-1.7B-FSQ320) | 82.4 | 86.2 |
## Table of Contents
- [Installation](#installation)
- [Quick Start](#quick-start)
  - [Image Generation](#quickstart-image-generation)
  - [Video Generation](#quickstart-video-generation)
- [Gradio Demo](#gradio-demo)
- [Evaluation](./docs/evaluation.md)
- [Training](./docs/training.md)
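## Installation
Install the dependencies, then clone this repository to local disk and install it:
```bash
# Quote the version specifier so the shell does not treat ">" as a redirect.
pip install diffusers "transformers>=4.57.1" accelerate imageio imageio-ffmpeg omegaconf wandb
git clone https://github.com/baaivision/URSA.git
cd URSA && pip install .
```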
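To confirm that the installed `transformers` satisfies the `>=4.57.1` requirement above, a small check along these lines may help. It is a minimal sketch using only the standard library: `version_tuple`, `meets_minimum`, and `check_transformers` are hypothetical helpers, and the comparison looks only at the first three numeric components, so it ignores pre-release ordering.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '4.57.1' or '4.58.0.dev0' into a tuple of up to three integers."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def meets_minimum(installed, minimum="4.57.1"):
    """Compare the numeric release components only (simplified comparison)."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers(minimum="4.57.1"):
    """Return True if transformers is installed and new enough, else False."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


if __name__ == "__main__":
    print("transformers OK:", check_transformers())
```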
## Quick Start
### Image Generation
```python
import torch
from diffnext.pipelines import URSAPipeline

model_id, height, width = "BAAI/URSA-1.7B-IBQ1024", 1024, 1024
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = URSAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to(torch.device("cuda"))

prompt = "The bear, calm and still, gazes upward as if lost in contemplation of the cosmos."
negative_prompt = "worst quality, low quality, inconsistent motion, static, still, blurry, jittery, distorted, ugly"
# `**locals()` forwards every variable defined above (prompt, negative_prompt,
# height, width, ...) to the pipeline as keyword arguments.
image = pipe(**locals()).frames[0]
image.save("ursa.jpg")
```
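Because `pipe(**locals())` forwards every local name, including modules and the pipeline object itself, a narrower call can be easier to reason about in larger scripts. The sketch below is a hypothetical helper, not part of `diffnext`: `select_kwargs` keeps only an allowlist of argument names, and the allowlist is simply the set of variables this README passes through `locals()`, on the assumption that the pipeline accepts each of them as a keyword argument.

```python
# Names this README forwards to the pipeline via `**locals()` (assumed to be
# accepted as keyword arguments; adjust to the pipeline's real signature).
PIPELINE_ARGS = (
    "prompt", "negative_prompt", "height", "width",
    "num_frames", "num_inference_steps",
    "image", "video", "num_cond_frames", "cond_noise_scale",
)


def select_kwargs(namespace, names=PIPELINE_ARGS):
    """Pick only the allowlisted names that are present in `namespace`."""
    return {name: namespace[name] for name in names if name in namespace}


# Usage in place of `pipe(**locals())`:
#     image = pipe(**select_kwargs(locals())).frames[0]
example = select_kwargs({"prompt": "a bear", "height": 1024, "torch": object()})
```

Values set to `None` are kept, so resetting `image` or `video` to `None` still reaches the pipeline as before.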
### Video Generation
```python
import os, torch, numpy
from diffnext.pipelines import URSAPipeline
from diffnext.utils import export_to_video

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
model_id, height, width = "BAAI/URSA-1.7B-FSQ320", 320, 512
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = URSAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to(torch.device("cuda"))
text_prompt = "a lone grizzly bear walks through a misty forest at dawn, sunlight catching its fur."
negative_prompt = "worst quality, low quality, inconsistent motion, static, still, blurry, jittery, distorted, ugly"

# Every call below uses `**locals()`, so all variables set so far are forwarded to the pipeline.
# Text-to-Image
prompt = text_prompt
num_frames, num_inference_steps = 1, 25
image = pipe(**locals()).frames[0]
image.save("ursa.jpg")

# Image-to-Video: `image` from the step above is still in scope and is passed along as conditioning.
prompt = f"motion=9.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
video = pipe(**locals()).frames[0]
export_to_video(video, "ursa_1+48f.mp4", fps=12)

# Text-to-Video: clear `image` and `video` so no visual conditioning is forwarded.
image, video = None, None
prompt = f"motion=9.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
video = pipe(**locals()).frames[0]
export_to_video(video, "ursa_49f.mp4", fps=12)

# Video-to-Video: extend the clip, conditioning each round on its last `num_cond_frames` frames.
prompt = f"motion=5.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
num_cond_frames, cond_noise_scale = 13, 0.1
for i in range(12):
    video, start_video = video[-num_cond_frames:], video
    video = pipe(**locals()).frames[0]
    # Drop the overlapping conditioning frames before appending the new ones.
    video = numpy.concatenate([start_video, video[num_cond_frames:]])
    export_to_video(video, "ursa_{}f.mp4".format(video.shape[0]), fps=12)
```
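The video-to-video loop above grows the clip by `num_frames - num_cond_frames` new frames per round, because the first `num_cond_frames` frames of each output overlap the conditioning frames and are sliced off before concatenation. A small worked calculation, assuming the starting clip is the 49-frame text-to-video result and each pipeline call returns exactly `num_frames` frames (`extended_length` is a hypothetical helper, not part of `diffnext`):

```python
def extended_length(num_frames=49, num_cond_frames=13, rounds=12):
    """Total frame count after `rounds` extension steps (hypothetical helper)."""
    new_per_round = num_frames - num_cond_frames  # 49 - 13 = 36 new frames
    return num_frames + rounds * new_per_round


# Clip length after each of the 12 rounds in the README loop.
lengths = [extended_length(rounds=r) for r in range(1, 13)]
total = lengths[-1]          # 49 + 12 * 36 = 481 frames
seconds = total / 12         # duration at fps=12
```

With the README's settings this gives 481 frames, roughly 40 seconds of video at 12 fps.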
## Gradio Demo
```bash
# Text-to-Image (T2I)
python scripts/app_ursa_t2i.py --model "BAAI/URSA-1.7B-IBQ1024" --device 0
# Text-to-Image-to-Video (TI2V)
python scripts/app_ursa_ti2v.py --model "BAAI/URSA-1.7B-FSQ320" --device 0
```
## Todo List
- [X] [Model Zoo](#model-zoo)
- [X] [Quick Start](#quick-start)
- [X] [Gradio Demo](#gradio-demo)
- [X] [Evaluation Guide](./docs/evaluation.md)
- [X] [Training Guide](./docs/training.md)
- [ ] 4B Model
## Citation
If you find this repository useful, please consider giving it a star and citing:
```
@article{deng2025ursa,
title={Uniform Discrete Diffusion with Metric Path for Video Generation},
author={Deng, Haoge and Pan, Ting and Zhang, Fan and Liu, Yang and Luo, Zhuoyan and Cui, Yufeng and Wang, Wenxuan and Shen, Chunhua and Shan, Shiguang and Zhang, Zhaoxiang and Wang, Xinlong},
journal={arXiv preprint arXiv:2510.24717},
year={2025}
}
```
```
@article{deng2024nova,
title={Autoregressive Video Generation without Vector Quantization},
author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
journal={arXiv preprint arXiv:2412.14169},
year={2024}
}
```
## Acknowledgement
We thank the following repositories:
- [NOVA](https://github.com/baaivision/NOVA). NOVA is the predecessor of URSA.
- [FlowMatching](https://github.com/facebookresearch/flow_matching). This codebase systematically provides CFM and DFM implementations.
- [FUDOKI](https://github.com/fudoki-hku/FUDOKI). This codebase provides a naive multimodal DFM implementation.
- [CodeWithGPU](https://github.com/seetacloud/codewithgpu). The CodeWithGPU library is the core of our data loading pipeline.
## License
Code and models are licensed under [Apache License 2.0](LICENSE).