https://github.com/nnanhuang/SegAnyMo
[CVPR 2025] Code for Segment Any Motion in Videos
- Host: GitHub
- URL: https://github.com/nnanhuang/SegAnyMo
- Owner: nnanhuang
- License: mit
- Created: 2024-12-25T09:28:57.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-03-31T01:22:48.000Z (about 2 months ago)
- Last Synced: 2025-03-31T02:25:46.461Z (about 2 months ago)
- Topics: moving-object-segmentation, sam2, tracking, video, video-processing
- Language: Jupyter Notebook
- Homepage: https://motion-seg.github.io/
- Size: 41 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# [CVPR 2025] Segment Any Motion in Videos
**[Project Page](https://motion-seg.github.io/) | [Arxiv](https://arxiv.org/abs/2503.22268)**

[Nan Huang](https://github.com/nnanhuang)<sup>1,2</sup>,
[Wenzhao Zheng](https://wzzheng.net/)<sup>1</sup>,
[Chenfeng Xu](https://www.chenfengx.com/)<sup>1</sup>,
[Kurt Keutzer](https://people.eecs.berkeley.edu/~keutzer/)<sup>1</sup>,
[Shanghang Zhang](https://www.shanghangzhang.com/)<sup>2</sup>,
[Angjoo Kanazawa](https://people.eecs.berkeley.edu/~kanazawa/)<sup>1</sup>,
[Qianqian Wang](https://qianqianwang68.github.io/)<sup>1</sup>

<sup>1</sup>UC Berkeley <sup>2</sup>Peking University
**Overview of Our Pipeline.** We take 2D tracks and depth maps generated by off-the-shelf models as input, which are then processed by a motion encoder to capture motion patterns, producing featured tracks. Next, a tracks decoder that integrates DINO features decodes the featured tracks by decoupling motion and semantic information, ultimately yielding the dynamic trajectories (a). Finally, using SAM2, we group dynamic tracks belonging to the same object and generate fine-grained moving object masks (b).

## Contents
This repository contains the code for Segment Any Motion in Videos.

- [Installation](#installation)
- [Usage](#usage)
  - [Preprocessing](#preprocessing)
  - [Tracks Label Prediction](#tracks-label-prediction)
  - [Mask Densification Using SAM2](#mask-densification-using-sam2)
- [Evaluation](#evaluation)
  - [Download Pre-computed Results](#download-pre-computed-results)
  - [MOS task evaluation](#mos-task-evaluation)
  - [Fine-grained MOS task evaluation](#fine-grained-mos-task-evaluation)
- [Model Training](#model-training)
  - [Preprocess-data](#preprocess-data)
  - [Training](#training)

## Installation
Our code is developed on Ubuntu 22.04 using Python 3.12 and PyTorch 2.4.0+cu121 on an NVIDIA RTX A6000. Please note that the code has only been tested with these specific versions. We recommend using conda to install the dependencies.

```bash
git clone --recurse-submodules https://github.com/nnanhuang/SegAnyMo
cd SegAnyMo/
conda create -n seg python=3.12.4
conda activate seg
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
```

- Download DINOv2 for preprocessing:
```bash
cd preproc
git clone https://github.com/facebookresearch/dinov2
```

- Install the environments and download the checkpoints for SAM2 and TAPNet:
```bash
# SAM2 (sam2_hiera_large.pt)
cd sam2
pip install -e .

cd checkpoints && \
./download_ckpts.sh && \
cd ../..

# install tapnet
cd preproc/tapnet
pip install .

cd ..
mkdir checkpoints
cd checkpoints
wget https://storage.googleapis.com/dm-tapnet/bootstap/bootstapir_checkpoint_v2.pt
```
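As a quick sanity check, the two checkpoints referenced above should now be on disk; a minimal sketch, assuming each block above was run starting from the repository root (so the TAPIR checkpoint lands in `preproc/checkpoints/` exactly as the commands are written):

```bash
# both files should exist if the download steps above succeeded
ls sam2/checkpoints/sam2_hiera_large.pt
ls preproc/checkpoints/bootstapir_checkpoint_v2.pt
```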
## Usage
You can use our method very simply with just three commands:
```bash
# Make sure you have set up the environment and downloaded all the checkpoints.
# Make sure you have written the model checkpoint path into the config file.

# 1. Data Preprocessing:
python core/utils/run_inference.py --video_path $VIDEO_PATH --gpus $GPU_ID --depths --tracks --dinos --e
# 2.Predicting Per-Track Motion Labels:
python core/utils/run_inference.py --video_path $VIDEO_PATH --motin_seg_dir $OUTPUT_DIR --config_file $PATH --gpus $GPU_ID --motion_seg_infer --e
# 3.Generating the Final Masks:
python core/utils/run_inference.py --video_path $VIDEO_PATH --sam2dir $RESULT_DIR --motin_seg_dir $OUTPUT_DIR --gpus $GPU_ID --sam2 --e

# For example:
python core/utils/run_inference.py --data_dir ./data/images --gpus 0 1 2 3 --depths --tracks --dinos --e
python core/utils/run_inference.py --data_dir ./data/images --motin_seg_dir ./result/moseg --config_file ./configs/example.yaml --gpus 0 1 2 3 --motion_seg_infer --e
python core/utils/run_inference.py --data_dir ./data/images --sam2dir ./result/sam2 --motin_seg_dir ./result/moseg --gpus 0 1 2 3 --sam2 --e
# or use --video_path ./data/video.mp4
```
Please see below for specific usage details.

### Preprocessing
We depend on the following third-party libraries for preprocessing:
1. Monocular depth: [Depth Anything v2](https://github.com/DepthAnything/Depth-Anything-V2)
2. 2D Tracks: [TAPIR](https://github.com/google-deepmind/tapnet)
3. DINO features: [DINOv2](https://github.com/facebookresearch/dinov2)

- The processed root dirs should be organized as:
```
data
├── images
│ ├── scene_name
│ │ ├── image_name
│ │ ├── ...
├── bootstapir
│ ├── scene_name
│ │ ├── image_name
│ │ ├── ...
├── dinos
│ ├── scene_name
│ │ ├── image_name
│ │ ├── ...
├── depth_anything_v2
│ ├── scene_name
│ │ ├── image_name
│ │ ├── ...
```

- We recommend enabling Efficiency Mode with `--e` to accelerate data processing. This mode speeds up the pipeline through frame-rate reduction, interval sampling, and resolution scaling.
- During inference, we use a stride of 10 (`--step`) when processing image sequences, meaning only every 10th frame (designated as a Query Frame) is considered valid and processed. All data processing operates exclusively on these Query Frames. You can improve model performance by setting a smaller value (see the example after the code block below).
- You can generate depth maps, DINO features, and 2D tracks with the commands below (~10 min). Use `--data_dir` if your input is an image sequence and `--video_path` if your input is a video.
```bash
python core/utils/run_inference.py --data_dir $DATA_DIR --gpus $GPU_ID --depths --tracks --dinos --e

python core/utils/run_inference.py --video_path $VIDEO_PATH --gpus $GPU_ID --depths --tracks --dinos --e
```
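For example, to trade some speed for accuracy you can request a denser query-frame stride at preprocessing time; a minimal sketch, assuming `--step` can simply be appended to the flags above:

```bash
# hypothetical example: treat every 5th frame as a Query Frame instead of the default 10
python core/utils/run_inference.py --data_dir ./data/images --gpus 0 --depths --tracks --dinos --e --step 5
```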
### Tracks Label Prediction
- First download the model checkpoint and write its path into the `resume_path` field of `configs/example_train.yaml` (see the sketch at the end of this section).
  - You can download it from [Hugging Face](https://huggingface.co/Changearthmore/moseg),
  - or from [Google Drive](https://drive.google.com/file/d/15VWtEqsROKAxdZbzaXrrmCm4k1D8SJJR/view?usp=drive_link).
- Run inference after the depth maps, DINO features, and 2D tracks have been processed. The predicted results will be saved in `motin_seg_dir`.
```bash
python core/utils/run_inference.py --data_dir $DATA_DIR --motin_seg_dir $OUTPUT_DIR --config_file $PATH --gpus $GPU_ID --motion_seg_infer --e
```
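The `resume_path` step mentioned at the top of this section can be scripted; a minimal sketch, assuming the Hugging Face repo above and a placeholder checkpoint filename (substitute the file you actually downloaded):

```bash
# fetch the checkpoint from Hugging Face (requires the huggingface_hub CLI)
huggingface-cli download Changearthmore/moseg --local-dir ./checkpoints/moseg

# point resume_path at the downloaded file; "model.pth" is a placeholder name
sed -i 's|resume_path:.*|resume_path: ./checkpoints/moseg/model.pth|' configs/example_train.yaml
grep resume_path configs/example_train.yaml
```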
### Mask Densification Using SAM2

- Run prediction and save the mask and video results. `sam2dir` is where the SAM2-predicted masks are saved, `data_dir` is the directory of the original RGB images, and `motin_seg_dir` holds the results of the Tracks Label Prediction model, which contain the dynamic trajectories and visibilities.

```bash
python core/utils/run_inference.py --data_dir $data_dir --sam2dir $result_dir --motin_seg_dir $tracks_label_result --gpus $GPU_ID --sam2 --e
```
- Coordinate convention: the trajectory output is (x, y), and SAM2 also expects (x, y) as input.
- Important: the official [SAM2](https://github.com/facebookresearch/sam2) code hard-codes the image sequence suffix to `.jpg`/`.jpeg` and expects frame names to be purely numeric. You can either rename the images or change the code. We change the code in this repo.
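If you prefer renaming the frames rather than changing the code, a minimal sketch (assumes the frames are already `.jpg` and that lexicographic order is the frame order; `scene_name` is a placeholder):

```bash
# rename frames to purely numeric, zero-padded .jpg names
# (only needed if you run the unpatched upstream SAM2 loader)
i=0
for f in data/images/scene_name/*.jpg; do
  mv "$f" "data/images/scene_name/$(printf '%05d' "$i").jpg"
  i=$((i+1))
done
```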
## Evaluation

### Download Pre-computed Results
* The masks pre-computed by us can be found [here](https://drive.google.com/file/d/1zpQJJ5nWQr3ezQ1Dc_RVjzNTce72QXq6/view?usp=drive_link).
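To fetch them from the command line, a minimal sketch (assumes `gdown` is installed and that the Drive file is a zip archive; adjust the output name and the extraction step if it is not):

```bash
pip install gdown
# download the pre-computed masks by their Google Drive file ID
gdown 1zpQJJ5nWQr3ezQ1Dc_RVjzNTce72QXq6 -O precomputed_masks.zip
unzip precomputed_masks.zip -d ./result/precomputed
```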
### MOS task evaluation

For the DAVIS dataset, for example, we use the script below. You can pass a different `$eval_seq_list` to evaluate a subset of DAVIS, such as DAVIS2016-Moving.
```bash
CUDA_VISIBLE_DEVICES=7 python core/eval/eval_mask.py --res_dir $res-dir --eval_dir $gt-dir --eval_seq_list /$root-dir/core/utils/moving_val_sequences.txt
```
If you don't specify `eval_seq_list`, the full sequence list is used by default.
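To evaluate a custom subset, point `--eval_seq_list` at your own file; a minimal sketch, assuming the format is one sequence name per line (an assumption; check `core/utils/moving_val_sequences.txt` for the exact format):

```bash
# hypothetical custom list with two DAVIS sequences
cat > my_sequences.txt <<'EOF'
blackswan
breakdance
EOF
CUDA_VISIBLE_DEVICES=0 python core/eval/eval_mask.py --res_dir $res-dir --eval_dir $gt-dir --eval_seq_list ./my_sequences.txt
```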
### Fine-grained MOS task evaluation

```bash
cd core/eval/davis2017-evaluation
CUDA_VISIBLE_DEVICES=3 python evaluation_method.py --task unsupervised --results_path $mask_path
```

## Model Training
### Preprocess data
We take HOI4D as an example.
```bash
# preprocess images and dynamic masks
python core/utils/process_HOI.py
# preprocess everything else (tracks, depths, DINO features)
python core/utils/run_inference.py --data_dir current-data-dir/kubric/movie_f/validation/images --gpus 0 1 2 3 4 5 6 7 --tracks --depths --dinos
```
- (Optional) You can use this script to check whether all the data has been processed:
```bash
python current-data-dir/dynamic_stereo/dynamic_replica_data/check_process.py
```
- If you want to train on a custom dataset, it should provide ground-truth RGB frames and dynamic masks (see the sketch at the end of this subsection).
- (Optional) After processing, clean the data to save storage:
```bash
python core/utils/run_inference.py --data_dir $data_dir --gpus 0 1 2 3 4 5 6 7 --clean
```
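For the custom-dataset case mentioned above, a minimal sketch of one possible layout (the directory names are placeholders, not names the training code is guaranteed to expect; mirror whatever `process_HOI.py` produces for HOI4D):

```bash
# hypothetical layout: per-scene RGB frames plus ground-truth dynamic masks,
# then preprocess with run_inference.py exactly like the datasets above
mkdir -p my_dataset/images/scene_name
mkdir -p my_dataset/dynamic_masks/scene_name
python core/utils/run_inference.py --data_dir my_dataset/images --gpus 0 --tracks --depths --dinos
```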
### Training

- Process the Kubric dataset (we train on the Kubric-Movie-F subset); process dynamic-stereo & HOI4D as above.
- When preprocessing the dynamic-stereo dataset, we use `cal_dynamic_mask.py` to obtain the ground-truth dynamic masks by computing the ground-truth trajectory motion.
- Train on these datasets together:
```bash
CUDA_VISIBLE_DEVICES=3 python train_seq.py ./configs/$CONFIG.yaml
```

## Citation
If you find our repo or paper useful, please cite us as:

```
@misc{huang2025segmentmotionvideos,
title={Segment Any Motion in Videos},
author={Nan Huang and Wenzhao Zheng and Chenfeng Xu and Kurt Keutzer and Shanghang Zhang and Angjoo Kanazawa and Qianqian Wang},
year={2025},
eprint={2503.22268},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.22268},
}
```