# SAVE: Protagonist Diversification with Structure Agnostic Video Editing (ECCV 2024)

This repository contains the official implementation of
[SAVE: Protagonist Diversification with Structure Agnostic Video Editing](https://arxiv.org/abs/2312.02503).

[![Project Website](https://img.shields.io/badge/Project-Website-orange)](https://ldynx.github.io/SAVE/)
[![arXiv 2312.02503](https://img.shields.io/badge/arXiv-2312.02503-red)](https://arxiv.org/abs/2312.02503)

## Teaser

🐱 A cat is roaring ➜ 🐶 A dog is `<Smot>` / 🐯 A tiger is `<Smot>`



😎 A man is skiing ➜ 🐻 A bear is `<Smot>` / 🐭 Mickey-Mouse is `<Smot>`




SAVE reframes the video editing task as a motion inversion problem: it searches the textual embedding space for a motion word `<Smot>` that faithfully represents the motion in a source video. Editing is then achieved by isolating the motion from the single source video into `<Smot>` and modifying the protagonist accordingly.
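
To give a concrete picture of what "finding `<Smot>` in textual embedding space" involves mechanically, the sketch below registers the pseudo word as a new trainable token of the CLIP text encoder, in the spirit of textual inversion. It uses the Hugging Face `transformers` API and is a simplified illustration, not the repository's training code:

```
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Register the motion pseudo word and allocate an embedding slot for it.
tokenizer.add_tokens(["<Smot>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# Initialize the new embedding from an ordinary motion word (e.g., "roaring");
# the training stage then optimizes this embedding on the source video.
init_id = tokenizer.encode("roaring", add_special_tokens=False)[0]
smot_id = tokenizer.convert_tokens_to_ids("<Smot>")
with torch.no_grad():
    embeddings = text_encoder.get_input_embeddings().weight
    embeddings[smot_id] = embeddings[init_id].clone()
```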

## Setup
### Requirements
```
pip install -r requirements.txt
```

### Weights
We use [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) as our base text-to-image model and fine-tune it on a reference video for text-to-video generation. Example video weights are available on [Google Drive](https://drive.google.com/drive/folders/1ytqzQ7aKBiiSQxDSbDPn2i-6zwdbUFsw).
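
For reference, a minimal `diffusers` snippet for pulling these base weights might look as follows (the repository's own scripts may load them differently):

```
import torch
from diffusers import StableDiffusionPipeline

# Download the Stable Diffusion v1-4 weights that fine-tuning starts from.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe.to("cuda")
```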

### Training
To fine-tune the text-to-image diffusion models on a custom video, run this command:
```
python run_train.py --config configs/<name>-train.yaml
```
The configuration file `<name>-train.yaml` contains the following arguments (an illustrative example is sketched after the list):
* `output_dir` - Directory to save the weights.
* `placeholder_tokens` - Pseudo words separated by `|`: an appearance word followed by the motion word `<Smot>`.
* `initializer_tokens` - Initialization words separated by `|` e.g., `cat|roaring`.
* `sentence_component` - Component tag for each pseudo word, separated by `|`, marking whether it is an appearance word or a motion word.
* `num_s1_train_epochs` - Number of epochs for appearance pre-registration.
* `exp_localization_weight` - Weight for the cross-attention loss (recommended range is 1e-4 to 5e-4).
* `train_data: video_path` - Path to the source video.
* `train_data: prompt` - Source prompt that includes the pseudo words in `placeholder_tokens`, e.g., `a cat is <Smot>`.
* `n_sample_frames` - Number of frames sampled from the source video.
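
To make the list above concrete, here is a hypothetical training configuration. Every path, value, and name in it (including the pseudo words `<cat>`/`<Smot>` and the `noun`/`verb` tags) is an illustrative assumption, not copied from the repository's actual configs:

```
# Hypothetical <name>-train.yaml sketch: all names and values are assumptions;
# consult the configs shipped in this repository for the exact schema.
output_dir: "./outputs/cat-roaring"
placeholder_tokens: "<cat>|<Smot>"      # appearance word | motion word (names assumed)
initializer_tokens: "cat|roaring"       # initializes the appearance and motion embeddings
sentence_component: "noun|verb"         # marks appearance vs. motion words (tags assumed)
num_s1_train_epochs: 100                # appearance pre-registration epochs
exp_localization_weight: 2.0e-4         # cross-attention loss weight (1e-4 to 5e-4 recommended)
train_data:
  video_path: "data/cat-roaring.mp4"
  prompt: "a cat is <Smot>"
n_sample_frames: 8                      # frames sampled from the source video
```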

## Video Editing
Once the updated weights are prepared, run this command:
```
python run_inference.py --config configs/<name>-inference.yaml
```
The configuration file `<name>-inference.yaml` contains the following arguments (an illustrative example is sketched after the list):
* `pretrained_model_path` - Directory containing the fine-tuned weights saved during training.
* `image_path` - Path to the source video.
* `placeholder_tokens` - Pseudo words separated by `|`: an appearance word followed by the motion word `<Smot>`.
* `sentence_component` - Component tag for each pseudo word, separated by `|`, marking whether it is an appearance word or a motion word.
* `prompt` - Source prompt that includes the pseudo words in `placeholder_tokens`, e.g., `a cat is <Smot>`.
* `prompts` - List of source and editing prompts, e.g., [`a cat is <Smot>`, `a dog is <Smot>`].
* `blend_word` - List of protagonists in the source and edited videos e.g., [`cat`, `dog`].
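
As with training, the following hypothetical inference configuration only illustrates how the arguments fit together; every path, name, and value in it is an assumption rather than the repository's actual config:

```
# Hypothetical <name>-inference.yaml sketch: all names and values are assumptions;
# consult the configs shipped in this repository for the exact schema.
pretrained_model_path: "./outputs/cat-roaring"   # output_dir from training
image_path: "data/cat-roaring.mp4"
placeholder_tokens: "<cat>|<Smot>"               # appearance word | motion word (names assumed)
sentence_component: "noun|verb"                  # marks appearance vs. motion words (tags assumed)
prompt: "a cat is <Smot>"                        # source prompt
prompts:
  - "a cat is <Smot>"                            # reconstruct the source video
  - "a dog is <Smot>"                            # edit: new protagonist, same motion
blend_word:
  - "cat"
  - "dog"
```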

## Citation

```
@inproceedings{song2025save,
title={Save: Protagonist diversification with structure agnostic video editing},
author={Song, Yeji and Shin, Wonsik and Lee, Junsoo and Kim, Jeesoo and Kwak, Nojun},
booktitle={European Conference on Computer Vision},
pages={41--57},
year={2025},
organization={Springer}
}
```

## Acknowledgements
This code builds upon [diffusers](https://github.com/huggingface/diffusers), [Tune-A-Video](https://github.com/showlab/Tune-A-Video) and [Video-P2P](https://github.com/dvlab-research/Video-P2P). Thank you for open-sourcing!