R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024)
- Host: GitHub
- URL: https://github.com/yeliudev/r2-tuning
- Owner: yeliudev
- License: bsd-3-clause
- Created: 2024-04-02T06:00:18.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-02T04:19:07.000Z (11 months ago)
- Last Synced: 2025-01-17T04:10:29.656Z (4 months ago)
- Language: Python
- Homepage: http://arxiv.org/abs/2404.00801
- Size: 607 KB
- Stars: 71
- Watchers: 7
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# $\boldsymbol{R^2}$-Tuning
[arXiv](https://arxiv.org/abs/2404.00801) | [License](https://github.com/yeliudev/R2-Tuning/blob/main/LICENSE) | [Hugging Face Demo](https://huggingface.co/spaces/yeliudev/R2-Tuning)

[**Installation**](#-installation) | [**Dataset**](#-dataset) | [**Training**](#-training) | [**Evaluation**](#-evaluation) | [**Model Zoo**](#-model-zoo)
This repository maintains the official implementation of the paper **$\boldsymbol{R^2}$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding** by [Ye Liu](https://yeliu.dev/), [Jixuan He](https://openreview.net/profile?id=~Jixuan_He1), [Wanhua Li](https://li-wanhua.github.io/), [Junsik Kim](https://sites.google.com/site/jskimcv/), [Donglai Wei](https://donglaiw.github.io/), [Hanspeter Pfister](https://vcg.seas.harvard.edu/people/), and [Chang Wen Chen](https://web.comp.polyu.edu.hk/chencw/).
## News
- **[2024.7.2]** Our paper has been accepted by ECCV 2024.
- **[2024.6.16]** Check out our [online demo](https://huggingface.co/spaces/yeliudev/R2-Tuning) on 🤗 Hugging Face Spaces.
- **[2024.6.15]** Add support for [single video inference](#-single-video-inference).
- **[2024.4.16]** Code and dataset release.
- **[2024.3.31]** Our tech report is available on [arXiv](https://arxiv.org/abs/2404.00801).

## Installation
Please refer to the environment settings we use below. You may need to install these packages manually if you run into problems during automatic installation.
- CUDA 12.1
- FFmpeg 6.0
- Python 3.12.2
- PyTorch 2.2.1
- [NNCore](https://github.com/yeliudev/nncore) 0.4.2

### Install from source
1. Clone the repository from GitHub.
```shell
git clone https://github.com/yeliudev/R2-Tuning.git
cd R2-Tuning
```

2. Initialize conda environment.
```shell
conda create -n r2-tuning python=3.12 -y
conda activate r2-tuning
```

3. Install dependencies.
```shell
pip install -r requirements.txt
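# Optional sanity check (not part of the original instructions): confirm that
# PyTorch is importable and CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"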
```

## Dataset
#### Option 1 [Recommended]: Download pre-extracted features from [HuggingFace Hub](https://huggingface.co/yeliudev/R2-Tuning) directly.
```shell
# Prepare datasets in one command
bash tools/prepare_data.sh
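# The script is expected to populate data/ with the per-dataset layout shown in
# the directory tree below (pre-extracted CLIP features and annotation files)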
```

#### Option 2: Reproduce our data pre-processing pipeline.
1. Download videos from the following links and place them into `data/{dataset}/videos`.
- [QVHighlights](https://nlp.cs.unc.edu/data/jielei/qvh/qvhilights_videos.tar.gz)
- [Ego4D-NLQ](https://ego4d-data.org/)
- [Charades-STA](https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1.zip)
- [TACoS](https://datasets.d2.mpi-inf.mpg.de/MPII-Cooking-2/MPII-Cooking-2-videos.tar.gz)
- [YouTube Highlights](https://github.com/aliensunmin/DomainSpecificHighlight)
- [TVSum](https://people.csail.mit.edu/yalesong/tvsum/tvsum50_ver_1_1.tgz)

2. Extract and compress video frames at a fixed frame rate.
```shell
# For QVHighlights, Ego4D-NLQ, TACoS, and TVSum
python tools/extract_frames.py

# For Charades-STA
python tools/extract_frames.py --fps 1.0

# For YouTube Highlights
python tools/extract_frames.py --anno_path data/youtube/youtube_anno.json
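
# Illustrative fuller invocation (paths and values below are assumptions that match
# the frames_224_0.5fps layout shown later; adjust them to your own setup)
# python tools/extract_frames.py data/qvhighlights/videos \
#   --frame_dir data/qvhighlights/frames_224_0.5fps --size 224 --fps 0.5 --workers 8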
```

Arguments of `tools/extract_frames.py`:
- `video_dir` Path to the videos folder
- `--anno_path` Path to the annotation file (only for YouTube Highlights to compute frame rates)
- `--frame_dir` Path to the output extracted frames
- `--size` Side length of the cropped video frames
- `--fps` Frame rate to be used
- `--max_len` The maximum length of each video segment
- `--workers` Number of processes
- `--chunksize` The chunk size for each process

3. Extract features from video frames.
```shell
python tools/extract_feat.py
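
# Illustrative fuller invocation (paths are assumptions; ViT-B/32 with k=4 matches
# the clip_b32_{vid,txt}_k4 folder names shown below)
# python tools/extract_feat.py data/qvhighlights/qvhighlights_train.jsonl data/qvhighlights/frames_224_0.5fps \
#   --video_feat_dir data/qvhighlights/clip_b32_vid_k4 \
#   --query_feat_dir data/qvhighlights/clip_b32_txt_k4 \
#   --arch ViT-B/32 --k 4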
```

Arguments of `tools/extract_feat.py`:
- `anno_path` Path to the annotation file
- `frame_dir` Path to the extracted frames
- `--video_feat_dir` Path to the output video features
- `--query_feat_dir` Path to the output query features
- `--arch` CLIP architecture to use (`ViT-B/32`, `ViT-B/16`, `ViT-L/14`, `ViT-L/14-336px`)
- `--k` Save the last `k` layers features
- `--batch_size` The batch size to use
- `--workers` Number of workers for data loader

#### The prepared dataset should be in the following structure.
```
R2-Tuning
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── qvhighlights_{train,val,test}.jsonl
│   ├── ego4d
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── nlq_{train,val}.jsonl
│   ├── charades
│   │   ├── frames_224_1.0fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── charades_{train,test}.jsonl
│   ├── tacos
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── {train,val,test}.jsonl
│   ├── youtube
│   │   ├── frames_224_auto (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── frames_224_0.5fps (optional)
│       ├── clip_b32_{vid,txt}_k4
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···
```

## Training
Use the following commands to train a model with a specified config.
```shell
# Single GPU
python tools/launch.py <config>

# Multiple GPUs on a single node (elastic)
torchrun --nproc_per_node=<num_gpus> tools/launch.py <config>

# Multiple GPUs on multiple nodes (slurm)
srun python tools/launch.py <config>
```

Arguments of `tools/launch.py`:
- `config` The config file to use
- `--checkpoint` The checkpoint file to load from
- `--resume` The checkpoint file to resume from
- `--work_dir` Working directory
- `--eval` Evaluation only
- `--dump` Dump inference outputs
- `--seed` The random seed to use
- `--amp` Whether to use automatic mixed precision training
- `--debug` Debug mode (detect `nan` during training)
- `--launcher` The job launcher to use

Please refer to the [configs](https://github.com/yeliudev/R2-Tuning/tree/main/configs) folder for detailed settings of each model.
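
For example, a single-GPU run with mixed precision might look like the sketch below. The config path is one from this repository's configs folder; the `--work_dir` value is an arbitrary example, not a path mandated by the project.

```shell
# Train on QVHighlights with AMP, writing logs and checkpoints to a custom directory
python tools/launch.py configs/qvhighlights/r2_tuning_qvhighlights.py --amp --work_dir work_dirs/qvhighlights
```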
## Evaluation
Use the following command to test a model and evaluate results.
```
python tools/launch.py <config> --checkpoint <checkpoint> --eval
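
# Illustrative example (config path from this repo; the checkpoint path is a placeholder)
# python tools/launch.py configs/qvhighlights/r2_tuning_qvhighlights.py --checkpoint /path/to/checkpoint.pth --eval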
```

For QVHighlights, you may also dump inference outputs on `val` and `test` splits.
```
python tools/launch.py <config> --checkpoint <checkpoint> --dump
```

Then you can pack the `hl_{val,test}_submission.jsonl` files and submit them to [CodaLab](https://codalab.lisn.upsaclay.fr/competitions/6937).
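
A minimal packing sketch is shown below; it only zips the two files named above, so please double-check the archive layout the CodaLab competition expects before submitting.

```shell
# Bundle the dumped predictions for submission (file names as produced by --dump)
zip -j submission.zip hl_val_submission.jsonl hl_test_submission.jsonl
```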
## Single Video Inference
> [!WARNING]
> This feature is only compatible with `nncore==0.4.4`.

Use the following command to perform moment retrieval using your own videos and queries.
```
# Make sure you are using the correct version
pip install nncore==0.4.4

python tools/inference.py <video> <query> [--config <config> --checkpoint <checkpoint>]
```

The [checkpoint](https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_qvhighlights-ed516355.pth) trained on QVHighlights using this [config](https://github.com/yeliudev/R2-Tuning/tree/main/configs/qvhighlights/r2_tuning_qvhighlights.py) will be downloaded by default.
## Model Zoo
We provide multiple pre-trained models and training logs here. All models were trained on a single NVIDIA A100 (80GB) GPU and evaluated using the default metrics of each dataset.
| Dataset | Config | R1@0.3 | R1@0.5 | R1@0.7 | MR mAP | HD mAP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QVHighlights | Default | 78.71 | 67.74 | 51.87 | 47.86 | 39.45 | model \| log |
| Ego4D-NLQ | Default | 7.18 | 4.54 | 2.25 | - | - | model \| log |
| Charades-STA | Default | 70.91 | 60.48 | 38.66 | - | - | model \| log |
| TACoS | Default | 50.96 | 40.69 | 25.69 | - | - | model \| log |
| YouTube Highlights | Dog | - | - | - | - | 74.26 | model \| log |
| YouTube Highlights | Gymnastics | - | - | - | - | 72.07 | model \| log |
| YouTube Highlights | Parkour | - | - | - | - | 81.02 | model \| log |
| YouTube Highlights | Skating | - | - | - | - | 76.26 | model \| log |
| YouTube Highlights | Skiing | - | - | - | - | 74.36 | model \| log |
| YouTube Highlights | Surfing | - | - | - | - | 82.76 | model \| log |
| TVSum | BK | - | - | - | - | 91.23 | model \| log |
| TVSum | BT | - | - | - | - | 92.35 | model \| log |
| TVSum | DS | - | - | - | - | 80.88 | model \| log |
| TVSum | FM | - | - | - | - | 75.61 | model \| log |
| TVSum | GA | - | - | - | - | 89.51 | model \| log |
| TVSum | MS | - | - | - | - | 85.01 | model \| log |
| TVSum | PK | - | - | - | - | 82.82 | model \| log |
| TVSum | PR | - | - | - | - | 90.39 | model \| log |
| TVSum | VT | - | - | - | - | 89.81 | model \| log |
| TVSum | VU | - | - | - | - | 85.90 | model \| log |
## Citation
Please kindly cite our paper if you find this project helpful.
```bibtex
@inproceedings{liu2024tuning,
title={$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding},
author={Liu, Ye and He, Jixuan and Li, Wanhua and Kim, Junsik and Wei, Donglai and Pfister, Hanspeter and Chen, Chang Wen},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2024}
}
```