# $\boldsymbol{R^2}$-Tuning

[![arXiv](https://badgen.net/badge/arXiv/2404.00801/red?cache=300)](https://arxiv.org/abs/2404.00801)
[![License](https://badgen.net/badge/License/BSD%203-Clause%20License?color=blue&cache=300)](https://github.com/yeliudev/R2-Tuning/blob/main/LICENSE)
[![Hugging Face Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/yeliudev/R2-Tuning)

[**Installation**](#-installation) | [**Dataset**](#-dataset) | [**Training**](#-training) | [**Evaluation**](#-evaluation) | [**Model Zoo**](#-model-zoo)

This repository maintains the official implementation of the paper **$\boldsymbol{R^2}$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding** by [Ye Liu](https://yeliu.dev/), [Jixuan He](https://openreview.net/profile?id=~Jixuan_He1), [Wanhua Li](https://li-wanhua.github.io/), [Junsik Kim](https://sites.google.com/site/jskimcv/), [Donglai Wei](https://donglaiw.github.io/), [Hanspeter Pfister](https://vcg.seas.harvard.edu/people/), and [Chang Wen Chen](https://web.comp.polyu.edu.hk/chencw/).

## 🔥 News

- **[2024.7.2]** Our paper has been accepted by ECCV 2024.
- **[2024.6.16]** Check out our [online demo](https://huggingface.co/spaces/yeliudev/R2-Tuning) on 🤗 Hugging Face Spaces.
- **[2024.6.15]** Add support for [single video inference](#-single-video-inference).
- **[2024.4.16]** Code and dataset release.
- **[2024.3.31]** Our tech report is available on [arXiv](https://arxiv.org/abs/2404.00801).

## 🔨 Installation

Please refer to the environment settings we use below. You may install these packages manually if you encounter any problems during the automatic installation.

- CUDA 12.1
- FFmpeg 6.0
- Python 3.12.2
- PyTorch 2.2.1
- [NNCore](https://github.com/yeliudev/nncore) 0.4.2

### Install from source

1. Clone the repository from GitHub.

```shell
git clone https://github.com/yeliudev/R2-Tuning.git
cd R2-Tuning
```

2. Initialize the conda environment.

```shell
conda create -n r2-tuning python=3.12 -y
conda activate r2-tuning
```

3. Install dependencies.

```shell
pip install -r requirements.txt
```

## 🔖 Dataset

#### Option 1 [Recommended]: Download pre-extracted features from [HuggingFace Hub](https://huggingface.co/yeliudev/R2-Tuning) directly.

```shell
# Prepare datasets in one command
bash tools/prepare_data.sh
```

#### Option 2: Reproduce our data pre-processing pipeline.

1. Download videos from the following links and place them into `data/{dataset}/videos`.

- [QVHighlights](https://nlp.cs.unc.edu/data/jielei/qvh/qvhilights_videos.tar.gz)
- [Ego4D-NLQ](https://ego4d-data.org/)
- [Charades-STA](https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1.zip)
- [TACoS](https://datasets.d2.mpi-inf.mpg.de/MPII-Cooking-2/MPII-Cooking-2-videos.tar.gz)
- [YouTube Highlights](https://github.com/aliensunmin/DomainSpecificHighlight)
- [TVSum](https://people.csail.mit.edu/yalesong/tvsum/tvsum50_ver_1_1.tgz)

2. Extract and compress video frames at a fixed frame rate.

```shell
# For QVHighlights, Ego4D-NLQ, TACoS, and TVSum
python tools/extract_frames.py <path-to-videos>

# For Charades-STA
python tools/extract_frames.py <path-to-videos> --fps 1.0

# For YouTube Highlights
python tools/extract_frames.py <path-to-videos> --anno_path data/youtube/youtube_anno.json
```

**Arguments of `tools/extract_frames.py`**

- `video_dir` Path to the videos folder
- `--anno_path` Path to the annotation file (only for YouTube Highlights to compute frame rates)
- `--frame_dir` Path to the output extracted frames
- `--size` Side length of the cropped video frames
- `--fps` Frame rate to be used
- `--max_len` The maximum length of each video segment
- `--workers` Number of processes
- `--chunksize` The chunk size for each process
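
For example, the Charades-STA command above expands to something like the following sketch. The paths are placeholders chosen to match the dataset layout shown later in this section, and the option values mirror the defaults implied by the folder names (224px frames at 1 FPS).

```shell
# Extract 224px frames at 1 FPS using 8 worker processes (paths are placeholders)
python tools/extract_frames.py data/charades/videos \
    --frame_dir data/charades/frames_224_1.0fps \
    --size 224 \
    --fps 1.0 \
    --workers 8
```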

3. Extract features from video frames.

```shell
python tools/extract_feat.py <path-to-anno> <path-to-frames>
```

**Arguments of `tools/extract_feat.py`**

- `anno_path` Path to the annotation file
- `frame_dir` Path to the extracted frames
- `--video_feat_dir` Path to the output video features
- `--query_feat_dir` Path to the output query features
- `--arch` CLIP architecture to use (`ViT-B/32`, `ViT-B/16`, `ViT-L/14`, `ViT-L/14-336px`)
- `--k` Save the last `k` layers features
- `--batch_size` The batch size to use
- `--workers` Number of workers for data loader
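
As an illustration, extracting CLIP ViT-B/32 features for Charades-STA might look like the sketch below. The annotation, frame, and output paths are placeholders based on the dataset layout that follows; `--k 4` matches the `clip_b32_{vid,txt}_k4` folder names.

```shell
# Extract video and query features from the last 4 CLIP layers (paths are placeholders)
python tools/extract_feat.py data/charades/charades_train.jsonl data/charades/frames_224_1.0fps \
    --video_feat_dir data/charades/clip_b32_vid_k4 \
    --query_feat_dir data/charades/clip_b32_txt_k4 \
    --arch ViT-B/32 \
    --k 4
```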

#### The prepared dataset should have the following structure.

```
R2-Tuning
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── qvhighlights_{train,val,test}.jsonl
│   ├── ego4d
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── nlq_{train,val}.jsonl
│   ├── charades
│   │   ├── frames_224_1.0fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── charades_{train,test}.jsonl
│   ├── tacos
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── {train,val,test}.jsonl
│   ├── youtube
│   │   ├── frames_224_auto (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── frames_224_0.5fps (optional)
│       ├── clip_b32_{vid,txt}_k4
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···
```

## 🔮 Training

Use the following commands to train a model with a specified config.

```shell
# Single GPU
python tools/launch.py <path-to-config>

# Multiple GPUs on a single node (elastic)
torchrun --nproc_per_node=<num-gpus> tools/launch.py <path-to-config>

# Multiple GPUs on multiple nodes (slurm)
srun python tools/launch.py <path-to-config>
```

**Arguments of `tools/launch.py`**

- `config` The config file to use
- `--checkpoint` The checkpoint file to load from
- `--resume` The checkpoint file to resume from
- `--work_dir` Working directory
- `--eval` Evaluation only
- `--dump` Dump inference outputs
- `--seed` The random seed to use
- `--amp` Whether to use automatic mixed precision training
- `--debug` Debug mode (detect `nan` during training)
- `--launcher` The job launcher to use

Please refer to the [configs](https://github.com/yeliudev/R2-Tuning/tree/main/configs) folder for detailed settings of each model.
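
For example, a single-GPU run on QVHighlights with mixed precision and a fixed seed might look like this sketch (the working directory name is arbitrary):

```shell
# Train on QVHighlights with AMP and a fixed random seed
python tools/launch.py configs/qvhighlights/r2_tuning_qvhighlights.py \
    --work_dir work_dirs/r2_tuning_qvhighlights \
    --seed 42 \
    --amp
```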

## ๐Ÿ† Evaluation

Use the following command to test a model and evaluate results.

```shell
python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --eval
```

For QVHighlights, you may also dump inference outputs on `val` and `test` splits.

```shell
python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --dump
```

Then you can pack the `hl_{val,test}_submission.jsonl` files and submit them to [CodaLab](https://codalab.lisn.upsaclay.fr/competitions/6937).
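
One way to pack the dumped predictions, assuming the server accepts a flat zip of the two files (please check the competition page for the exact submission format):

```shell
# Pack the dumped predictions for submission (flat zip layout is an assumption)
zip submission.zip hl_val_submission.jsonl hl_test_submission.jsonl
```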

## 💻 Single Video Inference

> [!WARNING]
> This feature is only compatible with `nncore==0.4.4`.

Use the following command to perform moment retrieval using your own videos and queries.

```shell
# Make sure you are using the correct version
pip install nncore==0.4.4

python tools/inference.py <path-to-video> <query> [--config <path-to-config> --checkpoint <path-to-checkpoint>]
```

The [checkpoint](https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_qvhighlights-ed516355.pth) trained on QVHighlights using this [config](https://github.com/yeliudev/R2-Tuning/tree/main/configs/qvhighlights/r2_tuning_qvhighlights.py) will be downloaded by default.
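
A minimal usage sketch, assuming `inference.py` takes the video path and text query as positional arguments (as in the usage above) and that the default config and checkpoint are fetched automatically. The video path and query below are hypothetical:

```shell
# Retrieve the moment matching a text query in a local video (hypothetical inputs)
python tools/inference.py demo.mp4 "a person is playing with a dog"
```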

## 📦 Model Zoo

We provide multiple pre-trained models and training logs here. All models were trained on a single NVIDIA A100 80GB GPU and evaluated using the default metrics of their respective datasets.


| Dataset | Config | R1@0.3 | R1@0.5 | R1@0.7 | MR mAP | HD mAP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QVHighlights | Default | 78.71 | 67.74 | 51.87 | 47.86 | 39.45 | model \| log |
| Ego4D-NLQ | Default | 7.18 | 4.54 | 2.25 | — | — | model \| log |
| Charades-STA | Default | 70.91 | 60.48 | 38.66 | — | — | model \| log |
| TACoS | Default | 50.96 | 40.69 | 25.69 | — | — | model \| log |
| YouTube Highlights | Dog | — | — | — | — | 74.26 | model \| log |
| | Gymnastics | — | — | — | — | 72.07 | model \| log |
| | Parkour | — | — | — | — | 81.02 | model \| log |
| | Skating | — | — | — | — | 76.26 | model \| log |
| | Skiing | — | — | — | — | 74.36 | model \| log |
| | Surfing | — | — | — | — | 82.76 | model \| log |
| TVSum | BK | — | — | — | — | 91.23 | model \| log |
| | BT | — | — | — | — | 92.35 | model \| log |
| | DS | — | — | — | — | 80.88 | model \| log |
| | FM | — | — | — | — | 75.61 | model \| log |
| | GA | — | — | — | — | 89.51 | model \| log |
| | MS | — | — | — | — | 85.01 | model \| log |
| | PK | — | — | — | — | 82.82 | model \| log |
| | PR | — | — | — | — | 90.39 | model \| log |
| | VT | — | — | — | — | 89.81 | model \| log |
| | VU | — | — | — | — | 85.90 | model \| log |

## 📖 Citation

Please kindly cite our paper if you find this project helpful.

```bibtex
@inproceedings{liu2024tuning,
  title={$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding},
  author={Liu, Ye and He, Jixuan and Li, Wanhua and Kim, Junsik and Wei, Donglai and Pfister, Hanspeter and Chen, Chang Wen},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2024}
}
```