R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024)
- Host: GitHub
- URL: https://github.com/yeliudev/r2-tuning
- Owner: yeliudev
- License: bsd-3-clause
- Created: 2024-04-02T06:00:18.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-02T04:19:07.000Z (11 months ago)
- Last Synced: 2025-01-17T04:10:29.656Z (4 months ago)
- Language: Python
- Homepage: http://arxiv.org/abs/2404.00801
- Size: 607 KB
- Stars: 71
- Watchers: 7
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# $\boldsymbol{R^2}$-Tuning
[arXiv](https://arxiv.org/abs/2404.00801) | [License](https://github.com/yeliudev/R2-Tuning/blob/main/LICENSE) | [Hugging Face Demo](https://huggingface.co/spaces/yeliudev/R2-Tuning)

[**Installation**](#-installation) | [**Dataset**](#-dataset) | [**Training**](#-training) | [**Evaluation**](#-evaluation) | [**Model Zoo**](#-model-zoo)
This repository maintains the official implementation of the paper **$\boldsymbol{R^2}$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding** by [Ye Liu](https://yeliu.dev/), [Jixuan He](https://openreview.net/profile?id=~Jixuan_He1), [Wanhua Li](https://li-wanhua.github.io/), [Junsik Kim](https://sites.google.com/site/jskimcv/), [Donglai Wei](https://donglaiw.github.io/), [Hanspeter Pfister](https://vcg.seas.harvard.edu/people/), and [Chang Wen Chen](https://web.comp.polyu.edu.hk/chencw/).
## News
- **[2024.7.2]** Our paper has been accepted by ECCV 2024.
- **[2024.6.16]** Check out our [online demo](https://huggingface.co/spaces/yeliudev/R2-Tuning) on 🤗 Hugging Face Spaces.
- **[2024.6.15]** Add support for [single video inference](#-single-video-inference).
- **[2024.4.16]** Code and dataset release.
- **[2024.3.31]** Our tech report is available on [arXiv](https://arxiv.org/abs/2404.00801).

## Installation
Please refer to the environment settings we use below. You may need to install these packages manually if you run into problems during automatic installation.
- CUDA 12.1
- FFmpeg 6.0
- Python 3.12.2
- PyTorch 2.2.1
- [NNCore](https://github.com/yeliudev/nncore) 0.4.2

### Install from source
1. Clone the repository from GitHub.
```shell
git clone https://github.com/yeliudev/R2-Tuning.git
cd R2-Tuning
```

2. Initialize conda environment.
```shell
conda create -n r2-tuning python=3.12 -y
conda activate r2-tuning
```

3. Install dependencies.
```shell
pip install -r requirements.txt
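# Optional sanity check (not part of the original instructions): confirm that
# PyTorch is importable and CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"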
```

## Dataset
#### Option 1 [Recommended]: Download pre-extracted features from [HuggingFace Hub](https://huggingface.co/yeliudev/R2-Tuning) directly.
```shell
# Prepare datasets in one command
bash tools/prepare_data.sh
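# The script is expected to populate data/ with the per-dataset layout shown in
# the directory tree below (pre-extracted CLIP features and annotation files)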
```

#### Option 2: Reproduce our data pre-processing pipeline.
1. Download videos from the following links and place them into `data/{dataset}/videos`.
- [QVHighlights](https://nlp.cs.unc.edu/data/jielei/qvh/qvhilights_videos.tar.gz)
- [Ego4D-NLQ](https://ego4d-data.org/)
- [Charades-STA](https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1.zip)
- [TACoS](https://datasets.d2.mpi-inf.mpg.de/MPII-Cooking-2/MPII-Cooking-2-videos.tar.gz)
- [YouTube Highlights](https://github.com/aliensunmin/DomainSpecificHighlight)
- [TVSum](https://people.csail.mit.edu/yalesong/tvsum/tvsum50_ver_1_1.tgz)

2. Extract and compress video frames at a fixed frame rate.
```shell
# For QVHighlights, Ego4D-NLQ, TACoS, and TVSum
python tools/extract_frames.py

# For Charades-STA
python tools/extract_frames.py --fps 1.0

# For YouTube Highlights
python tools/extract_frames.py --anno_path data/youtube/youtube_anno.json
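
# Illustrative fuller invocation (paths and values below are assumptions that match
# the frames_224_0.5fps layout shown later; adjust them to your own setup)
# python tools/extract_frames.py data/qvhighlights/videos \
#   --frame_dir data/qvhighlights/frames_224_0.5fps --size 224 --fps 0.5 --workers 8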
```

Arguments of `tools/extract_frames.py`:
- `video_dir` Path to the videos folder
- `--anno_path` Path to the annotation file (only for YouTube Highlights to compute frame rates)
- `--frame_dir` Path to the output extracted frames
- `--size` Side length of the cropped video frames
- `--fps` Frame rate to be used
- `--max_len` The maximum length of each video segment
- `--workers` Number of processes
- `--chunksize` The chunk size for each process

3. Extract features from video frames.
```shell
python tools/extract_feat.py
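
# Illustrative fuller invocation (paths are assumptions; ViT-B/32 with k=4 matches
# the clip_b32_{vid,txt}_k4 folder names shown below)
# python tools/extract_feat.py data/qvhighlights/qvhighlights_train.jsonl data/qvhighlights/frames_224_0.5fps \
#   --video_feat_dir data/qvhighlights/clip_b32_vid_k4 \
#   --query_feat_dir data/qvhighlights/clip_b32_txt_k4 \
#   --arch ViT-B/32 --k 4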
```

Arguments of `tools/extract_feat.py`:
- `anno_path` Path to the annotation file
- `frame_dir` Path to the extracted frames
- `--video_feat_dir` Path to the output video features
- `--query_feat_dir` Path to the output query features
- `--arch` CLIP architecture to use (`ViT-B/32`, `ViT-B/16`, `ViT-L/14`, `ViT-L/14-336px`)
- `--k` Save the last `k` layers features
- `--batch_size` The batch size to use
- `--workers` Number of workers for data loader

#### The prepared dataset should be in the following structure.
```
R2-Tuning
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── qvhighlights_{train,val,test}.jsonl
│   ├── ego4d
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── nlq_{train,val}.jsonl
│   ├── charades
│   │   ├── frames_224_1.0fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── charades_{train,test}.jsonl
│   ├── tacos
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── {train,val,test}.jsonl
│   ├── youtube
│   │   ├── frames_224_auto (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── frames_224_0.5fps (optional)
│       ├── clip_b32_{vid,txt}_k4
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···
```

## Training
Use the following commands to train a model with a specified config.
```shell
# Single GPU
python tools/launch.py <config>

# Multiple GPUs on a single node (elastic)
torchrun --nproc_per_node=<num_gpus> tools/launch.py <config>

# Multiple GPUs on multiple nodes (slurm)
srun python tools/launch.py <config>
```

Arguments of `tools/launch.py`:
- `config` The config file to use
- `--checkpoint` The checkpoint file to load from
- `--resume` The checkpoint file to resume from
- `--work_dir` Working directory
- `--eval` Evaluation only
- `--dump` Dump inference outputs
- `--seed` The random seed to use
- `--amp` Whether to use automatic mixed precision training
- `--debug` Debug mode (detect `nan` during training)
- `--launcher` The job launcher to use

Please refer to the [configs](https://github.com/yeliudev/R2-Tuning/tree/main/configs) folder for detailed settings of each model.
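
For example, a single-GPU run with mixed precision might look like the sketch below. The config path is one from this repository's configs folder; the `--work_dir` value is an arbitrary example, not a path mandated by the project.

```shell
# Train on QVHighlights with AMP, writing logs and checkpoints to a custom directory
python tools/launch.py configs/qvhighlights/r2_tuning_qvhighlights.py --amp --work_dir work_dirs/qvhighlights
```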
## Evaluation
Use the following command to test a model and evaluate results.
```
python tools/launch.py <config> --checkpoint <checkpoint> --eval
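
# Illustrative example (config path from this repo; the checkpoint path is a placeholder)
# python tools/launch.py configs/qvhighlights/r2_tuning_qvhighlights.py --checkpoint /path/to/checkpoint.pth --eval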
```

For QVHighlights, you may also dump inference outputs on `val` and `test` splits.
```
python tools/launch.py <config> --checkpoint <checkpoint> --dump
```

Then you can pack the `hl_{val,test}_submission.jsonl` files and submit them to [CodaLab](https://codalab.lisn.upsaclay.fr/competitions/6937).
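
A minimal packing sketch is shown below; it only zips the two files named above, so please double-check the archive layout the CodaLab competition expects before submitting.

```shell
# Bundle the dumped predictions for submission (file names as produced by --dump)
zip -j submission.zip hl_val_submission.jsonl hl_test_submission.jsonl
```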
## Single Video Inference
> [!WARNING]
> This feature is only compatible with `nncore==0.4.4`.

Use the following command to perform moment retrieval using your own videos and queries.
```
# Make sure you are using the correct version
pip install nncore==0.4.4

python tools/inference.py <video> <query> [--config <config> --checkpoint <checkpoint>]
```

The [checkpoint](https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_qvhighlights-ed516355.pth) trained on QVHighlights using this [config](https://github.com/yeliudev/R2-Tuning/tree/main/configs/qvhighlights/r2_tuning_qvhighlights.py) will be downloaded by default.
## Model Zoo
We provide multiple pre-trained models and training logs here. All models were trained on a single NVIDIA A100 (80GB) GPU and evaluated using the default metrics of each dataset.
| Dataset | Config | R1@0.3 | R1@0.5 | R1@0.7 | MR mAP | HD mAP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QVHighlights | Default | 78.71 | 67.74 | 51.87 | 47.86 | 39.45 | model \| log |
| Ego4D-NLQ | Default | 7.18 | 4.54 | 2.25 | - | - | model \| log |
| Charades-STA | Default | 70.91 | 60.48 | 38.66 | - | - | model \| log |
| TACoS | Default | 50.96 | 40.69 | 25.69 | - | - | model \| log |
| YouTube Highlights | Dog | - | - | - | - | 74.26 | model \| log |
| YouTube Highlights | Gymnastics | - | - | - | - | 72.07 | model \| log |
| YouTube Highlights | Parkour | - | - | - | - | 81.02 | model \| log |
| YouTube Highlights | Skating | - | - | - | - | 76.26 | model \| log |
| YouTube Highlights | Skiing | - | - | - | - | 74.36 | model \| log |
| YouTube Highlights | Surfing | - | - | - | - | 82.76 | model \| log |
| TVSum | BK | - | - | - | - | 91.23 | model \| log |
| TVSum | BT | - | - | - | - | 92.35 | model \| log |
| TVSum | DS | - | - | - | - | 80.88 | model \| log |
| TVSum | FM | - | - | - | - | 75.61 | model \| log |
| TVSum | GA | - | - | - | - | 89.51 | model \| log |
| TVSum | MS | - | - | - | - | 85.01 | model \| log |
| TVSum | PK | - | - | - | - | 82.82 | model \| log |
| TVSum | PR | - | - | - | - | 90.39 | model \| log |
| TVSum | VT | - | - | - | - | 89.81 | model \| log |
| TVSum | VU | - | - | - | - | 85.90 | model \| log |
## Citation
Please kindly cite our paper if you find this project helpful.
```bibtex
@inproceedings{liu2024tuning,
title={$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding},
author={Liu, Ye and He, Jixuan and Li, Wanhua and Kim, Junsik and Wei, Donglai and Pfister, Hanspeter and Chen, Chang Wen},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2024}
}
```