Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
[arXiv22] Disentangled Representation Learning for Text-Video Retrieval
- Host: GitHub
- URL: https://github.com/foolwood/drl
- Owner: foolwood
- License: apache-2.0
- Created: 2022-04-07T03:18:54.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-04-07T05:36:40.000Z (over 2 years ago)
- Last Synced: 2023-03-06T20:53:04.887Z (over 1 year ago)
- Topics: clip, interaction-nets, text-video-search-engine, transformer, video-retrieval
- Language: Python
- Homepage:
- Size: 6.04 MB
- Stars: 51
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Disentangled Representation Learning for Text-Video Retrieval
[![MSR-VTT](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/disentangled-representation-learning-for-text/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=disentangled-representation-learning-for-text)
[![DiDeMo](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/disentangled-representation-learning-for-text/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=disentangled-representation-learning-for-text)

This is a PyTorch implementation of the paper [Disentangled Representation Learning for Text-Video Retrieval](https://arxiv.org/abs/2203.07111):
```
@Article{DRLTVR2022,
author = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},
journal = {arXiv:2203.07111},
title = {Disentangled Representation Learning for Text-Video Retrieval},
year = {2022},
}
```

### Catalog
- [x] Setup
- [x] Fine-tuning code
- [x] Visualization demo

### Setup
#### Setup code environment
```shell
git clone https://github.com/foolwood/DRL.git
cd DRL
conda create -n drl python=3.9
conda activate drl
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
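# Optional sanity check (not in the original README): confirm the pinned
# torch build imports and can see CUDA
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"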
```

#### Download CLIP Model (as pretraining)
```shell
cd tvr/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
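# Optional sanity check (not in the original README): CLIP checkpoints are
# TorchScript archives, so they should load with torch.jit.load
python -c "import torch; torch.jit.load('ViT-B-32.pt', map_location='cpu'); print('ok')"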
```

#### Download Datasets
```shell
cd data/MSR-VTT
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip ; unzip MSRVTT.zip
mv MSRVTT/videos/all ./videos ; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json
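# Optional sanity check (not in the original README): confirm the expected
# layout of videos and annotations after the moves above
ls videos | head -n 3
python -c "import json; json.load(open('anns/MSRVTT_data.json')); print('ok')"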
```

### Fine-tuning code
- Train on MSR-VTT 1k.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
main.py --do_train 1 --workers 8 --n_display 50 \
--epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \
--anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \
--max_words 32 --max_frames 12 --video_framerate 1 \
--base_encoder ViT-B/32 --agg_module seqTransf \
--interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \
--output_dir ckpts/ckpt_msrvtt_wti_cdcr
```
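For reference, `--interaction wti` selects the paper's weighted token-wise interaction (WTI). The sketch below illustrates the idea; the shapes and names are my own choices, not the repository's actual module:

```python
import torch
import torch.nn.functional as F

def wti_similarity(text_feat, video_feat, text_weight, video_weight):
    """text_feat: (Bt, N, D) token features; video_feat: (Bv, M, D) frame
    features; text_weight: (Bt, N) and video_weight: (Bv, M) are
    softmax-normalized token/frame weights."""
    text_feat = F.normalize(text_feat, dim=-1)
    video_feat = F.normalize(video_feat, dim=-1)
    # sims[i, j, n, m] = cosine similarity of text i's token n and video j's frame m
    sims = torch.einsum("ind,jmd->ijnm", text_feat, video_feat)
    # Text-to-video: every token picks its best-matching frame, combined with token weights.
    t2v = (sims.max(dim=-1).values * text_weight.unsqueeze(1)).sum(dim=-1)
    # Video-to-text: every frame picks its best-matching token, combined with frame weights.
    v2t = (sims.max(dim=-2).values * video_weight.unsqueeze(0)).sum(dim=-1)
    return (t2v + v2t) / 2  # (Bt, Bv) retrieval logits

# Example with random features (batch 8, 32 tokens, 12 frames, dim 512):
logits = wti_similarity(torch.randn(8, 32, 512), torch.randn(8, 12, 512),
                        torch.softmax(torch.randn(8, 32), -1),
                        torch.softmax(torch.randn(8, 12), -1))
```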
To reproduce the ablation experiments, use the provided [scripts](scripts/msrvtt.sh).
Ablation results on MSR-VTT 1k (T2V = text-to-video retrieval; V2T = video-to-text retrieval; R@K is recall at K, higher is better; MdR/MnR are median/mean rank, lower is better):

| configs | feature | gpus | T2V R@1 | T2V R@5 | T2V R@10 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | V2T MnR | train time (h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4Clip | ViT/B-32 | 4 | 42.8 | 72.1 | 81.4 | 2.0 | 16.3 | 44.1 | 70.5 | 80.5 | 2.0 | 11.8 | 10.5 |
| zero-shot | ViT/B-32 | 4 | 31.1 | 53.7 | 63.4 | 4.0 | 41.6 | 26.5 | 50.1 | 61.7 | 5.0 | 39.9 | - |
| **Interaction** | | | | | | | | | | | | | |
| DP+None | ViT/B-32 | 4 | 42.9 | 70.6 | 81.4 | 2.0 | 15.4 | 43.0 | 71.1 | 81.1 | 2.0 | 11.8 | 2.5 |
| DP+seqTransf | ViT/B-32 | 4 | 42.8 | 71.1 | 81.1 | 2.0 | 15.6 | 44.1 | 70.9 | 80.9 | 2.0 | 11.7 | 2.6 |
| XTI+None | ViT/B-32 | 4 | 40.5 | 71.1 | 82.6 | 2.0 | 13.6 | 42.7 | 70.8 | 80.2 | 2.0 | 12.5 | 14.3 |
| XTI+seqTransf | ViT/B-32 | 4 | 42.4 | 71.3 | 80.9 | 2.0 | 15.2 | 40.1 | 69.2 | 79.6 | 2.0 | 15.8 | 16.8 |
| TI+seqTransf | ViT/B-32 | 4 | 44.8 | 73.0 | 82.2 | 2.0 | 13.4 | 42.6 | 72.7 | 82.8 | 2.0 | 9.1 | 2.6 |
| WTI+seqTransf | ViT/B-32 | 4 | 46.6 | 73.4 | 83.5 | 2.0 | 13.0 | 45.4 | 73.4 | 81.9 | 2.0 | 9.2 | 2.6 |
| **Channel DeCorrelation Regularization** | | | | | | | | | | | | | |
| DP+seqTransf+CDCR | ViT/B-32 | 4 | 43.9 | 71.1 | 81.2 | 2.0 | 15.3 | 42.3 | 70.3 | 81.1 | 2.0 | 11.4 | 2.6 |
| TI+seqTransf+CDCR | ViT/B-32 | 4 | 45.8 | 73.0 | 81.9 | 2.0 | 12.8 | 43.3 | 71.8 | 82.7 | 2.0 | 8.9 | 2.6 |
| WTI+seqTransf+CDCR | ViT/B-32 | 4 | 47.6 | 73.4 | 83.3 | 2.0 | 12.8 | 45.1 | 72.9 | 83.5 | 2.0 | 9.2 | 2.6 |
Note: these numbers are slightly better than those reported in the paper due to new hyperparameters.
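For the CDCR rows, `--cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001` enables channel decorrelation regularization. Below is a rough sketch of a cross-correlation-style decorrelation loss; the exact form and the role of each coefficient are my assumptions, not the repository's implementation:

```python
import torch

def cdcr_loss(text_feat, video_feat, alpha1=0.11, alpha2=0.0):
    """text_feat, video_feat: (B, D) features of matched text-video pairs.
    Assumed form, not the repo's exact loss."""
    B = text_feat.shape[0]
    # Standardize each channel across the batch.
    t = (text_feat - text_feat.mean(0)) / (text_feat.std(0) + 1e-6)
    v = (video_feat - video_feat.mean(0)) / (video_feat.std(0) + 1e-6)
    c = (t.T @ v) / B  # (D, D) cross-correlation between channels
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # pull matched channels toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push mismatched channels toward 0
    return alpha1 * on_diag + alpha2 * off_diag
```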
### Visualization demo
Run our visualization demo using [matplotlib](demo/show_wti.py) (no GPU needed).
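The snippet below is not `demo/show_wti.py` itself, just a self-contained illustration of plotting a token-frame weight matrix with matplotlib (the tokens and similarities are made up):

```python
import matplotlib.pyplot as plt
import numpy as np

tokens = ["a", "man", "plays", "guitar"]  # hypothetical caption tokens
sims = np.random.rand(len(tokens), 12)    # token x frame similarity matrix

fig, ax = plt.subplots(figsize=(6, 2))
im = ax.imshow(sims, aspect="auto", cmap="viridis")
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("video frame")
fig.colorbar(im, ax=ax, label="similarity")
plt.tight_layout()
plt.show()
```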
### License
See [LICENSE](LICENSE) for details.
### Acknowledgments
Our code is partly based on [CLIP4Clip](https://github.com/ArrowLuo/CLIP4Clip).