Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
[arXiv22] Disentangled Representation Learning for Text-Video Retrieval
- Host: GitHub
- URL: https://github.com/foolwood/drl
- Owner: foolwood
- License: apache-2.0
- Created: 2022-04-07T03:18:54.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-04-07T05:36:40.000Z (over 2 years ago)
- Last Synced: 2023-03-06T20:53:04.887Z (over 1 year ago)
- Topics: clip, interaction-nets, text-video-search-engine, transformer, video-retrieval
- Language: Python
- Homepage:
- Size: 6.04 MB
- Stars: 51
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Disentangled Representation Learning for Text-Video Retrieval
[![MSR-VTT](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/disentangled-representation-learning-for-text/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=disentangled-representation-learning-for-text)
[![DiDeMo](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/disentangled-representation-learning-for-text/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=disentangled-representation-learning-for-text)

This is a PyTorch implementation of the paper [Disentangled Representation Learning for Text-Video Retrieval](https://arxiv.org/abs/2203.07111):
```
@Article{DRLTVR2022,
author = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},
journal = {arXiv:2203.07111},
title = {Disentangled Representation Learning for Text-Video Retrieval},
year = {2022},
}
```

### Catalog
- [x] Setup
- [x] Fine-tuning code
- [x] Visualization demo

### Setup
#### Setup code environment
```shell
git clone https://github.com/foolwood/DRL.git
cd DRL
conda create -n drl python=3.9
conda activate drl
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
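# Optional sanity check (not in the original README): confirm the pinned
# torch build imports and can see CUDA
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"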
```

#### Download CLIP Model (as pretraining)
```shell
cd tvr/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
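# Optional sanity check (not in the original README): CLIP checkpoints are
# TorchScript archives, so they should load with torch.jit.load
python -c "import torch; torch.jit.load('ViT-B-32.pt', map_location='cpu'); print('ok')"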
```

#### Download Datasets
```shell
cd data/MSR-VTT
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip ; unzip MSRVTT.zip
mv MSRVTT/videos/all ./videos ; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json
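# Optional sanity check (not in the original README): confirm the expected
# layout of videos and annotations after the moves above
ls videos | head -n 3
python -c "import json; json.load(open('anns/MSRVTT_data.json')); print('ok')"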
```

### Fine-tuning code
- Train on MSR-VTT 1k.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
main.py --do_train 1 --workers 8 --n_display 50 \
--epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \
--anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \
--max_words 32 --max_frames 12 --video_framerate 1 \
--base_encoder ViT-B/32 --agg_module seqTransf \
--interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \
--output_dir ckpts/ckpt_msrvtt_wti_cdcr
```
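For reference, `--interaction wti` selects the paper's weighted token-wise interaction (WTI). The sketch below illustrates the idea; the shapes and names are my own choices, not the repository's actual module:

```python
import torch
import torch.nn.functional as F

def wti_similarity(text_feat, video_feat, text_weight, video_weight):
    """text_feat: (Bt, N, D) token features; video_feat: (Bv, M, D) frame
    features; text_weight: (Bt, N) and video_weight: (Bv, M) are
    softmax-normalized token/frame weights."""
    text_feat = F.normalize(text_feat, dim=-1)
    video_feat = F.normalize(video_feat, dim=-1)
    # sims[i, j, n, m] = cosine similarity of text i's token n and video j's frame m
    sims = torch.einsum("ind,jmd->ijnm", text_feat, video_feat)
    # Text-to-video: every token picks its best-matching frame, combined with token weights.
    t2v = (sims.max(dim=-1).values * text_weight.unsqueeze(1)).sum(dim=-1)
    # Video-to-text: every frame picks its best-matching token, combined with frame weights.
    v2t = (sims.max(dim=-2).values * video_weight.unsqueeze(0)).sum(dim=-1)
    return (t2v + v2t) / 2  # (Bt, Bv) retrieval logits

# Example with random features (batch 8, 32 tokens, 12 frames, dim 512):
logits = wti_similarity(torch.randn(8, 32, 512), torch.randn(8, 12, 512),
                        torch.softmax(torch.randn(8, 32), -1),
                        torch.softmax(torch.randn(8, 12), -1))
```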
To reproduce the ablation experiments, use the provided [scripts](scripts/msrvtt.sh).
Ablation results on MSR-VTT 1k (T2V = text-to-video retrieval; V2T = video-to-text retrieval; R@K is recall at K, higher is better; MdR/MnR are median/mean rank, lower is better):

| configs | feature | gpus | T2V R@1 | T2V R@5 | T2V R@10 | T2V MdR | T2V MnR | V2T R@1 | V2T R@5 | V2T R@10 | V2T MdR | V2T MnR | train time (h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4Clip | ViT/B-32 | 4 | 42.8 | 72.1 | 81.4 | 2.0 | 16.3 | 44.1 | 70.5 | 80.5 | 2.0 | 11.8 | 10.5 |
| zero-shot | ViT/B-32 | 4 | 31.1 | 53.7 | 63.4 | 4.0 | 41.6 | 26.5 | 50.1 | 61.7 | 5.0 | 39.9 | - |
| **Interaction** | | | | | | | | | | | | | |
| DP+None | ViT/B-32 | 4 | 42.9 | 70.6 | 81.4 | 2.0 | 15.4 | 43.0 | 71.1 | 81.1 | 2.0 | 11.8 | 2.5 |
| DP+seqTransf | ViT/B-32 | 4 | 42.8 | 71.1 | 81.1 | 2.0 | 15.6 | 44.1 | 70.9 | 80.9 | 2.0 | 11.7 | 2.6 |
| XTI+None | ViT/B-32 | 4 | 40.5 | 71.1 | 82.6 | 2.0 | 13.6 | 42.7 | 70.8 | 80.2 | 2.0 | 12.5 | 14.3 |
| XTI+seqTransf | ViT/B-32 | 4 | 42.4 | 71.3 | 80.9 | 2.0 | 15.2 | 40.1 | 69.2 | 79.6 | 2.0 | 15.8 | 16.8 |
| TI+seqTransf | ViT/B-32 | 4 | 44.8 | 73.0 | 82.2 | 2.0 | 13.4 | 42.6 | 72.7 | 82.8 | 2.0 | 9.1 | 2.6 |
| WTI+seqTransf | ViT/B-32 | 4 | 46.6 | 73.4 | 83.5 | 2.0 | 13.0 | 45.4 | 73.4 | 81.9 | 2.0 | 9.2 | 2.6 |
| **Channel DeCorrelation Regularization** | | | | | | | | | | | | | |
| DP+seqTransf+CDCR | ViT/B-32 | 4 | 43.9 | 71.1 | 81.2 | 2.0 | 15.3 | 42.3 | 70.3 | 81.1 | 2.0 | 11.4 | 2.6 |
| TI+seqTransf+CDCR | ViT/B-32 | 4 | 45.8 | 73.0 | 81.9 | 2.0 | 12.8 | 43.3 | 71.8 | 82.7 | 2.0 | 8.9 | 2.6 |
| WTI+seqTransf+CDCR | ViT/B-32 | 4 | 47.6 | 73.4 | 83.3 | 2.0 | 12.8 | 45.1 | 72.9 | 83.5 | 2.0 | 9.2 | 2.6 |
Note: these numbers are slightly better than those reported in the paper due to new hyperparameters.
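For the CDCR rows, `--cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001` enables channel decorrelation regularization. Below is a rough sketch of a cross-correlation-style decorrelation loss; the exact form and the role of each coefficient are my assumptions, not the repository's implementation:

```python
import torch

def cdcr_loss(text_feat, video_feat, alpha1=0.11, alpha2=0.0):
    """text_feat, video_feat: (B, D) features of matched text-video pairs.
    Assumed form, not the repo's exact loss."""
    B = text_feat.shape[0]
    # Standardize each channel across the batch.
    t = (text_feat - text_feat.mean(0)) / (text_feat.std(0) + 1e-6)
    v = (video_feat - video_feat.mean(0)) / (video_feat.std(0) + 1e-6)
    c = (t.T @ v) / B  # (D, D) cross-correlation between channels
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # pull matched channels toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push mismatched channels toward 0
    return alpha1 * on_diag + alpha2 * off_diag
```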
### Visualization demo
Run our visualization demo using [matplotlib](demo/show_wti.py) (no GPU needed).
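The snippet below is not `demo/show_wti.py` itself, just a self-contained illustration of plotting a token-frame weight matrix with matplotlib (the tokens and similarities are made up):

```python
import matplotlib.pyplot as plt
import numpy as np

tokens = ["a", "man", "plays", "guitar"]  # hypothetical caption tokens
sims = np.random.rand(len(tokens), 12)    # token x frame similarity matrix

fig, ax = plt.subplots(figsize=(6, 2))
im = ax.imshow(sims, aspect="auto", cmap="viridis")
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("video frame")
fig.colorbar(im, ax=ax, label="similarity")
plt.tight_layout()
plt.show()
```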
### License
See [LICENSE](LICENSE) for details.
### Acknowledgments
Our code is partly based on [CLIP4Clip](https://github.com/ArrowLuo/CLIP4Clip).