Align and Prompt: Video-and-Language Pre-training with Entity Prompts
- Host: GitHub
- URL: https://github.com/salesforce/alpro
- Owner: salesforce
- License: bsd-3-clause
- Created: 2021-12-11T00:01:49.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-09-20T04:43:57.000Z (over 2 years ago)
- Last Synced: 2024-12-10T08:42:15.383Z (5 months ago)
- Topics: prompt-learning, representation-learning, video-language, video-question-answering, video-text-retrieval, vision-and-language
- Language: Python
- Homepage:
- Size: 310 KB
- Stars: 186
- Watchers: 7
- Forks: 18
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: CODEOWNERS
- Security: SECURITY.md
# ALPRO (CVPR 2022)
## ALPRO is now officially integrated into [LAVIS](https://github.com/salesforce/LAVIS), a one-stop library for language-vision intelligence!
## Align and Prompt: Video-and-Language Pre-training with Entity Prompts [[Paper](https://arxiv.org/abs/2112.09583)]
[Dongxu Li](https://www.linkedin.com/in/dongxu-li-a8a035110/), [Junnan Li](https://sites.google.com/site/junnanlics), [Hongdong Li](http://users.cecs.anu.edu.au/~hongdong/), [Juan Carlos Niebles](http://www.niebles.net/), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
Official PyTorch code for ALPRO. This repository supports pre-training as well as finetuning on
- Text-Video Retrieval on MSRVTT and DiDeMo.
- Video Question Answering on MSRVTT and MSVD.

## Requirements
Our implementation is tested on Ubuntu 20.04.1 with NVIDIA A100 GPUs. Support for other platforms and hardware is possible but not guaranteed. To install the required packages:

```bash
cd env && bash install_pkg.sh
```

## Data Preparation
1. Download Annotations and Pre-trained Checkpoints
- [Text annotations](https://storage.googleapis.com/sfr-vision-language-research/ALPRO/data.zip)
- [Checkpoints of pre-trained model and finetuned model](https://storage.googleapis.com/sfr-vision-language-research/ALPRO/output.zip)
- [External resources](https://storage.googleapis.com/sfr-vision-language-research/ALPRO/ext.zip)
- unzip `data.zip`, `output.zip`, `ext.zip` under `ALPRO/` (a scripted version of this step is sketched below).
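For convenience, the downloads above can be scripted. A minimal sketch, run from the `ALPRO/` root, assuming `wget` and `unzip` are available and that each archive unpacks into its matching top-level directory (`data/`, `output/`, `ext/`):

```bash
# Fetch annotations, checkpoints, and external resources, then unpack them under ALPRO/.
for f in data output ext; do
    wget -nc "https://storage.googleapis.com/sfr-vision-language-research/ALPRO/${f}.zip"
    unzip -n "${f}.zip"   # -n: never overwrite existing files
done
```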
2. Download raw videos of downstream datasets.
- MSRVTT:
- download train_val_videos.zip and test_videos.zip from e.g. [here](https://www.mediafire.com/folder/h14iarbs62e7p/shared).
- check md5sum (see the verification sketch below):

```bash
51f2394d279cf84f1642defd9a651e6f train_val_videos.zip
0af68454cec9d586e92805739f3911d0 test_videos.zip
```
- unzip all the videos into `data/msrvtt_ret/videos` (10k in total).
- create the following soft link:

```bash
ln -s ../msrvtt_ret/videos data/msrvtt_qa/videos   # the symlink target is resolved relative to data/msrvtt_qa/
```
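The two checksums above can be verified in one step; a minimal sketch, assuming both zip files are in the current directory:

```bash
# Verify the MSRVTT archives against the expected md5 checksums.
md5sum -c <<'EOF'
51f2394d279cf84f1642defd9a651e6f  train_val_videos.zip
0af68454cec9d586e92805739f3911d0  test_videos.zip
EOF
```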
- MSVD:
- download from official release:
```bash
wget -nc https://www.cs.utexas.edu/users/ml/clamp/videoDescription/YouTubeClips.tar
```
- check md5sum:
```bash
9bdb20fcf14d59524a6febca9f6a8d89 YouTubeClips.tar
```
- extract all the videos into `data/msvd_qa/videos` (1,970 videos in total).
```bash
mkdir data/msvd_qa/videos/
tar xvf YouTubeClips.tar -C data/msvd_qa/videos --strip-components=1
```
- DiDeMo:
- Follow the [instructions](https://github.com/LisaAnne/LocalizingMoments/blob/master/README.md) and download from the official release [here](https://drive.google.com/drive/u/1/folders/1_oyJ5rQiZboipbMl6tkhY8v0s9zDkvJc);
- unzip all the videos into `data/didemo_ret/videos`.
- Note that a few videos may be missing; see [here](https://github.com/LisaAnne/LocalizingMoments/blob/master/README.md#getting-the-videos) for how to download them. However, as they account for only a small portion of the training set, it is safe to ignore them.
- Convert all the DiDeMo videos into `*.mp4` format using e.g. [`ffmpeg`](https://askubuntu.com/questions/396883/how-to-simply-convert-video-files-i-e-mkv-to-mp4).
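A minimal batch-conversion sketch, assuming the downloaded DiDeMo videos were staged in a hypothetical folder `data/didemo_ret/videos_raw` and should end up as `.mp4` files in `data/didemo_ret/videos`:

```bash
# Re-encode every staged DiDeMo video as .mp4 (paths are illustrative).
mkdir -p data/didemo_ret/videos
for f in data/didemo_ret/videos_raw/*; do
    name=$(basename "$f")
    ffmpeg -n -i "$f" "data/didemo_ret/videos/${name%.*}.mp4"   # -n: skip outputs that already exist
done
```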
- We obtained 10,463 videos following these steps (with one video `77807177@N00_5753455690_1e04ccb364` missing).

3. The directory is expected to be in the structure below:
```bash
.
|-config_release # configuration files
|-data # text annotations and raw videos
|---didemo_ret
|-----txt
|-----videos
|---msrvtt_qa/...
|---msrvtt_ret/...
|---msvd_qa/...
|-env # scripts to install packages
|-ext # external resources, e.g. bert tokenizer
|-output # checkpoints for pre-trained/finetuned models
|---downstreams
|-----didemo_ret
|-------public
|---------ckpt # official finetuned checkpoints
|---------log # inference log
|---------results_test
|-----------step_best_1_mean
|-----msrvtt_qa/...
|-----msrvtt_ret/...
|-----msvd_qa/...
|-run_scripts # bash scripts to launch experiments
|-src # source code
```

## Inference with Official Checkpoints
```bash
cd run_scripts
bash inf_msrvtt_ret.sh
# {'text2video': {'r1': 33.9, 'r5': 60.7, 'r10': 73.2, 'medianR': 3.0, 'meanR': 27.404}}
bash inf_didemo_ret.sh
# {'text2video': {'r1': 35.9, 'r5': 67.5, 'r10': 78.8, 'medianR': 3.0, 'meanR': 19.125}}
bash inf_msrvtt_qa.sh
# {'ratios': {'what_ratio': [68.48, 49872], 'who_ratio': [27.99, 20385], 'how_ratio': [2.25, 1640], 'where_ratio': [0.34, 250], 'when_ratio': [0.93, 677]}, 'overall_acc': 42.12, 'what_acc': 36.05, 'who_acc': 52.24, 'how_acc': 85.67, 'where_acc': 42.8, 'when_acc': 78.88}
bash inf_msvd_qa.sh
# {'ratios': {'what_ratio': [61.93, 8150], 'who_ratio': [34.6, 4554], 'how_ratio': [2.81, 370], 'where_ratio': [0.21, 28], 'when_ratio': [0.44, 58]}, 'overall_acc': 45.91, 'what_acc': 37.02, 'who_acc': 58.59, 'how_acc': 81.62, 'where_acc': 46.43, 'when_acc': 72.41}
```

## Downstream Task Finetuning
- To finetune on downstream tasks with the pre-trained checkpoint `output/pretrain/alpro_pretrained_ckpt.pt`:

```bash
cd run_scripts
bash ft_msrvtt_ret.sh
bash ft_didemo_ret.sh
bash ft_msrvtt_qa.sh
bash ft_msvd_qa.sh
```
For example, with MSRVTT retrieval:
```bash
cd ALPRO/
export PYTHONPATH="$PYTHONPATH:$PWD"
echo $PYTHONPATH

CONFIG_PATH='config_release/msrvtt_ret.json'
# change -np to the number of available GPUs
horovodrun -np 8 python src/tasks/run_video_retrieval.py \
--config $CONFIG_PATH \
--output_dir /export/home/workspace/experiments/alpro/finetune/msrvtt_ret/$(date '+%Y%m%d%H%M%S') # change to your local path to store finetuning ckpts and logs
```
- Run inference with locally-finetuned checkpoints.
```bash
cd ALPRO/
export PYTHONPATH="$PYTHONPATH:$PWD"
echo $PYTHONPATH

STEP='best'
CONFIG_PATH='config_release/msrvtt_ret.json'
OUTPUT_DIR='[INPUT_YOUR_OUTPUT_PATH_HERE]'

TXT_DB='data/msrvtt_ret/txt/test.jsonl'
IMG_DB='data/msrvtt_ret/videos'

horovodrun -np 8 python src/tasks/run_video_retrieval.py \
--do_inference 1 \
--inference_split test \
--inference_model_step $STEP \
--inference_txt_db $TXT_DB \
--inference_img_db $IMG_DB \
--inference_batch_size 64 \
--output_dir $OUTPUT_DIR \
--config $CONFIG_PATH
```
- `OUTPUT_DIR` is the path after the `--output_dir` option in the finetuning script.
- `$STEP` is a string that tells the script to use the checkpoint `$OUTPUT_DIR/ckpt/model_step_$STEP.pt` for inference.

## Pretraining
1. Download [WebVid2M](https://github.com/m-bain/frozen-in-time) and [CC-3M](https://github.com/igorbrigadir/DownloadConceptualCaptions).
- Put WebVid2M videos under `data/webvid2m`;
- 💡 we downsample WebVid2M videos to 10% of the original FPS to speed up video loading (one possible recipe is sketched below);
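The exact downsampling recipe is not part of this repository; a hedged per-clip sketch with `ffprobe`/`ffmpeg`, where the input and output paths are placeholders:

```bash
# Re-encode one WebVid2M clip at 10% of its source frame rate (paths are placeholders).
src=data/webvid2m/example_clip.mp4
fps=$(ffprobe -v error -select_streams v:0 -show_entries stream=avg_frame_rate \
      -of default=noprint_wrappers=1:nokey=1 "$src")               # e.g. "30000/1001"
target=$(awk -F'/' '{printf "%.3f", ($1 / $2) * 0.1}' <<< "$fps")
ffmpeg -n -i "$src" -filter:v "fps=$target" -an "${src%.mp4}_lowfps.mp4"   # -an: drop audio
```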
- update `data/cc3m/txt/cc3m.json` with local image paths.

2. Training the prompter:
```bash
cd run_scripts && bash pt_prompter.sh
```

3. Training the video-language model:
```bash
cd run_scripts && bash pt_alpro.sh
```
If you would like to use custom prompter weights, please change `teacher_weights_path` in `config_release/pretrain_alpro.json`.
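For example, one way to do this from the shell (the checkpoint path below is a placeholder, and this assumes `jq` is installed and `teacher_weights_path` is a top-level key in the config):

```bash
# Point the pre-training config at a custom prompter checkpoint.
jq '.teacher_weights_path = "output/pretrain/prompter/my_prompter_ckpt.pt"' \
    config_release/pretrain_alpro.json > /tmp/pretrain_alpro.json \
    && mv /tmp/pretrain_alpro.json config_release/pretrain_alpro.json
```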
4. To finetune with pre-trained checkpoints, please change `e2e_weights_path` in the finetuning config files, e.g. `config_release/msrvtt_ret.json`.

## Citation
If you find ALPRO useful for your research, please consider citing:
```bibtex
@inproceedings{li2021align,
title={Align and Prompt: Video-and-Language Pre-training with Entity Prompts},
author={Li, Dongxu and Li, Junnan and Li, Hongdong and Niebles, Juan Carlos and Hoi, Steven C.H.},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2022}
}
```

## Acknowledgement
We thank members at Salesforce Research for their helpful discussions.

The implementation of ALPRO relies on resources from [ClipBERT](https://github.com/jayleicn/ClipBERT),
[transformers](https://github.com/huggingface/transformers),
and [TimeSformer](https://github.com/facebookresearch/TimeSformer/tree/main/timesformer/models).
The code is implemented using [PyTorch](https://github.com/pytorch/pytorch),
with multi-GPU support from [Horovod](https://github.com/horovod/horovod) and [gradient-checkpoint](https://github.com/csrhddlam/pytorch-checkpoint). We thank the original authors for open-sourcing their work and encourage ALPRO users to cite their works when applicable.