{"id":18950245,"url":"https://github.com/salesforce/alpro","last_synced_at":"2025-08-20T03:32:01.097Z","repository":{"id":37290383,"uuid":"437146063","full_name":"salesforce/ALPRO","owner":"salesforce","description":"Align and Prompt: Video-and-Language Pre-training with Entity Prompts","archived":false,"fork":false,"pushed_at":"2022-09-20T04:43:57.000Z","size":317,"stargazers_count":186,"open_issues_count":2,"forks_count":18,"subscribers_count":7,"default_branch":"main","last_synced_at":"2024-12-10T08:42:15.383Z","etag":null,"topics":["prompt-learning","representation-learning","video-language","video-question-answering","video-text-retrieval","vision-and-language"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/salesforce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null}},"created_at":"2021-12-11T00:01:49.000Z","updated_at":"2024-12-01T05:54:21.000Z","dependencies_parsed_at":"2022-07-12T11:35:04.693Z","dependency_job_id":null,"html_url":"https://github.com/salesforce/ALPRO","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FALPRO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FALPRO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FALPRO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FALPRO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/salesfo
rce","download_url":"https://codeload.github.com/salesforce/ALPRO/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230388131,"owners_count":18217755,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["prompt-learning","representation-learning","video-language","video-question-answering","video-text-retrieval","vision-and-language"],"created_at":"2024-11-08T13:21:58.315Z","updated_at":"2024-12-19T06:10:21.723Z","avatar_url":"https://github.com/salesforce.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ALPRO (CVPR 22')\n\n## ALPRO is now officially integrated into [LAVIS](https://github.com/salesforce/LAVIS), a one-stop library for language-vision intelligence!\n\n## Align and Prompt: Video-and-Language Pre-training with Entity Prompts [[Paper](https://arxiv.org/abs/2112.09583)]\n\n[Dongxu Li](https://www.linkedin.com/in/dongxu-li-a8a035110/), [Junnan Li](https://sites.google.com/site/junnanlics), [Hongdong Li](http://users.cecs.anu.edu.au/~hongdong/), [Juan Carlos Niebles](http://www.niebles.net/), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)\n\n\u003cimg src=\"pics/teaser.jpg\" width=\"500\"\u003e\n\nOfficial PyTorch code for ALPRO. This repository supports pre-training as well as finetuning on:\n- Text-Video Retrieval on MSRVTT and DiDeMo.\n- Video Question Answering on MSRVTT and MSVD.\n\n## Requirements\nOur implementation is tested on Ubuntu 20.04.1 with NVIDIA A100 GPUs. Support for other platforms and hardware is possible but not guaranteed. 
To install the required packages:\n\n```bash\ncd env \u0026\u0026 bash install_pkg.sh\n```\n\n## Data Preparation \n1. Download Annotations and Pre-trained Checkpoints\n    - [Text annotations](https://storage.googleapis.com/sfr-vision-language-research/ALPRO/data.zip)\n    - [Checkpoints of pre-trained model and finetuned model](https://storage.googleapis.com/sfr-vision-language-research/ALPRO/output.zip)\n    - [External resources](https://storage.googleapis.com/sfr-vision-language-research/ALPRO/ext.zip)\n    - unzip `data.zip`, `output.zip`, `ext.zip` under `ALPRO/`.\n \n2. Download raw videos of downstream datasets.\n   - MSRVTT:\n     - download train_val_videos.zip and test_videos.zip from e.g. [here](https://www.mediafire.com/folder/h14iarbs62e7p/shared).\n     - check md5sum:\n\n        ```bash\n        51f2394d279cf84f1642defd9a651e6f  train_val_videos.zip\n        0af68454cec9d586e92805739f3911d0  test_videos.zip\n        ```\n     - unzip all the videos into `data/msrvtt_ret/videos` (10k in total).\n     - create the following soft link:\n\n        ```bash\n        # the link target is resolved relative to data/msrvtt_qa/\n        ln -s ../msrvtt_ret/videos data/msrvtt_qa/videos\n        ```\n    - MSVD:\n      - download from the official release:\n  \n        ```bash\n        wget -nc https://www.cs.utexas.edu/users/ml/clamp/videoDescription/YouTubeClips.tar\n        ```\n      - check md5sum:\n      \n        ```bash\n        9bdb20fcf14d59524a6febca9f6a8d89  YouTubeClips.tar\n        ```\n      - extract all the videos to `data/msvd_qa/videos` (1,970 videos in total).\n        \n        ```bash\n        mkdir data/msvd_qa/videos/ \n        tar xvf YouTubeClips.tar -C data/msvd_qa/videos --strip-components=1\n        ```\n    - DiDeMo:\n       - Follow the [instructions](https://github.com/LisaAnne/LocalizingMoments/blob/master/README.md) and download from the official release [here](https://drive.google.com/drive/u/1/folders/1_oyJ5rQiZboipbMl6tkhY8v0s9zDkvJc);\n       - unzip all the videos into `data/didemo_ret/videos`.\n       
- Note that a few videos may be missing. See [here](https://github.com/LisaAnne/LocalizingMoments/blob/master/README.md#getting-the-videos) to download them. However, as they account for a small portion of the training set, they can be safely ignored.\n       - Convert all the DiDeMo videos into `*.mp4` format using e.g. [`ffmpeg`](https://askubuntu.com/questions/396883/how-to-simply-convert-video-files-i-e-mkv-to-mp4).\n       - We obtained 10,463 videos following these steps (with one video `77807177@N00_5753455690_1e04ccb364` missing).\n\n\n\n  3. The directory is expected to be in the structure below:\n      ```bash\n      .\n      |-config_release  # configuration files\n      |-data  # text annotations and raw videos\n      |---didemo_ret\n      |-----txt\n      |-----videos\n      |---msrvtt_qa/...\n      |---msrvtt_ret/...\n      |---msvd_qa/...\n      |-env  # scripts to install packages\n      |-ext  # external resources, e.g. bert tokenizer\n      |-output  # checkpoints for pre-trained/finetuned models\n      |---downstreams\n      |-----didemo_ret\n      |-------public\n      |---------ckpt # official finetuned checkpoints\n      |---------log # inference log\n      |---------results_test\n      |-----------step_best_1_mean\n      |-----msrvtt_qa/...\n      |-----msrvtt_ret/...\n      |-----msvd_qa/...\n      |-run_scripts  # bash scripts to launch experiments\n      |-src  # source code\n      ```\n\n## Inference with Official Checkpoints\n\n  ```bash\n  cd run_scripts\n  bash inf_msrvtt_ret.sh\n  # {'text2video': {'r1': 33.9, 'r5': 60.7, 'r10': 73.2, 'medianR': 3.0, 'meanR': 27.404}}\n  bash inf_didemo_ret.sh\n  # {'text2video': {'r1': 35.9, 'r5': 67.5, 'r10': 78.8, 'medianR': 3.0, 'meanR': 19.125}}\n  bash inf_msrvtt_qa.sh\n  # {'ratios': {'what_ratio': [68.48, 49872], 'who_ratio': [27.99, 20385], 'how_ratio': [2.25, 1640], 'where_ratio': [0.34, 250], 'when_ratio': [0.93, 677]}, 'overall_acc': 42.12, 'what_acc': 36.05, 'who_acc': 52.24, 'how_acc': 
85.67, 'where_acc': 42.8, 'when_acc': 78.88}\n  bash inf_msvd_qa.sh\n  # {'ratios': {'what_ratio': [61.93, 8150], 'who_ratio': [34.6, 4554], 'how_ratio': [2.81, 370], 'where_ratio': [0.21, 28], 'when_ratio': [0.44, 58]}, 'overall_acc': 45.91, 'what_acc': 37.02, 'who_acc': 58.59, 'how_acc': 81.62, 'where_acc': 46.43, 'when_acc': 72.41}\n  ```\n\n\n## Downstream Task Finetuning\n  - To finetune on downstream tasks with the pre-trained checkpoint `output/pretrain/alpro_pretrained_ckpt.pt`:\n\n    ```bash\n    cd run_scripts\n    bash ft_msrvtt_ret.sh\n    bash ft_didemo_ret.sh\n    bash ft_msrvtt_qa.sh\n    bash ft_msvd_qa.sh\n    ```\n  \n    For example, with MSRVTT retrieval:\n    ```bash\n    cd ALPRO/\n\n    export PYTHONPATH=\"$PYTHONPATH:$PWD\"\n    echo $PYTHONPATH\n\n    CONFIG_PATH='config_release/msrvtt_ret.json'\n\n    # change -np to the number of GPUs.\n    horovodrun -np 8 python src/tasks/run_video_retrieval.py \\\n        --config $CONFIG_PATH \\\n        --output_dir /export/home/workspace/experiments/alpro/finetune/msrvtt_ret/$(date '+%Y%m%d%H%M%S')  # change to your local path to store finetuning ckpts and logs \n    ``` \n - Run inference with locally-finetuned checkpoints.\n   ```bash\n    cd ALPRO/\n\n    export PYTHONPATH=\"$PYTHONPATH:$PWD\"\n    echo $PYTHONPATH\n\n    STEP='best'\n\n    CONFIG_PATH='config_release/msrvtt_ret.json'\n    OUTPUT_DIR='[INPUT_YOUR_OUTPUT_PATH_HERE]'\n\n    TXT_DB='data/msrvtt_ret/txt/test.jsonl'\n    IMG_DB='data/msrvtt_ret/videos'\n\n    horovodrun -np 8 python src/tasks/run_video_retrieval.py \\\n        --do_inference 1 \\\n        --inference_split test \\\n        --inference_model_step $STEP \\\n        --inference_txt_db $TXT_DB \\\n        --inference_img_db $IMG_DB \\\n        --inference_batch_size 64 \\\n        --output_dir $OUTPUT_DIR \\\n        --config $CONFIG_PATH\n   ```  \n   - `OUTPUT_DIR` is the path after the `--output_dir` option in the finetuning script.\n   - `$STEP` is a string, which tells the 
script to use the checkpoint `$OUTPUT_DIR/ckpt/model_step_$STEP.pt` for inference. \n\n\n## Pretraining\n1. Download [WebVid2M](https://github.com/m-bain/frozen-in-time) and [CC-3M](https://github.com/igorbrigadir/DownloadConceptualCaptions).\n  \n    - Put WebVid2M videos under `data/webvid2m`;\n    - 💡 we downsample WebVid2M videos to 10% of the original FPS to speed up video loading;\n    - update `data/cc3m/txt/cc3m.json` with local image paths.\n\n2. Training Prompter:\n    ```bash\n    cd run_scripts \u0026\u0026 bash pt_prompter.sh\n    ```   \n\n3. Training video-language model: \n    ```bash\n    cd run_scripts \u0026\u0026 bash pt_alpro.sh\n    ```\n    If you would like to use a custom prompter weight, please change `teacher_weights_path` in `config_release/pretrain_alpro.json`.\n4. To finetune with pre-trained checkpoints, please change `e2e_weights_path` in the finetuning config files, e.g. `config_release/msrvtt_ret.json`.\n\n\n## Citation\n\nIf you find ALPRO useful for your research, please consider citing:\n```bibtex\n  @inproceedings{li2021align,\n    title={Align and Prompt: Video-and-Language Pre-training with Entity Prompts},\n    author={Dongxu Li and Junnan Li and Hongdong Li and Juan Carlos Niebles and Steven C.H. Hoi},\n    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n    year={2022}\n  }\n```\n\n## Acknowledgement\nWe thank members at Salesforce Research for their helpful discussions.\n\nThe implementation of ALPRO relies on resources from [ClipBERT](https://github.com/jayleicn/ClipBERT),\n[transformers](https://github.com/huggingface/transformers), and\n[TimeSformer](https://github.com/facebookresearch/TimeSformer/tree/main/timesformer/models).\nThe code is implemented using [PyTorch](https://github.com/pytorch/pytorch), \nwith multi-GPU support from [Horovod](https://github.com/horovod/horovod) and [gradient-checkpoint](https://github.com/csrhddlam/pytorch-checkpoint).  
We thank the original authors for their open-sourcing and encourage ALPRO users to cite their works when applicable.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2Falpro","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsalesforce%2Falpro","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalesforce%2Falpro/lists"}