{"id":13563855,"url":"https://github.com/ArrowLuo/CLIP4Clip","last_synced_at":"2025-04-03T20:32:05.725Z","repository":{"id":39567460,"uuid":"357478494","full_name":"ArrowLuo/CLIP4Clip","owner":"ArrowLuo","description":"An official implementation for \"CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval\"","archived":false,"fork":false,"pushed_at":"2024-04-12T14:37:10.000Z","size":1686,"stargazers_count":823,"open_issues_count":27,"forks_count":117,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-08-01T13:30:12.499Z","etag":null,"topics":["activitynet","clip","didemo","lsmdc","msrvtt","msvd","multimodal","multimodal-learning","multimodality","ranking","retrieval","retrieval-model","search","video-clip-retrieval","video-text-retrieval"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2104.08860","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArrowLuo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-13T08:25:46.000Z","updated_at":"2024-07-28T06:40:17.000Z","dependencies_parsed_at":"2024-08-01T13:19:51.151Z","dependency_job_id":"d83ef62d-2263-47f7-a521-5f23e217904a","html_url":"https://github.com/ArrowLuo/CLIP4Clip","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArrowLuo%2FCLIP4Clip","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArrowLuo%2FCLIP4Clip/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/ArrowLuo%2FCLIP4Clip/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArrowLuo%2FCLIP4Clip/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArrowLuo","download_url":"https://codeload.github.com/ArrowLuo/CLIP4Clip/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223030753,"owners_count":17076494,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["activitynet","clip","didemo","lsmdc","msrvtt","msvd","multimodal","multimodal-learning","multimodality","ranking","retrieval","retrieval-model","search","video-clip-retrieval","video-text-retrieval"],"created_at":"2024-08-01T13:01:23.974Z","updated_at":"2024-11-04T16:31:26.040Z","avatar_url":"https://github.com/ArrowLuo.png","language":"Python","funding_links":[],"categories":["Python","其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval\n\n(**July 28, 2021**) Add ViT-B/16 with an extra `--pretrained_clip_name`\n\n(**Apr. 22, 2021**) First version \n\nThe implementation of paper [**CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval**](https://arxiv.org/abs/2104.08860). \n\nCLIP4Clip is a video-text retrieval model based on [CLIP (ViT-B)](https://github.com/openai/CLIP). We investigate three similarity calculation approaches: parameter-free type, sequential type, and tight type, in this work. 
The model achieves SOTA results on MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.\n\n![CLIP4Clip](CLIP4Clip.png)\n\n## Requirement\n```sh\n# From CLIP\nconda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0\npip install ftfy regex tqdm\npip install opencv-python boto3 requests pandas\n```\n\n## Data Preparing\n\n**For MSRVTT**\n\nThe official data and video links can be found at [link](http://ms-multimedia-challenge.com/2017/dataset). \n\nFor convenience, you can also download the splits and captions with:\n```sh\nwget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip\n```\n\nThe raw videos are also available via the [sharing](https://github.com/m-bain/frozen-in-time#-finetuning-benchmarks-msr-vtt) from *Frozen in Time*, i.e.,\n```sh\nwget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip\n```\n\n**For MSVD**\n\nRaw videos can be downloaded from [link](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/). \n\nThe splits and `raw_captions` can be found in the excellent [collaborative-experts](https://github.com/albanie/collaborative-experts/blob/master/misc/datasets/msvd/README.md) project. For convenience, you can also download them with:\n```sh\nwget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msvd_data.zip\n```\n\n**For LSMDC**\n\nYou must obtain permission from MPII to download and use the data. The download link is [here](https://sites.google.com/site/describingmovies/download).\nThe data for the 1000 test clips is available at [link](http://www.google.com/url?q=http%3A%2F%2Fdatasets.d2.mpi-inf.mpg.de%2FmovieDescription%2Fprotected%2Flsmdc2016%2FLSMDC16_challenge_1000_publictect.csv\u0026sa=D\u0026sntz=1\u0026usg=AFQjCNGIaGVhCeb6zNfUs2UL1zNzoEtaSg). 
Read our paper and the [dataloader](./dataloaders/dataloader_lsmdc_retrieval.py) for more information.\n\n**For ActivityNet**\n\nThe official website has made the full dataset available on Google and Baidu drives; see [here](http://activity-net.org/download.html) for more information. The splits can be found in the [collaborative-experts](https://github.com/albanie/collaborative-experts/tree/master/misc/datasets/activity-net) project.\n\n**For DiDeMo**\n\nRaw videos can be downloaded from [LisaAnne/LocalizingMoments](https://github.com/LisaAnne/LocalizingMoments). The splits can be found in the [collaborative-experts](https://github.com/albanie/collaborative-experts/tree/master/misc/datasets/didemo/README.md) project.\n\n\n## Compress Video for Speed-up (optional)\n```sh\npython preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]\n```\nThis script will compress the video to *3fps* with width *224* (or height *224*). Modify the variables in the script to customize these settings.\n\n## How to Run \n\n\u003e`--features_path` is the video root path\n\u003e \n\u003e`--linear_patch` can be set to `2d` or `3d`\n\u003e \n\u003e `--sim_header` can be set to `meanP`, `seqLSTM`, `seqTransf`, or `tightTransf`\n\u003e \n\u003e `--pretrained_clip_name` can be set to `ViT-B/32` or `ViT-B/16`\n\u003e \n\u003e `--resume_model` can be used to reload the saved optimizer state and continue training the model. **Note**: the corresponding checkpoint must also be set via `--init_model`. \n\nRead our paper for more details on `--linear_patch` and `--sim_header`. Try other hyperparameters for better performance. 
\n\nDownload the CLIP (ViT-B/32) weights,\n```sh\nwget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt\n```\nor download the CLIP (ViT-B/16) weights,\n```sh\nwget -P ./modules https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt\n```\n\nThen, run the command for your dataset.\n\n\n*CLIP (ViT-B/32) is the default setting in the paper; replace it with ViT-B/16 for better performance.*\n\n### MSRVTT\n\n```sh\nDATA_PATH=[Your MSRVTT data and videos path]\npython -m torch.distributed.launch --nproc_per_node=4 \\\nmain_task_retrieval.py --do_train --num_thread_reader=0 \\\n--epochs=5 --batch_size=128 --n_display=50 \\\n--train_csv ${DATA_PATH}/MSRVTT_train.9k.csv \\\n--val_csv ${DATA_PATH}/MSRVTT_JSFUSION_test.csv \\\n--data_path ${DATA_PATH}/MSRVTT_data.json \\\n--features_path ${DATA_PATH}/MSRVTT_Videos \\\n--output_dir ckpts/ckpt_msrvtt_retrieval_looseType \\\n--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \\\n--datatype msrvtt --expand_msrvtt_sentences  \\\n--feature_framerate 1 --coef_lr 1e-3 \\\n--freeze_layer_num 0  --slice_framepos 2 \\\n--loose_type --linear_patch 2d --sim_header meanP \\\n--pretrained_clip_name ViT-B/32\n```\n\n### MSVD\n```sh\nDATA_PATH=[Your MSVD data and videos path]\npython -m torch.distributed.launch --nproc_per_node=4 \\\nmain_task_retrieval.py --do_train --num_thread_reader=2 \\\n--epochs=5 --batch_size=128 --n_display=50 \\\n--data_path ${DATA_PATH} \\\n--features_path ${DATA_PATH}/MSVD_Videos \\\n--output_dir ckpts/ckpt_msvd_retrieval_looseType \\\n--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \\\n--datatype msvd \\\n--feature_framerate 1 --coef_lr 1e-3 \\\n--freeze_layer_num 0 --slice_framepos 2 \\\n--loose_type --linear_patch 2d --sim_header meanP \\\n--pretrained_clip_name ViT-B/32\n```\n\n### LSMDC\n```sh\nDATA_PATH=[Your LSMDC data and videos path]\npython -m 
torch.distributed.launch --nproc_per_node=4 \\\nmain_task_retrieval.py --do_train --num_thread_reader=2 \\\n--epochs=5 --batch_size=128 --n_display=50 \\\n--data_path ${DATA_PATH} \\\n--features_path ${DATA_PATH}/LSMDC_Videos \\\n--output_dir ckpts/ckpt_lsmdc_retrieval_looseType \\\n--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \\\n--datatype lsmdc --feature_framerate 1 --coef_lr 1e-3 \\\n--freeze_layer_num 0  --slice_framepos 2 \\\n--loose_type --linear_patch 2d --sim_header meanP \\\n--pretrained_clip_name ViT-B/32\n```\n\n### ActivityNet\nActivityNet is treated as video-paragraph retrieval in our setting and therefore needs more GPUs (or a multi-node run).\n```sh\nDATA_PATH=[Your ActivityNet data and videos path]\npython -m torch.distributed.launch --nproc_per_node=8 \\\nmain_task_retrieval.py --do_train --num_thread_reader=2 \\\n--epochs=5 --batch_size=128 --n_display=50 \\\n--data_path ${DATA_PATH} \\\n--features_path ${DATA_PATH}/Activity_Videos \\\n--output_dir ckpts/ckpt_activity_retrieval_looseType \\\n--lr 1e-4 --max_words 64 --max_frames 64 --batch_size_val 16 \\\n--datatype activity --feature_framerate 1 --coef_lr 1e-3 \\\n--freeze_layer_num 0  --slice_framepos 2 \\\n--loose_type --linear_patch 2d --sim_header meanP \\\n--pretrained_clip_name ViT-B/32\n```\n\n### DiDeMo\nDiDeMo is treated as video-paragraph retrieval in our setting and therefore needs more GPUs (or a multi-node run).\n```sh\nDATA_PATH=[Your DiDeMo data and videos path]\npython -m torch.distributed.launch --nproc_per_node=8 \\\nmain_task_retrieval.py --do_train --num_thread_reader=2 \\\n--epochs=5 --batch_size=128 --n_display=50 \\\n--data_path ${DATA_PATH} \\\n--features_path ${DATA_PATH}/DiDeMo_Videos \\\n--output_dir ckpts/ckpt_didemo_retrieval_looseType \\\n--lr 1e-4 --max_words 64 --max_frames 64 --batch_size_val 16 \\\n--datatype didemo --feature_framerate 1 --coef_lr 1e-3 \\\n--freeze_layer_num 0  --slice_framepos 2 \\\n--loose_type --linear_patch 2d --sim_header meanP 
\\\n--pretrained_clip_name ViT-B/32\n```\n\n# Citation\nIf you find CLIP4Clip useful in your work, you can cite the following paper:\n```bibtex\n@Article{Luo2021CLIP4Clip,\n  author  = {Huaishao Luo and Lei Ji and Ming Zhong and Yang Chen and Wen Lei and Nan Duan and Tianrui Li},\n  title   = {{CLIP4Clip}: An Empirical Study of CLIP for End to End Video Clip Retrieval},\n  journal = {arXiv preprint arXiv:2104.08860},\n  year    = {2021},\n}\n```\n\n# Acknowledgments\nOur code is based on [CLIP](https://github.com/openai/CLIP) and [UniVL](https://github.com/microsoft/UniVL).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FArrowLuo%2FCLIP4Clip","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FArrowLuo%2FCLIP4Clip","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FArrowLuo%2FCLIP4Clip/lists"}