{"id":13563860,"url":"https://github.com/CryhanFang/CLIP2Video","last_synced_at":"2025-04-03T20:32:17.543Z","repository":{"id":42616131,"uuid":"378793604","full_name":"CryhanFang/CLIP2Video","owner":"CryhanFang","description":null,"archived":false,"fork":false,"pushed_at":"2022-12-10T23:51:47.000Z","size":21836,"stargazers_count":231,"open_issues_count":16,"forks_count":28,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-11-04T16:45:57.538Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CryhanFang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-21T03:17:10.000Z","updated_at":"2024-10-30T09:01:03.000Z","dependencies_parsed_at":"2023-01-26T12:46:08.477Z","dependency_job_id":null,"html_url":"https://github.com/CryhanFang/CLIP2Video","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryhanFang%2FCLIP2Video","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryhanFang%2FCLIP2Video/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryhanFang%2FCLIP2Video/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryhanFang%2FCLIP2Video/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CryhanFang","download_url":"https://codeload.github.com/CryhanFang/CLIP2Video/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247075223,"owners_
count":20879410,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T13:01:24.038Z","updated_at":"2025-04-03T20:32:17.536Z","avatar_url":"https://github.com/CryhanFang.png","language":"Python","funding_links":[],"categories":["Python","其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# CLIP2Video: Mastering Video-Text Retrieval via Image CLIP\n\nThis is the implementation of the paper [**CLIP2Video: Mastering Video-Text Retrieval via Image CLIP**](https://arxiv.org/abs/2106.11097).\n\nCLIP2Video is a video-text retrieval model based on [CLIP (ViT-B/32)](https://github.com/openai/CLIP), which transfers the image-language pre-training model to video-text retrieval in an end-to-end manner. Our model uses a Temporal Difference Block to capture motion across fine-grained video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.\n\n![Pipeline](pipeline.png)\n![Blocks](module.png)\n\n\n## Introduction\nThis is the source code of CLIP2Video, \na method for video-text retrieval based on temporal correlations. 
\nIt is built in PyTorch on top of CLIP4Clip ([Huaishao Luo *et al.*](https://github.com/ArrowLuo/CLIP4Clip)).\n\n\n## Requirement\n```\npip install -r requirements.txt\n```\n\n## Download data and Pre-trained Model\n\n**Supported public training sets:**\n* MSR-VTT (9k)\n* MSR-VTT (full)\n* MSVD\n* VATEX-English version\n\n**Supported public testing protocols:**\n* MSR-VTT 1k-A protocol (*SOTA*)\n* MSR-VTT full protocol (*SOTA*)\n* MSVD (*SOTA*)\n* VATEX-English version (*SOTA*)\n\n\n**Download official videos:**\nThe official videos for each dataset can be found at the following links:\n* MSR-VTT: [link](http://ms-multimedia-challenge.com/2017/dataset).\n* MSVD: [link](https://www.cs.utexas.edu/users/ml/clamp/videoDescription).\n* VATEX: [link](https://eric-xw.github.io/vatex-website/download.html).\n\n**Pre-process**\n\nTo train and test on the above datasets, first use `sample_frame.py` to extract frames from the videos.\n~~~\npython sample_frame.py --input_path [raw video path] --output_path [frame path]\n~~~\n\n(*Optional*) The splits and captions are available from the dataset links above. 
For convenience, you can also use the splits in `data/` directly.\n\n**Download CLIP model**\n\nTo train and test on the above datasets with the pre-trained CLIP model, visit [CLIP](https://github.com/openai/CLIP) and download [ViT-B/32](https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt).\n\n\n\n## Test Model\n\nWe provide three models trained on MSVD, MSR-VTT and VATEX-English.\n\n|    Model Name         |   Checkpoint|\n| :-----------:  | :-----------: |\n|CLIP2Video_MSVD |\t[link](https://drive.google.com/drive/folders/1LKMUZFf9EAxFbGShlA22eUCeGKC8DWx4?usp=sharing)\t|\n|CLIP2Video_MSRVTT9k |\t[link](https://drive.google.com/drive/folders/1a5Dcg8wNh88Z-bxb0ZMV3IJFjtSe7X2A?usp=sharing)\t|\n|CLIP2Video_VATEX |\t[link](https://drive.google.com/drive/folders/15IDB7NdNx6DQx-LcTzvB3JiZZFu9v36l?usp=sharing)\t|\n\n\nTo test the trained models, please refer to `test/`.\n\n(*Optional*) If the trained model path (`--checkpoint`) doesn't exist, the parameters of the base CLIP model (`--clip_path`) are loaded instead.\n\n## Main Article Results of CLIP2Video\n\n**T2V:**\n\n|    Protocol         |   R@1     |   R@5     |   R@10    | Median Rank   | Mean Rank |\n| :-----------:  | :-----------: | ---------- | :-----------:  | :-----------: | :-----------: | \n|MSVD |\t47.0\t|   76.8\t|   85.9    |\t    2\t    |   9.6     |\n|MSRVTT-9k |\t45.6\t|   72.6\t|   81.7    |\t    2\t    |   14.6     |\n|MSRVTT-Full |\t29.8\t|   55.5\t|   66.2    |\t    4\t    |   45.5     |\n|Vatex (English) random 1k5 split |\t57.3\t|   90.0\t|   95.5    |\t    1\t    |   3.6    |\n|Vatex (English) HGR split|\t61.2\t|   90.9\t|   95.6    |\t    1\t    |   3.4    |\n\n\n**V2T:**\n\n|          Protocol           |   R@1     |   R@5     |   R@10    | Median Rank   | Mean Rank |\n| :-----------:  | :-----------: | ---------- | :-----------:  | :-----------: | :-----------: | \n|MSVD |\t58.7\t|   85.6\t|   91.6    |\t    1\t    |   4.3 
    |\n|MSRVTT-9k |\t43.5\t|   72.3\t|   82.1    |\t    2\t    |   10.2     |\n|MSRVTT-Full |\t54.6\t|   82.1\t|   90.8    |\t    1\t    |   5.3     |\n|Vatex (English) random 1k5 split  |\t76.0\t|   97.7\t|   99.9    |\t    1\t    |   1.5     |\n|Vatex (English) HGR split |\t77.9\t|   98.1\t|   99.1    |\t    1\t    |   1.6   |\n\n\n**(Optional)** Clarification of the different VATEX results:\n1. In our paper, we do not strictly follow [HGR's split](https://arxiv.org/abs/2003.00392); instead, we randomly split the test set ourselves. That split is in\n    * data/vatex_data/test1k5_sec_list.txt\n    \n2. For the HGR split, we use exactly the same split as HGR, which is in\n    * data/vatex_data/test_list.txt\n    * data/vatex_data/val_list.txt\n\nWe will later revise the results in the paper to strictly follow the HGR split for a fair comparison.\n\n-----------------------\n\n# Citation\nIf you find CLIP2Video useful in your work, you can cite the following paper:\n```\n@article{fang2021clip2video,\n  title={CLIP2Video: Mastering Video-Text Retrieval via Image CLIP},\n  author={Fang, Han and Xiong, Pengfei and Xu, Luhui and Chen, Yu},\n  journal={arXiv preprint arXiv:2106.11097},\n  year={2021}\n}\n```\n\n# Acknowledgments\nSome components of this code implementation are adopted from [CLIP](https://github.com/openai/CLIP) and [CLIP4Clip](https://github.com/ArrowLuo/CLIP4Clip/).\nWe sincerely appreciate their contributions.\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCryhanFang%2FCLIP2Video","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCryhanFang%2FCLIP2Video","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCryhanFang%2FCLIP2Video/lists"}