{"id":22366321,"url":"https://github.com/microsoft/univl","last_synced_at":"2025-04-05T15:07:54.067Z","repository":{"id":39878238,"uuid":"308532916","full_name":"microsoft/UniVL","owner":"microsoft","description":"An official implementation for \" UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation\"","archived":false,"fork":false,"pushed_at":"2024-07-25T11:07:33.000Z","size":224,"stargazers_count":350,"open_issues_count":15,"forks_count":55,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-04-05T15:07:48.700Z","etag":null,"topics":["alignment","caption","caption-task","coin","joint","localization","msrvtt","multimodal-sentiment-analysis","multimodality","pretrain","pretraining","retrieval-task","segmentation","video","video-language","video-text","video-text-retrieval","youcookii"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2002.06353","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-30T05:22:22.000Z","updated_at":"2025-04-04T02:00:03.000Z","dependencies_parsed_at":"2025-01-17T16:09:41.628Z","dependency_job_id":"12f39f20-807d-4b70-a4cd-94424ba73ea7","html_url":"https://github.com/microsoft/UniVL","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FUniVL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m
icrosoft%2FUniVL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FUniVL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FUniVL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/UniVL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247353746,"owners_count":20925329,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","caption","caption-task","coin","joint","localization","msrvtt","multimodal-sentiment-analysis","multimodality","pretrain","pretraining","retrieval-task","segmentation","video","video-language","video-text","video-text-retrieval","youcookii"],"created_at":"2024-12-04T18:09:45.410Z","updated_at":"2025-04-05T15:07:54.027Z","avatar_url":"https://github.com/microsoft.png","language":"Python","readme":"\nThe implementation of the paper [**UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation**](https://arxiv.org/abs/2002.06353).\n\nUniVL is a **video-language pre-training model**. It is designed with four modules and five objectives for both video-language understanding and generation tasks. It is also a flexible model for most multimodal downstream tasks, balancing efficiency and effectiveness.\n\n![alt text](assets/imgs/UniVL_framework.jpg)\n\n# Preliminary\nExecute the scripts below in the main folder first. 
This avoids *download conflicts* during distributed pretraining.\n```\nmkdir modules/bert-base-uncased\ncd modules/bert-base-uncased/\nwget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt\nmv bert-base-uncased-vocab.txt vocab.txt\nwget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz\ntar -xvf bert-base-uncased.tar.gz\nrm bert-base-uncased.tar.gz\ncd ../../\n```\n\n# Requirements\n- python==3.6.9\n- torch==1.7.1+cu92\n- tqdm\n- boto3\n- requests\n- pandas\n- nlg-eval (install Java 1.8.0 or higher first)\n```\nconda create -n py_univl python=3.6.9 tqdm boto3 requests pandas\nconda activate py_univl\npip install torch==1.7.1+cu92\npip install git+https://github.com/Maluuba/nlg-eval.git@master\n```\n\n# Pretrained Weight\n```\nmkdir -p ./weight\nwget -P ./weight https://github.com/microsoft/UniVL/releases/download/v0/univl.pretrained.bin\n```\n\n# Prepare for Evaluation\nGet data for retrieval and caption (with only video input) on YoucookII and MSRVTT.\n## YoucookII\n```\nmkdir -p data\ncd data\nwget https://github.com/microsoft/UniVL/releases/download/v0/youcookii.zip\nunzip youcookii.zip\ncd ..\n```\nNote: `youcookii_data.no_transcript.pickle` in the zip file is a version without transcripts. The transcript version will not be made publicly available due to possible legal issues. Thus, you need to replace `youcookii_data.pickle` with `youcookii_data.no_transcript.pickle` for the YoucookII retrieval task and the *caption with only video input* task. S3D features can be found in `youcookii_videos_features.pickle`; they are extracted as one 1024-dimensional vector per second. More details can be found in [dataloaders](./dataloaders/README.md) and our paper.\n\n## MSRVTT\n```\nmkdir -p data\ncd data\nwget https://github.com/microsoft/UniVL/releases/download/v0/msrvtt.zip\nunzip msrvtt.zip\ncd ..\n```\n\n# Finetune on YoucookII\n## Retrieval\n\n1. 
Run the retrieval task on **YoucookII**\n\n```\nDATATYPE=\"youcook\"\nTRAIN_CSV=\"data/youcookii/youcookii_train.csv\"\nVAL_CSV=\"data/youcookii/youcookii_val.csv\"\nDATA_PATH=\"data/youcookii/youcookii_data.pickle\"\nFEATURES_PATH=\"data/youcookii/youcookii_videos_features.pickle\"\nINIT_MODEL=\"weight/univl.pretrained.bin\"\nOUTPUT_ROOT=\"ckpts\"\n\npython -m torch.distributed.launch --nproc_per_node=4 \\\nmain_task_retrieval.py \\\n--do_train --num_thread_reader=16 \\\n--epochs=5 --batch_size=32 \\\n--n_display=100 \\\n--train_csv ${TRAIN_CSV} \\\n--val_csv ${VAL_CSV} \\\n--data_path ${DATA_PATH} \\\n--features_path ${FEATURES_PATH} \\\n--output_dir ${OUTPUT_ROOT}/ckpt_youcook_retrieval --bert_model bert-base-uncased \\\n--do_lower_case --lr 3e-5 --max_words 48 --max_frames 48 \\\n--batch_size_val 64 --visual_num_hidden_layers 6 \\\n--datatype ${DATATYPE} --init_model ${INIT_MODEL}\n```\nThe results (FT-Joint) are close to `R@1: 0.2269 - R@5: 0.5245 - R@10: 0.6586 - Median R: 5.0`.\n\nAdd `--train_sim_after_cross` to train with the alignment approach (FT-Align).\n\nThe results (FT-Align) are close to `R@1: 0.2890 - R@5: 0.5760 - R@10: 0.7000 - Median R: 4.0`.\n\n2. 
Run the retrieval task on **MSRVTT**\n```\nDATATYPE=\"msrvtt\"\nTRAIN_CSV=\"data/msrvtt/MSRVTT_train.9k.csv\"\nVAL_CSV=\"data/msrvtt/MSRVTT_JSFUSION_test.csv\"\nDATA_PATH=\"data/msrvtt/MSRVTT_data.json\"\nFEATURES_PATH=\"data/msrvtt/msrvtt_videos_features.pickle\"\nINIT_MODEL=\"weight/univl.pretrained.bin\"\nOUTPUT_ROOT=\"ckpts\"\n\npython -m torch.distributed.launch --nproc_per_node=4 \\\nmain_task_retrieval.py \\\n--do_train --num_thread_reader=16 \\\n--epochs=5 --batch_size=128 \\\n--n_display=100 \\\n--train_csv ${TRAIN_CSV} \\\n--val_csv ${VAL_CSV} \\\n--data_path ${DATA_PATH} \\\n--features_path ${FEATURES_PATH} \\\n--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_retrieval --bert_model bert-base-uncased \\\n--do_lower_case --lr 5e-5 --max_words 48 --max_frames 48 \\\n--batch_size_val 64 --visual_num_hidden_layers 6 \\\n--datatype ${DATATYPE} --expand_msrvtt_sentences --init_model ${INIT_MODEL}\n```\nThe results (FT-Joint) are close to \n`R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0`\n\nAdd `--train_sim_after_cross` to train with the alignment approach (FT-Align).\n\n## Caption\nRun the caption task on **YoucookII**\n\n```\nTRAIN_CSV=\"data/youcookii/youcookii_train.csv\"\nVAL_CSV=\"data/youcookii/youcookii_val.csv\"\nDATA_PATH=\"data/youcookii/youcookii_data.pickle\"\nFEATURES_PATH=\"data/youcookii/youcookii_videos_features.pickle\"\nINIT_MODEL=\"weight/univl.pretrained.bin\"\nOUTPUT_ROOT=\"ckpts\"\n\npython -m torch.distributed.launch --nproc_per_node=4 \\\nmain_task_caption.py \\\n--do_train --num_thread_reader=4 \\\n--epochs=5 --batch_size=16 \\\n--n_display=100 \\\n--train_csv ${TRAIN_CSV} \\\n--val_csv ${VAL_CSV} \\\n--data_path ${DATA_PATH} \\\n--features_path ${FEATURES_PATH} \\\n--output_dir ${OUTPUT_ROOT}/ckpt_youcook_caption --bert_model bert-base-uncased \\\n--do_lower_case --lr 3e-5 --max_words 128 --max_frames 96 \\\n--batch_size_val 64 --visual_num_hidden_layers 6 \\\n--decoder_num_hidden_layers 3 --stage_two \\\n--init_model ${INIT_MODEL}\n```\n\u003eThe 
results are close to \n```\nBLEU_1: 0.4746, BLEU_2: 0.3355, BLEU_3: 0.2423, BLEU_4: 0.1779\nMETEOR: 0.2261, ROUGE_L: 0.4697, CIDEr: 1.8631\n```\n\nIf using video only as input (`youcookii_data.no_transcript.pickle`),\n\u003eThe results are close to \n```\nBLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117\nMETEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725\n```\n\nRun caption task on **MSRVTT**\n\n```\nDATATYPE=\"msrvtt\"\nTRAIN_CSV=\"data/msrvtt/MSRVTT_train.9k.csv\"\nVAL_CSV=\"data/msrvtt/MSRVTT_JSFUSION_test.csv\"\nDATA_PATH=\"data/msrvtt/MSRVTT_data.json\"\nFEATURES_PATH=\"data/msrvtt/msrvtt_videos_features.pickle\"\nINIT_MODEL=\"weight/univl.pretrained.bin\"\nOUTPUT_ROOT=\"ckpts\"\n\npython -m torch.distributed.launch --nproc_per_node=4 \\\nmain_task_caption.py \\\n--do_train --num_thread_reader=4 \\\n--epochs=5 --batch_size=128 \\\n--n_display=100 \\\n--train_csv ${TRAIN_CSV} \\\n--val_csv ${VAL_CSV} \\\n--data_path ${DATA_PATH} \\\n--features_path ${FEATURES_PATH} \\\n--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \\\n--do_lower_case --lr 3e-5 --max_words 48 --max_frames 48 \\\n--batch_size_val 32 --visual_num_hidden_layers 6 \\\n--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \\\n--init_model ${INIT_MODEL}\n```\n\u003eThe results are close to \n```\nBLEU_1: 0.8051, BLEU_2: 0.6672, BLEU_3: 0.5342, BLEU_4: 0.4179\nMETEOR: 0.2894, ROUGE_L: 0.6078, CIDEr: 0.5004\n```\n\n# Pretrain on HowTo100M\n\n## Format of csv\n```\nvideo_id,feature_file\nZ8xhli297v8,Z8xhli297v8.npy\n...\n```\n\n## Stage I\n```\nROOT_PATH=.\nDATA_PATH=${ROOT_PATH}/data\nSAVE_PATH=${ROOT_PATH}/models\nMODEL_PATH=${ROOT_PATH}/UniVL\npython -m torch.distributed.launch --nproc_per_node=8 \\\n${MODEL_PATH}/main_pretrain.py \\\n --do_pretrain --num_thread_reader=0 --epochs=50 \\\n--batch_size=1920 --n_pair=3 --n_display=100 \\\n--bert_model bert-base-uncased --do_lower_case --lr 1e-4 \\\n--max_words 48 --max_frames 64 --batch_size_val 344 
\\\n--output_dir ${SAVE_PATH}/pre_trained/L48_V6_D3_Phase1 \\\n--features_path ${DATA_PATH}/features \\\n--train_csv ${DATA_PATH}/HowTo100M.csv \\\n--data_path ${DATA_PATH}/caption.pickle \\\n--visual_num_hidden_layers 6 --gradient_accumulation_steps 16 \\\n--sampled_use_mil --load_checkpoint\n```\n\n## Stage II\n```\nROOT_PATH=.\nDATA_PATH=${ROOT_PATH}/data\nSAVE_PATH=${ROOT_PATH}/models\nMODEL_PATH=${ROOT_PATH}/UniVL\nINIT_MODEL=\u003cfrom first stage\u003e\npython -m torch.distributed.launch --nproc_per_node=8 \\\n${MODEL_PATH}/main_pretrain.py \\\n--do_pretrain --num_thread_reader=0 --epochs=50 \\\n--batch_size=960 --n_pair=3 --n_display=100 \\\n--bert_model bert-base-uncased --do_lower_case --lr 1e-4 \\\n--max_words 48 --max_frames 64 --batch_size_val 344 \\\n--output_dir ${SAVE_PATH}/pre_trained/L48_V6_D3_Phase2 \\\n--features_path ${DATA_PATH}/features \\\n--train_csv ${DATA_PATH}/HowTo100M.csv \\\n--data_path ${DATA_PATH}/caption.pickle \\\n--visual_num_hidden_layers 6 --decoder_num_hidden_layers 3 \\\n--gradient_accumulation_steps 60 \\\n--stage_two --sampled_use_mil \\\n--pretrain_enhance_vmodal \\\n--load_checkpoint --init_model ${INIT_MODEL}\n```\n\n# Citation\nIf you find UniVL useful in your work, you can cite the following paper:\n```\n@Article{Luo2020UniVL,\n  author  = {Huaishao Luo and Lei Ji and Botian Shi and Haoyang Huang and Nan Duan and Tianrui Li and Jason Li and Taroon Bharti and Ming Zhou},\n  title   = {UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation},\n  journal = {arXiv preprint arXiv:2002.06353},\n  year    = {2020},\n}\n```\n\n# License\nThis project is licensed under the license found in the LICENSE file in the root directory of this source tree.\n\n[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)\n\n# Acknowledgments\nOur code is based on [pytorch-transformers v0.4.0](https://github.com/huggingface/transformers/tree/v0.4.0) and 
[howto100m](https://github.com/antoine77340/howto100m). We thank the authors for their wonderful open-source efforts.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Funivl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Funivl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Funivl/lists"}