{"id":13958463,"url":"https://github.com/linjieli222/HERO","last_synced_at":"2025-07-21T00:30:59.326Z","repository":{"id":45599680,"uuid":"299721314","full_name":"linjieli222/HERO","owner":"linjieli222","description":"Research code for EMNLP 2020 paper \"HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training\"","archived":false,"fork":false,"pushed_at":"2021-09-16T20:50:12.000Z","size":254,"stargazers_count":227,"open_issues_count":5,"forks_count":34,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-08-09T13:19:07.789Z","etag":null,"topics":["pretraining","pytorch","transformers","tvr","vision-and-language"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2005.00200","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linjieli222.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-09-29T19:44:03.000Z","updated_at":"2024-07-08T16:44:47.000Z","dependencies_parsed_at":"2022-07-16T19:00:41.740Z","dependency_job_id":null,"html_url":"https://github.com/linjieli222/HERO","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linjieli222%2FHERO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linjieli222%2FHERO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linjieli222%2FHERO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linjieli222%2FHERO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linjieli222","download_url":"ht
tps://codeload.github.com/linjieli222/HERO/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226850003,"owners_count":17691896,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pretraining","pytorch","transformers","tvr","vision-and-language"],"created_at":"2024-08-08T13:01:36.927Z","updated_at":"2024-11-28T02:30:48.678Z","avatar_url":"https://github.com/linjieli222.png","language":"Python","funding_links":[],"categories":["其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training\nThis is the official repository of [HERO](https://arxiv.org/abs/2005.00200) (EMNLP 2020).\nThis repository currently supports finetuning HERO on\n[TVR](https://tvr.cs.unc.edu/), [TVQA](http://tvqa.cs.unc.edu/), [TVC](https://tvr.cs.unc.edu/tvc.html),\n[VIOLIN](https://github.com/jimmy646/violin),\n[DiDeMo](https://github.com/LisaAnne/TemporalLanguageRelease), and\n[MSR-VTT Retrieval](http://ms-multimedia-challenge.com/2017/challenge).\nThe best pre-trained checkpoint (on both the [HowTo100M](https://www.di.ens.fr/willow/research/howto100m/) and [TV](http://tvqa.cs.unc.edu/) datasets) is released. 
Code for HERO pre-training on the TV Dataset is also available.\n\n![Overview of HERO](https://convaisharables.blob.core.windows.net/hero/hero_overview.png)\n\nSome code in this repo is copied/modified from open-source implementations made available by\n[PyTorch](https://github.com/pytorch/pytorch),\n[HuggingFace](https://github.com/huggingface/transformers),\n[OpenNMT](https://github.com/OpenNMT/OpenNMT-py),\n[Nvidia](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch),\n[TVRetrieval](https://github.com/jayleicn/TVRetrieval),\n[TVCaption](https://github.com/jayleicn/TVCaption),\nand [UNITER](https://github.com/ChenRocks/UNITER).\nThe visual frame features are extracted using [SlowFast](https://github.com/facebookresearch/SlowFast) and ResNet-152. Feature extraction code is available at [HERO_Video_Feature_Extractor](https://github.com/linjieli222/HERO_Video_Feature_Extractor).\n\n\n## Requirements\nWe provide a Docker image for easier reproduction. Please install the following:\n  - [nvidia driver](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-installation) (418+),\n  - [Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/) (19.03+),\n  - [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-docker#quickstart).\n\nOur scripts require the user to have [docker group membership](https://docs.docker.com/install/linux/linux-postinstall/)\nso that docker commands can be run without sudo.\nWe only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards.\nWe use mixed-precision training, hence GPUs with Tensor Cores are recommended.\n\n## Quick Start\n*NOTE*: Please run `bash scripts/download_pretrained.sh $PATH_TO_STORAGE` to get our latest pretrained\ncheckpoints.\n\nWe use TVR as an end-to-end example for using this code base.\n\n1. 
Download processed data and pretrained models with the following command.\n    ```bash\n    bash scripts/download_tvr.sh $PATH_TO_STORAGE\n    ```\n    After downloading, you should see the following folder structure:\n    ```\n    ├── finetune\n    │   ├── tvr_default\n    ├── video_db\n    │   ├── tv\n    ├── pretrained\n    │   └── hero-tv-ht100.pt\n    └── txt_db\n        ├── tv_subtitles.db\n        ├── tvr_train.db\n        ├── tvr_val.db\n        └── tvr_test_public.db\n    ```\n\n2. Launch the Docker container for running the experiments.\n    ```bash\n    # docker image should be automatically pulled\n    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/video_db \\\n        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained\n    ```\n    The launch script respects the `$CUDA_VISIBLE_DEVICES` environment variable.\n    Note that the source code is mounted into the container under `/src` instead\n    of being built into the image, so that user modifications are reflected without\n    rebuilding the image. (Data folders are mounted into the container separately\n    for flexibility in folder structure.)\n\n\n3. Run finetuning for the TVR task.\n    ```bash\n    # inside the container\n    horovodrun -np 8 python train_vcmr.py --config config/train-tvr-8gpu.json\n\n    # for single gpu\n    python train_vcmr.py --config $YOUR_CONFIG_JSON\n    ```\n\n4. 
Run inference for the TVR task.\n    ```bash\n    # inference, inside the container\n    horovodrun -np 8 python eval_vcmr.py --query_txt_db /txt/tvr_val.db/ --split val \\\n        --vfeat_db /video/tv/ --sub_txt_db /txt/tv_subtitles.db/ \\\n        --output_dir /storage/tvr_default/ --checkpoint 4800 --fp16 --pin_mem\n\n    ```\n    The result file will be written to `/storage/tvr_default/results_val/results_4800_all.json`.\n    Change to ``--query_txt_db /txt/tvr_test_public.db/ --split test_public`` for inference on the test_public split.\n    Please format the result file as requested by the evaluation server for submission; our code does not include formatting.\n\n    The above command runs inference on the model we trained.\n    Feel free to replace `--output_dir` and `--checkpoint` with your own model trained in step 3.\n    Single GPU inference is also supported.\n\n\n5. Misc.\nFollow the steps below if you would like to reproduce the whole preprocessing pipeline.\n\n* Text annotation and subtitle preprocessing\n    ```bash\n    # outside of the container\n    bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann\n    ```\n\n* Video feature extraction\n\n    We provide feature extraction code at [HERO_Video_Feature_Extractor](https://github.com/linjieli222/HERO_Video_Feature_Extractor).\n    Please follow the link for instructions to extract both 2D ResNet features and 3D SlowFast features.\n    These features are saved as separate .npz files per video.\n\n* Video feature preprocessing and saving to LMDB\n    ```bash\n    # inside of the container\n\n    # Gather slowfast/resnet feature paths\n    python scripts/collect_video_feature_paths.py --feature_dir $PATH_TO_STORAGE/feature_output_dir \\\n        --output $PATH_TO_STORAGE/video_db --dataset $DATASET_NAME\n    \n    # Convert to lmdb\n    python scripts/convert_videodb.py --vfeat_info_file $PATH_TO_STORAGE/video_db/$DATASET_NAME/video_feat_info.pkl \\\n        --output $PATH_TO_STORAGE/video_db 
--dataset $DATASET_NAME --frame_length 1.5\n    ```\n    - `--frame_length`: one feature per `frame_length` seconds; we use 1.5/2 in our implementation. Set it to be consistent with the value used in feature extraction.\n    - `--compress`: enable compression of the LMDB\n\n## Downstream Tasks Finetuning\n\n### TVQA\nNOTE: training and inference should be run inside the Docker container.\n1. download data\n    ```bash\n    # outside of the container\n    bash scripts/download_tvqa.sh $PATH_TO_STORAGE\n    ```\n2. train\n    ```bash\n    # inside the container\n    horovodrun -np 8 python train_videoQA.py --config config/train-tvqa-8gpu.json \\\n        --output_dir $TVQA_EXP\n    ```\n3. inference\n    ```bash\n    # inside the container\n    horovodrun -np 8 python eval_videoQA.py --query_txt_db /txt/tvqa_test_public.db/ --split test_public \\\n        --vfeat_db /video/tv/ --sub_txt_db /txt/tv_subtitles.db/ \\\n        --output_dir $TVQA_EXP --checkpoint $ckpt --pin_mem --fp16\n    ```\n    The result file will be written to `$TVQA_EXP/results_test_public/results_$ckpt_all.json`, which can be submitted to the evaluation server. Please format the result file as requested by the evaluation server for submission; our code does not include formatting.\n\n### TVC\n1. download data\n    ```bash\n    # outside of the container\n    bash scripts/download_tvc.sh $PATH_TO_STORAGE\n    ```\n2. train\n    ```bash\n    # inside the container\n    horovodrun -np 8 python train_tvc.py --config config/train-tvc-8gpu.json \\\n        --output_dir $TVC_EXP\n    ```\n3. 
inference\n    ```bash\n    # inside the container\n    python inf_tvc.py --model_dir $TVC_EXP --ckpt_step 7000 \\\n        --target_clip /txt/tvc_val_release.jsonl --output tvc_val_output.jsonl\n    ```\n    - `tvc_val_output.jsonl` will be in the official TVC submission format.\n    - change to `--target_clip /txt/tvc_test_public_release.jsonl` for test results.\n\nNOTE: see `scripts/prepro_tvc.sh` for LMDB preprocessing.\n\n### VIOLIN\n1. download data\n    ```bash\n    # outside of the container\n    bash scripts/download_violin.sh $PATH_TO_STORAGE\n    ```\n2. train\n    ```bash\n    # inside the container\n    horovodrun -np 8 python train_violin.py --config config/train-violin-8gpu.json \\\n        --output_dir $VIOLIN_EXP\n    ```\n\n### DiDeMo\n1. download data\n    ```bash\n    # outside of the container\n    bash scripts/download_didemo.sh $PATH_TO_STORAGE\n    ```\n2. train\n    ```bash\n    # inside the container\n    horovodrun -np 4 python train_vcmr.py --config config/train-didemo_video_only-4gpu.json \\\n        --output_dir $DIDEMO_EXP\n    ```\n    Switch to `config/train-didemo_video_sub-8gpu.json` for ASR augmented DiDeMo results. You can also download the original ASR [here](https://convaisharables.blob.core.windows.net/hero/asr/didemo_asr.jsonl).\n\n### MSR-VTT Retrieval\n1. download data\n    ```bash\n    # outside of the container\n    bash scripts/download_msrvtt.sh $PATH_TO_STORAGE\n    ```\n2. train\n    ```bash\n    # inside the container\n    horovodrun -np 4 python train_vr.py --config config/train-msrvtt_video_only-4gpu.json \\\n        --output_dir $MSRVTT_EXP\n    ```\n    Switch to `config/train-msrvtt_video_sub-4gpu.json` for ASR augmented MSR-VTT results. 
You can also download the original ASR [here](https://convaisharables.blob.core.windows.net/hero/asr/msrvtt_asr.jsonl).\n\n### How2R and How2QA\nFor raw annotation, please refer to [How2R and How2QA](https://github.com/ych133/How2R-and-How2QA).\nFeatures and code will be available soon.\n\n## Pre-training\n1. download data\n    ```bash\n    # outside of the container\n    bash scripts/download_tv_pretrain.sh $PATH_TO_STORAGE\n    ```\n2. pre-train\n    ```bash\n    # inside of the container\n    horovodrun -np 16 python pretrain.py --config config/pretrain-tv-16gpu.json \\\n        --output_dir $PRETRAIN_EXP\n    ```\n    Unfortunately, we cannot host HowTo100M features due to their large size. Users can either process the features on their own or send an inquiry to my email address (which can be found in our paper).\n\n\n## Citation\n\nIf you find this code useful for your research, please consider citing:\n```bibtex\n@inproceedings{li2020hero,\n  title={HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training},\n  author={Li, Linjie and Chen, Yen-Chun and Cheng, Yu and Gan, Zhe and Yu, Licheng and Liu, Jingjing},\n  booktitle={EMNLP},\n  year={2020}\n}\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinjieli222%2FHERO","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinjieli222%2FHERO","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinjieli222%2FHERO/lists"}