{"id":19279893,"url":"https://github.com/showlab/egovlp","last_synced_at":"2025-07-16T04:38:46.552Z","repository":{"id":41311022,"uuid":"498228650","full_name":"showlab/EgoVLP","owner":"showlab","description":"[NeurIPS 2022] Egocentric Video-Language Pretraining","archived":false,"fork":false,"pushed_at":"2024-05-09T05:52:12.000Z","size":2066,"stargazers_count":241,"open_issues_count":5,"forks_count":20,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-20T10:03:26.745Z","etag":null,"topics":["egocentric-vision","pretraining","pytorch","video-language"],"latest_commit_sha":null,"homepage":"https://arxiv.org/pdf/2206.01670.pdf","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/showlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-31T07:20:06.000Z","updated_at":"2025-05-07T02:23:51.000Z","dependencies_parsed_at":"2024-05-09T06:49:13.359Z","dependency_job_id":null,"html_url":"https://github.com/showlab/EgoVLP","commit_stats":{"total_commits":47,"total_committers":2,"mean_commits":23.5,"dds":"0.021276595744680882","last_synced_commit":"928406c2d5d42b4d76421f6500e43feb61fc9d68"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/showlab/EgoVLP","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FEgoVLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FEgoVLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FEgoVLP/releases","manifests_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FEgoVLP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/showlab","download_url":"https://codeload.github.com/showlab/EgoVLP/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FEgoVLP/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265482006,"owners_count":23773990,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["egocentric-vision","pretraining","pytorch","video-language"],"created_at":"2024-11-09T21:16:13.074Z","updated_at":"2025-07-16T04:38:46.501Z","avatar_url":"https://github.com/showlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# EgoVLP: Egocentric Video-Language Pretraining\n\n[Project page](https://qinghonglin.github.io/EgoVLP/) | [arXiv](https://arxiv.org/pdf/2206.01670.pdf)\n\n\u003e **TL;DR:** We pioneer Egocentric Video-Language Pretraining from pretraining dataset, model and development benchmark; the resulted pretrained model exhibits strong performance on five downstream tasks across three egocentric datasets.\n\n![EgoVLP](figures/egovlp_framework.jpg)\n\n## 📢 News\n\n- [2023.7.12] [**EgoVLPv2**](https://shramanpramanick.github.io/EgoVLPv2/) has been released, directed by [Shraman](https://shramanpramanick.github.io/), with stronger performance and higher efficiency, which has been accepted by [**ICCV 2023**](https://iccv2023.thecvf.com/)!\n- [2022.12.22] We clean the code and provide video features to power NLQ \u0026 MQ, 
Ego4D challenges.\n- [2022.9.15] EgoVLP got accepted by [**NeurIPS 2022**](https://nips.cc/) as **Spotlight**!\n- [2022.6.30] We release the first version of the EgoVLP codebase.\n- [2022.6.20] Our EgoVLP won [**1st place** in OSCC](https://eval.ai/web/challenges/challenge-page/1627/overview) \u0026 [**2nd place** in NLQ](https://eval.ai/web/challenges/challenge-page/1629/overview) \u0026 [**3rd place** in PNR](https://eval.ai/web/challenges/challenge-page/1622/overview) @ [Ego4D  Challenge 2022](https://ego4d-data.org/docs/challenge/), and [**1st place** in Multi-Instance Retrieval](https://codalab.lisn.upsaclay.fr/competitions/617#learn_the_details) @ [EPIC-Kitchens Challenge 2022](https://epic-kitchens.github.io/2022), hosted by CVPR 2022.\n- [2022.6.10] We release the EgoClip pretraining dataset.\n- [2022.6.3] We release the arXiv paper.\n\n## 📝 Preparation\n### Install dependencies \n```bash\nconda env create -f environment.yml\nsource activate egovlp\n```\n\n### Ego4D videos and metadata\n\u003e You can skip the source video download if pretraining is not required.\n1. Follow the guideline [here](https://ego4d-data.org/docs/start-here/#cli-download), download the following to  `{PATH_TO_EGO4D}`\n   - Ego4D source videos (nearly 7 TB).\n   - Ego4D videos metadata `manifest.csv` and benchmark metadata, e.g., `nlq_train.json` for NLQ.\n   - Create the dir `dataset` and add a soft link by `ln -s {PATH_TO_EGO4D} dataset/ego4d`.\n\n2. 
For efficient pretraining, we compress the videos as follows:\n   - Resize the source videos so that the short side equals 256, using the script `utils/video_resize.py`.\n   - Chunk the resized videos into multiple segments (up to 600 sec each) with the script `utils/video_chunk.py`.\n\n### EgoClip: an egocentric video-language pretraining dataset\n- Download the EgoClip metadata from [here](https://drive.google.com/file/d/1-aaDu_Gi-Y2sQI_2rsI2D1zvQBJnHpXl/view?usp=sharing) and save it as `dataset/egoclip.csv`.\n\n- For the usage of EgoClip, please see our dataloader `data_loader/EgoClip_EgoMCQ_dataset.py`. The data format of EgoClip is:\n  ```python\n  import pandas as pd\n  \n  metadata = pd.read_csv('dataset/egoclip.csv', sep='\\t', error_bad_lines=False)\n  print(metadata.shape[0])\n  print(metadata.iloc[0])\n  \n  # Out:\n  3847723                                                         # Num of clips for EgoClip\n  \n  clip_idx                                                     0  # the idx of clip\n  video_uid                 001e3e4e-2743-47fc-8564-d5efd11f9e90  # the uid of source video\n  video_dur                                           128.033333  # the duration of source video\n  narration_source                              narration_pass_1  # the source of annotator\n  narration_ind                                                0  # the idx of narration\n  narration_time                                          3.3445  # the narration timestamp\n  clip_start                                            2.967651  # the start timestamp of clip\n  clip_end                                              3.721266  # the end timestamp of clip\n  clip_text           #C C picks a bag of clothes from the floor  # the narration of clip\n  tag_verb                                                  [93]  # the verb idx of the narration\n  tag_noun                                        [192, 115, 12]  # the noun idx of the narration\n  ```\n  \n^ The terms `tag_verb` and 
`tag_noun` are used by the EgoNCE pretraining objective, which accounts for synonyms. For example, `pick`, `collect`, and `gather` all belong to the verb parent with idx 93: `take_(pick,_grab,_get)`.\nThe mapping dictionary can be found [here](https://drive.google.com/drive/folders/16fUv5rrZmt06Ty3QAEweDpveC-84RI9Z?usp=sharing).\n\n### EgoMCQ: an egocentric video-language development set\n\n- Download the EgoMCQ metadata from [here](https://drive.google.com/file/d/1-5iRYf4BCHmj4MYQYFRMY4bhsWJUN3rW/view?usp=sharing) and save it as `dataset/egomcq.json`.\n- EgoMCQ is a benchmark for video-language multiple-choice questions. Given a text query, we want the model to choose the correct video clip from five candidates that are sampled under one of two settings: `inter-video` or `intra-video`.\n- For the usage of EgoMCQ, please see our dataloader `data_loader/EgoClip_EgoMCQ_dataset.py`.\n\n![EgoMCQ](figures/egomcq.jpg)\n\n## 🏋️‍️ Pretraining\nThis code is built on PyTorch with DistributedDataParallel (DDP). We pretrain EgoVLP on 4 nodes, each with 8 A100 GPUs (10 epochs in about two days).\n\n- Train on EgoClip:  `python3 -m torch.distributed.launch \n  --nnodes=$HOST_NUM \n  --node_rank=$INDEX \n  --master_addr $CHIEF_IP \n  --nproc_per_node $HOST_GPU_NUM \n  --master_port 8081 \n  run/train_egoclip.py --config configs/pt/egoclip.json`\n  \n- Test on EgoMCQ:  `python3 -m torch.distributed.launch \n  --nnodes=$HOST_NUM \n  --node_rank=$INDEX \n  --master_addr $CHIEF_IP \n  --nproc_per_node $HOST_GPU_NUM \n  --master_port 8081 \n  run/train_egoclip.py --config configs/eval/egomcq.json`\n  \n- Monitor the EgoMCQ curve during pretraining: `tensorboard --logdir results  --bind_all`\n\n## 🗄 Pretrained Weights\n- We have released our pretrained EgoVLP model (EgoClip w/ EgoNCE) with the best performance on EgoMCQ (90.7% inter-video \u0026 57.2% intra-video) in [EgoVLP_PT_BEST](https://drive.google.com/file/d/1-cP3Gcg0NGDcMZalgJ_615BQdbFIbcj7/view?usp=sharing).\n- Please download and put the checkpoint 
under: `pretrained/`\n\n^ This checkpoint is used for the EPIC-Kitchens, NLQ, MQ, OSCC, and PNR tasks, but not for Charades-Ego. Since we found that VLP (CC3M+WebVid2M, EgoClip) always degrades significantly on Charades-Ego after the first epoch, we evaluate Charades-Ego using the first pretraining epoch weights of EgoVLP in [EgoVLP_PT_EPO1](https://drive.google.com/file/d/10lRA4Fldt-c5Azh5D2Zvjwi-_YR5ve5e/view?usp=sharing).\n\n^^ You can use our checkpoint to power other egocentric video benchmarks. :)\n\n## 🔧 Downstream Tasks\n### EPIC-Kitchens MIR\n\n- **Preparation:**\n\n1. Follow the instructions [here](https://epic-kitchens.github.io/2022) to download the EPIC-Kitchens dataset (RGB frames) and annotations to the path: `dataset/epic-kitchens/`\n2. Follow the instructions [here -\u003e How do I create the relevance matrix?](https://github.com/mwray/Joint-Part-of-Speech-Embeddings) to construct a relevance matrix for evaluation.\n\n- **Results:**\n\n| Model   | Mode                                              | # Frames | Video-Text PT     | Weights                                                 | mAP (V2T) | mAP (T2V) | mAP (Avg) | nDCG (V2T) | nDCG (T2V) | nDCG (Avg) |\n| ------- | ------------------------------------------------- | ------ | ----------------- | ------------------------------------------------------------ | --------- | --------- | --------- | ---------- | ---------- | ---------- |\n| EgoVLP  | Zero-shot                                         | 4      | EgoClip w/ EgoNCE | [EgoVLP_PT_BEST](https://drive.google.com/file/d/1-cP3Gcg0NGDcMZalgJ_615BQdbFIbcj7/view?usp=sharing) | 19.4      | 13.9      | 16.6      | 24.1       | 22.0       | 23.1       |\n| EgoVLP  | Fine-tuning w/\u003cbr /\u003e MI-MM                        | 16     | EgoClip w/ EgoNCE | [EgoVLP_FT_EPIC](https://drive.google.com/file/d/1-YEHZ-WBCnO-LZEsDF14jo-pLSJKTp2G/view?usp=sharing) | 49.9      | 40.5      | 45.0      | 60.9       | 57.9       | 59.4       |\n| EgoVLP+ | Fine-tuning w/ 
Adaptive-MI-MM + Dual-softmax | 16     | EgoClip w/ EgoNCE | [EgoVLP_FT_EPIC+](https://drive.google.com/file/d/1-SOQeXc-xSn544sJzgFLhC95hkQsm0BR/view?usp=sharing) | **53.8**  | **40.9**  | **47.4**  | **63.3**   | **59.6**   | **61.4**   |\n\n^ EgoVLP+ denotes our submission to [Multi-Instance Retrieval@EPIC-Kitchens Challenge 2022](https://arxiv.org/abs/2207.01334), which uses the Adaptive MI-MM loss and Dual-softmax for prediction.\n\n- Train: `python3 -m torch.distributed.launch --nnodes=$HOST_NUM  --node_rank=$INDEX  --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_epic.py --config configs/ft/epic.json`\n\n- Test: `python3 run/test_epic.py`\n\n### Charades-Ego\n- **Preparation:**\n\n1. Follow the instructions [here](https://prior.allenai.org/projects/charades-ego) to download the Charades-Ego dataset (480p) and annotations to the path: `dataset/charades/`\n2. Create the training metadata via `utils/charades_meta.py` \n\n- **Results:**\n\n| Model  | Mode        | # Frames | Video-Text PT     | Weights | mAP  |\n| ------ | ----------- | -------- | ----------------- | ----------------- | ---- |\n| EgoVLP | Zero-shot   | 16       | EgoClip w/ EgoNCE | [EgoVLP_PT_EPO1](https://drive.google.com/file/d/10lRA4Fldt-c5Azh5D2Zvjwi-_YR5ve5e/view?usp=sharing)                  | 25.0 |\n| EgoVLP | Fine-tuning w/ InfoNCE| 16       | EgoClip w/ EgoNCE | [EgoVLP_FT_CHARADES](https://drive.google.com/file/d/1-xWVDH7XO4pi6Hj5QRpKVz6y-QkqcFlQ/view?usp=sharing)                  | **32.1** |\n\n- Train: `python3 -m torch.distributed.launch --nnodes=$HOST_NUM  --node_rank=$INDEX  --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_charades.py --config configs/ft/charades.json`\n\n- Test: `python3 run/test_charades.py`\n\n### NLQ @ Ego4D\n- **Preparation:** \n\n1. Make sure you have prepared the NLQ metadata. \n2. For the video branch, download the EgoVLP clip-level features for NLQ. 
^ We extract these dense video features (fps=1.87) with the script `run/test_nlq.py`.\n   - [NLQ Train \u0026 Val](https://drive.google.com/file/d/1TXBlLDqDuL_XPCuXlgiikEfVXO8Ly6eM/view)\n   - [NLQ Test](https://drive.google.com/file/d/1-CGZg9t-kpW5bmg9M62VHk5eYTllsPKV/view)\n3. For the text branch, you can extract EgoVLP text features with `python3 run/test_nlq.py --subsample 'text'` or use our pretrained text encoder.\n4. Fine-tune [VSLNet](https://github.com/EGO4D/episodic-memory/tree/main/NLQ/VSLNet) or other methods by replacing their input video-text features.\n\n^ We provide [our VSLNet codebase](https://github.com/QinghongLin/EgoVLP_episodic_memory), which adapts EgoVLP features, as an example; you can refer to its data loader and text encoder.\n\n^ Our EgoVLP brings consistent improvements over multiple NLQ challenge baselines. \n\n| Model  | Video-Text Pre-extracted Features        | R@1, IoU=0.3 | R@5, IoU=0.3  | R@1, IoU=0.5 | R@5, IoU=0.5   |\n| ------ | ----------- | -------- | ----------------- | ----------------- | ---- |\n| [VSLNet](https://github.com/EGO4D/episodic-memory/tree/main/NLQ/VSLNet) | SlowFast + BERT  | 5.45 | 10.74 | 3.12 | 6.63 |\n| [VSLNet](https://github.com/EGO4D/episodic-memory/tree/main/NLQ/VSLNet) | EgoVLP | **10.84** | **18.84** | **6.81** | **13.45** |\n| [CONE](https://arxiv.org/abs/2209.10918) | SlowFast + BERT  | 10.40 | 22.74 | 5.03 | 11.87 |\n| [CONE](https://arxiv.org/abs/2209.10918) | EgoVLP | **14.15** | **30.33** | **8.18** | **18.02** |\n\n### MQ @ Ego4D\n- **Preparation:**\n\n1. Make sure you have prepared the MQ metadata.\n2. Download the EgoVLP clip-level features for MQ. ^ We extract these dense video features (fps=1.87) with the script `run/test_mq.py`.\n   - [MQ Train \u0026 Val]( https://drive.google.com/file/d/1-HEUCdyfNX7CBZhz40yiyTr7to_p7wUi/view )\n   - [MQ Test]( https://drive.google.com/file/d/1-JmezY3eIkHKJ1JBA_AA8QWBoY3W2HpS/view)\n3. 
Fine-tune [VSGN](https://github.com/EGO4D/episodic-memory/tree/main/MQ) or other methods by replacing their input video features.\n\n^ We provide [our VSGN codebase](https://github.com/QinghongLin/EgoVLP_episodic_memory), which adapts EgoVLP features, as an example; you can refer to its data loader.\n\n^ Our EgoVLP brings consistent improvements over multiple MQ challenge baselines. \n\n| Model  | Video Pre-extracted Features         | R@1, IoU=0.5 | R@5, IoU=0.5   | mAP |\n| ------ | ----------- | --------  | ---- | ---- |\n| [VSGN](https://github.com/EGO4D/episodic-memory/tree/main/MQ) | SlowFast  | 25.16 | 46.18 | 6.03 |\n| [VSGN](https://github.com/EGO4D/episodic-memory/tree/main/MQ) | EgoVLP | **30.14** | **51.98** | **11.39** |\n| [ActionFormer](https://arxiv.org/pdf/2211.09074.pdf) | SlowFast + Omnivore  | 33.46 | - | 17.17 |\n| [ActionFormer](https://arxiv.org/pdf/2211.09074.pdf) | SlowFast + Omnivore + EgoVLP | **36.84** | - | **20.90** |\n\n### OSCC @ Ego4D\n- **Preparation:**\n \n1. Make sure you have prepared the OSCC videos and metadata.\n2. Extract the clip frames following the instructions [here -\u003e Data Preparation](https://github.com/EGO4D/hands-and-objects/tree/main/state-change-localization-classification/i3d-resnet50).\n\n- Train: `python3 -m torch.distributed.launch --nnodes=$HOST_NUM  --node_rank=$INDEX  --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_oscc.py --config configs/ft/oscc.json`\n\n| Model  | Video-Text Pretrained | OSCC Acc % |\n| ------ | ----------- | --------  | \n| TimeSformer | ImageNet Init. | 70.3 |\n| TimeSformer | EgoVLP | **73.9** |\n\n### PNR @ Ego4D\n- **Preparation:** Same as OSCC.\n- Train: `python3 -m torch.distributed.launch --nnodes=$HOST_NUM  --node_rank=$INDEX  --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_pnr.py --config configs/ft/pnr.json`\n\n| Model  | Video-Text Pretrained | PNR Err % |\n| ------ | ----------- | --------  | \n| TimeSformer | ImageNet Init. 
| 0.616 |\n| TimeSformer | EgoVLP | 0.622 |\n\n^ We found that the effect of VLP is minor on the PNR task.\n\n## 🎓 Citation\n\nIf you find our work helpful, please cite our paper.\n\n```bibtex\n@article{kevin2022egovlp,\n  title={Egocentric Video-Language Pretraining},\n  author={Lin, Kevin Qinghong and Wang, Alex Jinpeng and Soldan, Mattia and Wray, Michael and Yan, Rui and Xu, Eric Zhongcong and Gao, Difei and Tu, Rongcheng and Zhao, Wenzhe and Kong, Weijie and others},\n  journal={arXiv preprint arXiv:2206.01670},\n  year={2022}\n}\n```\n\n## ✉️ Contact\n\nThis repo is maintained by [Kevin](https://github.com/QinghongLin). Questions and discussions are welcome via `kevin.qh.lin@gmail.com`.\n\nWe are happy to merge results and code if you transfer EgoVLP to other egocentric tasks or datasets.\n\n## 🙏 Acknowledgements\n\nThis codebase is based on [Frozen](https://github.com/m-bain/frozen-in-time). \n\nThanks to [Alex](https://github.com/fingerrec) for the help with DDP and [Mattia](https://github.com/Soldelli) for the help with NLQ and MQ benchmarks.\n\n## LICENSE\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Fegovlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshowlab%2Fegovlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Fegovlp/lists"}