{"id":13787837,"url":"https://github.com/gabeur/mmt","last_synced_at":"2025-05-12T02:30:35.593Z","repository":{"id":41490969,"uuid":"300224086","full_name":"gabeur/mmt","owner":"gabeur","description":"Multi-Modal Transformer for Video Retrieval","archived":false,"fork":false,"pushed_at":"2024-10-09T17:16:48.000Z","size":823,"stargazers_count":258,"open_issues_count":0,"forks_count":41,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-11-18T01:39:16.670Z","etag":null,"topics":["fusion","language","multimodal","nlp","video","vision"],"latest_commit_sha":null,"homepage":"http://thoth.inrialpes.fr/research/MMT/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gabeur.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-01T09:35:36.000Z","updated_at":"2024-11-06T02:49:33.000Z","dependencies_parsed_at":"2022-09-21T10:43:29.846Z","dependency_job_id":null,"html_url":"https://github.com/gabeur/mmt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabeur%2Fmmt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabeur%2Fmmt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabeur%2Fmmt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabeur%2Fmmt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gabeur","download_url":"https://codeload.github.com/gabeur/mmt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","r
epositories_count":253662531,"owners_count":21944090,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fusion","language","multimodal","nlp","video","vision"],"created_at":"2024-08-03T21:00:32.198Z","updated_at":"2025-05-12T02:30:34.935Z","avatar_url":"https://github.com/gabeur.png","language":"Python","funding_links":[],"categories":["其他_机器视觉","Python","Implementations"],"sub_categories":["网络服务_其他"],"readme":"# MMT: Multi-modal Transformer for Video Retrieval\n\n![architecture](figs/Cross_mod_architecture.png)\n\n## Intro\n\nThis repository provides the code for training our video retrieval cross-modal architecture.\nOur approach is described in the paper \"Multi-modal Transformer for Video Retrieval\" [[arXiv](https://arxiv.org/abs/2007.10639), [webpage](http://thoth.inrialpes.fr/research/MMT/)]\n\nOur proposed Multi-Modal Transformer (MMT) aggregates sequences of multi-modal features (e.g. appearance, motion, audio, OCR, etc.) from a video. It then embeds the aggregated multi-modal feature to a shared space with text for retrieval. 
It achieves state-of-the-art performance on MSRVTT, ActivityNet and LSMDC datasets.\n\n## Installing\n```bash\ngit clone https://github.com/gabeur/mmt.git\n```\n\n## Requirements\n* Python 3.7\n* Pytorch 1.4.0\n* Transformers 3.1.0\n* Numpy 1.18.1\n\n```bash\ncd mmt\n# Install the requirements\npip install -r requirements.txt\n```\n\n## ECCV paper\n\nIn order to reproduce the results of our ECCV 2020 Spotlight paper, please first download the video features from [this page](http://thoth.inrialpes.fr/research/video-features/) by running the following commands:\n\n```bash\n# Create and move to mmt/data directory\nmkdir data\ncd data\n# Download the video features\nwget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz\nwget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz\nwget http://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz\n# Extract the video features\ntar -xvf MSRVTT.tar.gz\ntar -xvf activity-net.tar.gz\ntar -xvf LSMDC.tar.gz\n```\n\nDownload the checkpoints:\n```bash\n# Create and move to mmt/data/checkpoints directory\nmkdir checkpoints\ncd checkpoints\n# Download checkpoints\nwget http://pascal.inrialpes.fr/data2/vgabeur/mmt/data/checkpoints/HowTo100M_full_train.pth\nwget http://pascal.inrialpes.fr/data2/vgabeur/mmt/data/checkpoints/MSRVTT_jsfusion_trainval.pth\nwget http://pascal.inrialpes.fr/data2/vgabeur/mmt/data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth\n```\n\nYou can then run the following scripts:\n\n### MSRVTT\n\n#### Training from scratch\n\nTraining + evaluation:\n```bash\npython -m train --config configs_pub/eccv20/MSRVTT_jsfusion_trainval.json\n```\n\nEvaluation from checkpoint:\n```bash\npython -m train --config configs_pub/eccv20/MSRVTT_jsfusion_trainval.json --only_eval --load_checkpoint data/checkpoints/MSRVTT_jsfusion_trainval.pth\n```\n\nExpected results:\n```\nMSRVTT_jsfusion_test:\nt2v_metrics/R1/final_eval: 24.1\nt2v_metrics/R5/final_eval: 
56.4\nt2v_metrics/R10/final_eval: 69.6\nt2v_metrics/R50/final_eval: 90.4\nt2v_metrics/MedR/final_eval: 4.0\nt2v_metrics/MeanR/final_eval: 25.797\nt2v_metrics/geometric_mean_R1-R5-R10/final_eval: 45.56539387310681\nv2t_metrics/R1/final_eval: 25.9\nv2t_metrics/R5/final_eval: 58.1\nv2t_metrics/R10/final_eval: 69.3\nv2t_metrics/R50/final_eval: 90.8\nv2t_metrics/MedR/final_eval: 4.0\nv2t_metrics/MeanR/final_eval: 22.852\nv2t_metrics/geometric_mean_R1-R5-R10/final_eval: 47.06915231647284\n```\n\n#### Finetuning from a HowTo100M pretrained model:\n\nTraining + evaluation:\n```bash\npython -m train --config configs_pub/eccv20/prtrn_MSRVTT_jsfusion_trainval.json --load_checkpoint data/checkpoints/HowTo100M_full_train.pth\n```\n\nEvaluation from checkpoint:\n```bash\npython -m train --config configs_pub/eccv20/prtrn_MSRVTT_jsfusion_trainval.json --only_eval --load_checkpoint data/checkpoints/prtrn_MSRVTT_jsfusion_trainval.pth\n```\n\nExpected results:\n```\nMSRVTT_jsfusion_test:\nt2v_metrics/R1/final_eval: 25.8\nt2v_metrics/R5/final_eval: 57.2\nt2v_metrics/R10/final_eval: 69.3\nt2v_metrics/R50/final_eval: 90.7\nt2v_metrics/MedR/final_eval: 4.0\nt2v_metrics/MeanR/final_eval: 22.355\nt2v_metrics/geometric_mean_R1-R5-R10/final_eval: 46.76450299746546\nv2t_metrics/R1/final_eval: 26.1\nv2t_metrics/R5/final_eval: 57.8\nv2t_metrics/R10/final_eval: 68.5\nv2t_metrics/R50/final_eval: 90.6\nv2t_metrics/MedR/final_eval: 4.0\nv2t_metrics/MeanR/final_eval: 20.056\nv2t_metrics/geometric_mean_R1-R5-R10/final_eval: 46.92665942024404\n```\n\n### ActivityNet\n\nTraining from scratch\n```bash\npython -m train --config configs_pub/eccv20/ActivityNet_val1_trainval.json\n```\n\n### LSMDC\n\nTraining from scratch\n```bash\npython -m train --config configs_pub/eccv20/LSMDC_full_trainval.json\n```\n\n## References\nIf you find this code useful or use the \"s3d\"(motion) video features, please consider citing:\n```\n@inproceedings{gabeur2020mmt,\n    TITLE = {{Multi-modal Transformer for Video 
Retrieval}},\n    AUTHOR = {Gabeur, Valentin and Sun, Chen and Alahari, Karteek and Schmid, Cordelia},\n    BOOKTITLE = {{European Conference on Computer Vision (ECCV)}},\n    YEAR = {2020}\n}\n```\n\nThe features \"face\", \"ocr\", \"rgb\" (appearance), \"scene\" and \"speech\" were extracted by the authors of [Collaborative Experts](https://github.com/albanie/collaborative-experts). If you use those features, please consider citing:\n```\n@inproceedings{Liu2019a,\n    author = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},\n    booktitle = {British Machine Vision Conference},\n    title = {Use What You Have: Video retrieval using representations from collaborative experts},\n    date = {2019}\n}\n```\n\n## Acknowledgements\n\nOur code follows the structure of the [template](https://github.com/victoresque/pytorch-template) proposed by @victoresque and builds on the implementations of [Collaborative Experts](https://github.com/albanie/collaborative-experts), [Transformers](https://github.com/huggingface/transformers) and [Mixture of Embedding Experts](https://github.com/antoine77340/Mixture-of-Embedding-Experts). We thank Maksim Dzabraev for discovering bugs in our implementation and notifying us of the issues (see the issues section for more details).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgabeur%2Fmmt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgabeur%2Fmmt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgabeur%2Fmmt/lists"}