{"id":18322485,"url":"https://github.com/tencentarc/umt","last_synced_at":"2025-07-21T09:03:38.019Z","repository":{"id":38737462,"uuid":"469657056","full_name":"TencentARC/UMT","owner":"TencentARC","description":"UMT is a unified and flexible framework which can handle different input modality combinations, and output video moment retrieval and/or highlight detection results.","archived":false,"fork":false,"pushed_at":"2024-04-15T13:20:31.000Z","size":1188,"stargazers_count":212,"open_issues_count":2,"forks_count":19,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-05-07T21:36:11.293Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-14T09:10:03.000Z","updated_at":"2025-05-04T16:30:43.000Z","dependencies_parsed_at":"2024-04-15T15:02:50.608Z","dependency_job_id":"a9f73b1b-b9c6-45de-8299-d1538fd52823","html_url":"https://github.com/TencentARC/UMT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TencentARC/UMT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FUMT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FUMT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FUMT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FUMT/manifests
","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentARC","download_url":"https://codeload.github.com/TencentARC/UMT/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FUMT/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266270387,"owners_count":23902731,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T18:24:49.553Z","updated_at":"2025-07-21T09:03:38.001Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Unified Multi-modal Transformers\n\n[![DOI](https://badgen.net/badge/DOI/10.1109%2FCVPR52688.2022.00305/blue?cache=300)](https://doi.org/10.1109/CVPR52688.2022.00305)\n[![arXiv](https://badgen.net/badge/arXiv/2203.12745/red?cache=300)](https://arxiv.org/abs/2203.12745)\n[![License](https://badgen.net/badge/License/BSD%203-Clause%20License?color=cyan\u0026cache=300)](https://github.com/TencentARC/UMT/blob/main/LICENSE)\n\nThis repository maintains the official implementation of the paper **UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection** by [Ye Liu](https://yeliu.dev/), Siyuan Li, [Yang Wu](https://scholar.google.com/citations?user=vwOQ-UIAAAAJ), [Chang Wen Chen](https://web.comp.polyu.edu.hk/chencw/), [Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ), and [Xiaohu Qie](https://scholar.google.com/citations?user=mk-F69UAAAAJ), which has been accepted by [CVPR 
2022](https://cvpr2022.thecvf.com/).\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"850\" src=\"https://raw.githubusercontent.com/TencentARC/UMT/main/.github/model.svg\"\u003e\u003c/p\u003e\n\n## Installation\n\nWe use the following environment settings. You may install these packages manually if you run into any problems during automatic installation.\n\n- CUDA 11.5.0\n- cuDNN 8.3.2.44\n- Python 3.10.0\n- PyTorch 1.11.0\n- [NNCore](https://github.com/yeliudev/nncore) 0.3.6\n\n### Install from source\n\n1. Clone the repository from GitHub.\n\n```\ngit clone https://github.com/TencentARC/UMT.git\ncd UMT\n```\n\n2. Install dependencies.\n\n```\npip install -r requirements.txt\n```\n\n## Getting Started\n\n### Download and prepare the datasets\n\n1. Download and extract the datasets.\n\n- [QVHighlights](https://huggingface.co/yeliudev/UMT/resolve/main/datasets/qvhighlights-a8559488.zip)\n- [Charades-STA](https://huggingface.co/yeliudev/UMT/resolve/main/datasets/charades-2c9f7bab.zip)\n- [YouTube Highlights](https://huggingface.co/yeliudev/UMT/resolve/main/datasets/youtube-8a12ff08.zip)\n- [TVSum](https://huggingface.co/yeliudev/UMT/resolve/main/datasets/tvsum-ec05ad4e.zip)\n\n2. 
Prepare the files in the following structure.\n\n```\nUMT\n├── configs\n├── datasets\n├── models\n├── tools\n├── data\n│   ├── qvhighlights\n│   │   ├── *features\n│   │   ├── highlight_{train,val,test}_release.jsonl\n│   │   └── subs_train.jsonl\n│   ├── charades\n│   │   ├── *features\n│   │   └── charades_sta_{train,test}.txt\n│   ├── youtube\n│   │   ├── *features\n│   │   └── youtube_anno.json\n│   └── tvsum\n│       ├── *features\n│       └── tvsum_anno.json\n├── README.md\n├── setup.cfg\n└── ···\n```\n\n### Train a model\n\nRun the following command to train a model using a specified config.\n\n```shell\n# Single GPU\npython tools/launch.py ${path-to-config}\n\n# Multiple GPUs\ntorchrun --nproc_per_node=${num-gpus} tools/launch.py ${path-to-config}\n```\n\n### Test a model and evaluate results\n\nRun the following command to test a model and evaluate results.\n\n```\npython tools/launch.py ${path-to-config} --checkpoint ${path-to-checkpoint} --eval\n```\n\n### Pre-train with ASR captions on QVHighlights\n\nRun the following command to pre-train a model using ASR captions on QVHighlights.\n\n```\ntorchrun --nproc_per_node=4 tools/launch.py configs/qvhighlights/umt_base_pretrain_100e_asr.py\n```\n\n## Model Zoo\n\nWe provide multiple pre-trained models and training logs here. 
All the models are trained with a single NVIDIA Tesla V100-FHHL-16GB GPU and are evaluated using the default metrics of the datasets.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth rowspan=\"2\"\u003eDataset\u003c/th\u003e\n    \u003cth rowspan=\"2\"\u003eModel\u003c/th\u003e\n    \u003cth rowspan=\"2\"\u003eType\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003eMR mAP\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003eHD mAP\u003c/th\u003e\n    \u003cth rowspan=\"2\"\u003eDownload\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003cth\u003eR1@0.5\u003c/th\u003e\n    \u003cth\u003eR1@0.7\u003c/th\u003e\n    \u003cth\u003eR5@0.5\u003c/th\u003e\n    \u003cth\u003eR5@0.7\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\" rowspan=\"2\"\u003e\n      \u003ca href=\"https://arxiv.org/abs/2107.09609\"\u003eQVHighlights\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/qvhighlights/umt_base_200e_qvhighlights.py\"\u003eUMT-B\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e38.59\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e39.85\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_200e_qvhighlights-9a13c673.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_200e_qvhighlights.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/qvhighlights/umt_base_200e_qvhighlights.py\"\u003eUMT-B\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003ew/ PT\u003c/td\u003e\n    \u003ctd align=\"center\" 
colspan=\"2\"\u003e39.26\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e40.10\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_finetune_200e_qvhighlights-d674a657.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_finetune_200e_qvhighlights.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\" rowspan=\"2\"\u003e\n      \u003ca href=\"https://arxiv.org/abs/1705.02101\"\u003eCharades-STA\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/charades/umt_base_va_100e_charades.py\"\u003eUMT-B\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eV + A\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e48.31\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e29.25\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e88.79\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e56.08\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_va_100e_charades-b51a65aa.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_va_100e_charades.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/charades/umt_base_vo_100e_charades.py\"\u003eUMT-B\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eV + O\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e49.35\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e26.16\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e89.41\u003c/td\u003e\n    \u003ctd 
align=\"center\"\u003e54.95\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_vo_100e_charades-39ec9829.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_vo_100e_charades.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\" rowspan=\"6\"\u003e\n      \u003ca href=\"https://doi.org/10.1007/978-3-319-10590-1_51\"\u003eYouTube\u003cbr\u003eHighlights\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/youtube/umt_small_100e_youtube_dog.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eDog\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e65.93\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_dog-90f2189e.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_dog.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/youtube/umt_small_100e_youtube_gym.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eGymnastics\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e75.20\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_gym-fe749774.pth\"\u003emodel\u003c/a\u003e |\n      
\u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_gym.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/youtube/umt_small_100e_youtube_par.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eParkour\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e81.64\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_par-4d8a9e8b.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_par.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/youtube/umt_small_100e_youtube_ska.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eSkating\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e71.81\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_ska-f12710a8.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_ska.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/youtube/umt_small_100e_youtube_ski.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd 
align=\"center\"\u003eSkiing\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e72.27\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_ski-1ca38d91.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_ski.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/youtube/umt_small_100e_youtube_sur.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eSurfing\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e82.71\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_sur-9be4b575.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_100e_youtube_sur.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\" rowspan=\"10\"\u003e\n      \u003ca href=\"https://doi.org/10.1109/cvpr.2015.7299154\"\u003eTVSum\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_vt.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eVT\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e87.54\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca 
href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_vt-3eff6e1b.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_vt.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_vu.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eVU\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e81.51\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_vu-ea40b5ee.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_vu.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_ga.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eGA\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e88.22\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_ga-7217ee96.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_ga.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca 
href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_ms.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eMS\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e78.81\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_ms-a41636ac.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_ms.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_pk.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003ePK\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e81.42\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_pk-4ea24b6c.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_pk.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_pr.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003ePR\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e86.96\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca 
href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_pr-815f527a.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_pr.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_fm.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eFM\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e75.96\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_fm-cf6ebb1d.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_fm.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_bk.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eBK\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e86.89\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_bk-12c75dff.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_bk.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca 
href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_bt.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eBT\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e84.42\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_bt-3b666738.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_bt.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://github.com/TencentARC/UMT/blob/main/configs/tvsum/umt_small_500e_tvsum_ds.py\"\u003eUMT-S\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003eDS\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e—\u003c/td\u003e\n    \u003ctd align=\"center\" colspan=\"2\"\u003e79.63\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_ds-55549243.pth\"\u003emodel\u003c/a\u003e |\n      \u003ca href=\"https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_small_500e_tvsum_ds.json\"\u003emetrics\u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nHere, `w/ PT` means initializing the model using pre-trained [weights](https://huggingface.co/yeliudev/UMT/resolve/main/checkpoints/umt_base_pretrain_100e_asr-ebae4090.pth) on ASR captions. 
`V`, `A`, and `O` indicate video, audio, and optical flow, respectively.\n\n## Citation\n\nIf you find this project useful for your research, please kindly cite our paper.\n\n```bibtex\n@inproceedings{liu2022umt,\n  title={UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection},\n  author={Liu, Ye and Li, Siyuan and Wu, Yang and Chen, Chang Wen and Shan, Ying and Qie, Xiaohu},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n  pages={3042--3051},\n  year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fumt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftencentarc%2Fumt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fumt/lists"}