{"id":28447834,"url":"https://github.com/opengvlab/mutr","last_synced_at":"2025-06-30T13:32:23.990Z","repository":{"id":169228967,"uuid":"645139624","full_name":"OpenGVLab/MUTR","owner":"OpenGVLab","description":"「AAAI 2024」 Referred by Multi-Modality: A Unified Temporal Transformers for Video Object Segmentation","archived":false,"fork":false,"pushed_at":"2024-06-26T14:39:46.000Z","size":8710,"stargazers_count":79,"open_issues_count":3,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-06T12:07:20.388Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-05-25T02:29:44.000Z","updated_at":"2025-05-12T05:14:25.000Z","dependencies_parsed_at":"2023-07-20T04:16:46.865Z","dependency_job_id":null,"html_url":"https://github.com/OpenGVLab/MUTR","commit_stats":null,"previous_names":["opengvlab/mutr"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OpenGVLab/MUTR","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMUTR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMUTR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMUTR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMUTR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVL
ab/MUTR/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMUTR/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262783305,"owners_count":23363512,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-06T12:07:19.772Z","updated_at":"2025-06-30T13:32:23.935Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MUTR: A Unified Temporal Transformer for Multi-Modal Video Object Segmentation\n\nOfficial implementation of ['Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation'](https://arxiv.org/abs/2305.16318).\n\nThe paper has been accepted by **AAAI 2024** 🔥.\n\u003c!-- \u003cdiv align=\"center\"\u003e\n\u003ch1\u003e\n\u003cb\u003e\nReferred by Multi-Modality: A Unified Temporal \u003cbr\u003e Transformer for Video Object Segmentation\n\u003c/b\u003e\n\u003c/h1\u003e\n\u003c/div\u003e --\u003e\n\n## Introduction\nWe propose **MUTR**, a **M**ulti-modal **U**nified **T**emporal transformer for **R**eferring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. 
Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals: low-level temporal aggregation (MTA) and high-level temporal interaction (MTI).\nOn Ref-YouTube-VOS and AVSBench with respective text and audio references, MUTR achieves **+4.2\\%** and **+4.2\\%** J\u0026F improvements over *state-of-the-art* methods, demonstrating the significance of our approach for unified multi-modal VOS.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"docs/network.png\" width=\"800\"/\u003e\u003c/p\u003e\n\n## Update\n* **TODO**: Release the code and checkpoints on AV-VOS with audio reference 📌.\n* We release the code and checkpoints of MUTR on RVOS with language reference 🔥.\n\n## Requirements\n\nWe tested the code in the following environment; other versions may also be compatible:\n\n- CUDA 11.1\n- Python 3.7\n- PyTorch 1.8.1\n\n\n## Installation\n\nPlease refer to [install.md](docs/install.md) for installation.\n\n\n\n## Data Preparation\n\nPlease refer to [data.md](docs/data.md) for data preparation.\n\nAfter organizing the data, we expect the directory structure to be the following:\n\n```\nMUTR/\n├── data/\n│   ├── ref-youtube-vos/\n│   ├── ref-davis/\n├── davis2017/\n├── datasets/\n├── models/\n├── scipts/\n├── tools/\n├── util/\n├── train.py\n├── engine.py\n├── inference_ytvos.py\n├── inference_davis.py\n├── opts.py\n...\n```\n\n## Get Started\n\nPlease see [Ref-YouTube-VOS](docs/Ref-YouTube-VOS.md) and [Ref-DAVIS 2017](docs/Ref-DAVIS2017.md) for details.\n\n\n## Model Zoo and Results\n\n**Note:** \n\n `--backbone` selects the backbone (see [here](https://github.com/OpenGVLab/MUTR/blob/c4d8901e0fca1da667922d453a004259ffb1a5cd/opts.py#L31)).\n\n `--backbone_pretrained` denotes the path to the backbone's pretrained weights (see [here](https://github.com/OpenGVLab/MUTR/blob/c4d8901e0fca1da667922d453a004259ffb1a5cd/opts.py#L33)).\n\n\n\n\n### Ref-YouTube-VOS\n\nTo evaluate the results, please upload the zip file to 
the [competition server](https://codalab.lisn.upsaclay.fr/competitions/3282#participate-submit_results).\n\n\n| Backbone| J\u0026F | J | F | Model | Submission | \n| :----: | :----: | :----: | :----: | :----: | :----: |\n| ResNet-50 | 61.9 | 60.4 | 63.4 | [model](https://drive.google.com/file/d/1W1hSYd1DDFdhl46rpE1Y1OgsG1N5Zh7B/view?usp=sharing) | [link](https://drive.google.com/file/d/1ORmyM8cNgnjnXSy6SBC27wKRsORAc8Wu/view?usp=sharing) |\n| ResNet-101 | 63.6 | 61.8 | 65.4 | [model](https://drive.google.com/file/d/1tIX6jmM9MjCxbMDh89e2LugY2ul12GD6/view?usp=sharing) | [link](https://drive.google.com/file/d/1JAG6u_U5c5w0K0z3D5_r3UseN2Fmk9_y/view?usp=sharing) |\n| Swin-L | 68.4 | 66.4 | 70.4 | [model](https://drive.google.com/file/d/1PrWZjppjxEvJe2wQ7a3augG4iRQX1pLJ/view?usp=sharing) | [link](https://drive.google.com/file/d/1EYh82Ij30IJTO4Kn1-jvbbpARJybJzdj/view?usp=sharing) |\n| Video-Swin-T | 64.0 | 62.2 | 65.8 | [model](https://drive.google.com/file/d/1-TkdQksTrmB253ao99NgnmsrsQkous2V/view?usp=sharing) | [link](https://drive.google.com/file/d/14bNF3WsPResaUrB0NWmJ8GQ1eaE-Fw_7/view?usp=sharing) |\n| Video-Swin-S | 65.1 | 63.0 | 67.1 | [model](https://drive.google.com/file/d/1Z4ENlWAKIEp44HC0OH4CjsZXgQTMTvDK/view?usp=sharing) | [link](https://drive.google.com/file/d/19kWvu1fc-5hhkI1Ibzzps3pYQA4N42JU/view?usp=sharing) |\n| Video-Swin-B | 67.5 | 65.4 | 69.6 | [model](https://drive.google.com/file/d/1-ezn8H2GPTc7o6cUGN1r3DI6sDLF2J5s/view?usp=sharing) | [link](https://drive.google.com/file/d/1aYFs_DDsEFHo7Dd8pOG24O2rwyjjpEMN/view?usp=sharing) |\n| ConvNext-L | 66.7 | 64.8 | 68.7 | [model](https://drive.google.com/file/d/1w4o392nrKEDd2JqlBcu5r1ZqoSAMtFwD/view?usp=sharing) | [link](https://drive.google.com/file/d/1jASGNhitDozzN9trIlAVWsmjio7GjsA0/view?usp=sharing) |\n| ConvMAE-B | 66.9 | 64.7 | 69.1 | [model](https://drive.google.com/file/d/1_hHPVici-RIcn7ocvn6RPDSMG4gJA5Pj/view?usp=sharing) | 
[link](https://drive.google.com/file/d/1CORTnxJo4hWRCR4eSTcgPjxTi_5ZOlPV/view?usp=sharing) |\n\n\n\n\n### Ref-DAVIS17\n\nAs described in the paper, we report the results using the model trained on Ref-YouTube-VOS without fine-tuning.\n\n| Backbone| J\u0026F | J | F | Model | \n| :----: | :----: | :----: | :----: | :----: | \n| ResNet-50 | 65.3 | 62.4 | 68.2 | [model](https://drive.google.com/file/d/1W1hSYd1DDFdhl46rpE1Y1OgsG1N5Zh7B/view?usp=sharing) | \n| ResNet-101 | 65.3 | 61.9 | 68.6 | [model](https://drive.google.com/file/d/1tIX6jmM9MjCxbMDh89e2LugY2ul12GD6/view?usp=sharing) |\n| Swin-L | 68.0 | 64.8 | 71.3 | [model](https://drive.google.com/file/d/1PrWZjppjxEvJe2wQ7a3augG4iRQX1pLJ/view?usp=sharing) |\n| Video-Swin-T | 66.5 | 63.0 | 70.0 | [model](https://drive.google.com/file/d/1-TkdQksTrmB253ao99NgnmsrsQkous2V/view?usp=sharing) |\n| Video-Swin-S | 66.1 | 62.6 | 69.8 | [model](https://drive.google.com/file/d/1Z4ENlWAKIEp44HC0OH4CjsZXgQTMTvDK/view?usp=sharing)  |\n| Video-Swin-B | 66.4 | 62.8 | 70.0 | [model](https://drive.google.com/file/d/1-ezn8H2GPTc7o6cUGN1r3DI6sDLF2J5s/view?usp=sharing) |\n| ConvNext-L | 69.0 | 65.6 | 72.4 | [model](https://drive.google.com/file/d/1w4o392nrKEDd2JqlBcu5r1ZqoSAMtFwD/view?usp=sharing) | \n| ConvMAE-B | 69.2 | 65.6 | 72.8 | [model](https://drive.google.com/file/d/1_hHPVici-RIcn7ocvn6RPDSMG4gJA5Pj/view?usp=sharing) |\n\n\n## Acknowledgement\n\nThis repo is based on [ReferFormer](https://github.com/wjn922/ReferFormer/tree/main). We also refer to the repositories [Deformable DETR](https://github.com/ashkamath/mdetr) and [MTTR](https://github.com/fundamentalvision/Deformable-DETR). 
We thank the authors for their wonderful work.\n\n\n## Citation\n\n```\n@inproceedings{yan2024referred,\n  title={Referred by multi-modality: A unified temporal transformer for video object segmentation},\n  author={Yan, Shilin and Zhang, Renrui and Guo, Ziyu and Chen, Wenchao and Zhang, Wei and Li, Hongyang and Qiao, Yu and Dong, Hao and He, Zhongjiang and Gao, Peng},\n  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},\n  volume={38},\n  number={6},\n  pages={6449--6457},\n  year={2024}\n}\n```\n\n## Contact\nIf you have any questions about this project, please feel free to contact tattoo.ysl@gmail.com.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengvlab%2Fmutr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopengvlab%2Fmutr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengvlab%2Fmutr/lists"}