{"id":21564825,"url":"https://github.com/zinengtang/tvlt","last_synced_at":"2025-10-16T08:26:30.507Z","repository":{"id":60553308,"uuid":"542833006","full_name":"zinengtang/TVLT","owner":"zinengtang","description":"PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)","archived":false,"fork":false,"pushed_at":"2023-02-24T03:39:35.000Z","size":5337,"stargazers_count":123,"open_issues_count":8,"forks_count":13,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-24T11:45:51.807Z","etag":null,"topics":["audio","pretraining","textless","transformers","tvlt","vision-and-audio","vision-and-language"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zinengtang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-09-28T23:30:19.000Z","updated_at":"2025-03-10T07:48:04.000Z","dependencies_parsed_at":"2023-02-15T07:31:06.521Z","dependency_job_id":null,"html_url":"https://github.com/zinengtang/TVLT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zinengtang%2FTVLT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zinengtang%2FTVLT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zinengtang%2FTVLT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zinengtang%2FTVLT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zinengtang","download_url":"https://codeload.github.com/zinengtang/TVL
T/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248224246,"owners_count":21068072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio","pretraining","textless","transformers","tvlt","vision-and-audio","vision-and-language"],"created_at":"2024-11-24T10:17:18.629Z","updated_at":"2025-10-16T08:26:25.455Z","avatar_url":"https://github.com/zinengtang.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TVLT\n\n### **[TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) [NeurIPS 2022 [bib](https://github.com/zinengtang/TVLT#citation)]**  \n[Zineng Tang*](https://zinengtang.github.io/), [Jaemin Cho*](https://j-min.io/), [Yixin Nie*](https://easonnie.github.io/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)   \n\nLearning **compact** visual-linguistic Transformer representation from low-level continuous visual 👁 and audio👂 perception signal **without assuming the prior existence of written texts or tokens**\n\n## Introduction\n\u003c!-- \u003cp align=\"center\"\u003e\n  \u003cbig\u003e\u003cb\u003eTVLT: Textless Vision-Language Transformer (NeurIPS 2022)\u003c/b\u003e\u003c/big\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n  \u003cbig\u003e\u003cb\u003eZineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal\u003c/b\u003e\u003c/big\u003e\n\u003c/p\u003e --\u003e\n\nTransformers for Vision-Language (VL) representation learning heavily rely on text-based inputs. 
(Some works use the audio channel, but only as an auxiliary input.)  \n\nTVLT takes audio and visual inputs for VL representation learning with **minimal modality-specific design** and **without text-specific modules such as tokenization and automatic speech recognition (ASR)**.  \n\nTVLT is pre-trained with vision-audio matching and masked autoencoding **(mask and then reconstruct the continuous input of video frames and audio spectrogram)**, following the previous idea of [training scalable vision learners with mask autoencoding on images (the Vision-BERT)](https://arxiv.org/abs/2111.06377).    \n\n\u003cp align=\"center\"\u003e\n  \u003cimg align=\"middle\" width=\"800\" src=\"assets/architecture.png\"/\u003e\n\u003c/p\u003e\n\n\n\u003cdetails\u003e\n  \u003csummary\u003eMore\u003c/summary\u003e\n  \n\n  TVLT attains performance comparable to its text-based counterpart on various multimodal tasks, such as visual question answering and multimodal sentiment analysis, **with 28x faster inference speed and only 1/3 of the parameters**.\n\n  \u003cp align=\"center\"\u003e\n    \u003cimg align=\"middle\" width=\"800\" src=\"assets/teaser.png\"/\u003e\n  \u003c/p\u003e\n  \n\u003c/details\u003e\n\n## Install\n### Set up the `python` environment\n```\nconda create -n TVLT python=3.8   # You can also use another environment.\n```\n\n### Install `pytorch`, `torchvision`, and `torchaudio`\nThe following versions have been tested.  \n* `torch  1.10.0  1.12.1`\n* `torchvision  0.11.1  0.12.1` \n* `torchaudio  0.10.0  0.13.1`  \n\nYou can try other versions of `pytorch`, but make sure they are compatible with your `cuda` and `cudnn`.  \n\n### Install other dependencies\n```\npip install -r requirements.txt\n```\n\u003c!-- \n## Model Weights\n[Huggingface Hub](https://huggingface.co/TVLT/models). 
--\u003e\n\n## Demos\nGet familiar with TVLT by trying the following demos.\n\n* [Masked Autoencoding on Video Frames and Audio Spectrogram](Demo_Video_Audio_MAE.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zinengtang/TVLT/blob/main/Demo_Video_Audio_MAE.ipynb)\n* [Sentiment Analysis on Video and Audio](Demo_Sentiment_Analysis.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zinengtang/TVLT/blob/main/Demo_Sentiment_Analysis.ipynb)\n* [Emotional Analysis on Video and Audio](Demo_Emotional_Analysis.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zinengtang/TVLT/blob/main/Demo_Emotional_Analysis.ipynb)\n\n\u003c!-- \u003cp align=\"center\"\u003e\n  \u003cbig\u003e\u003cb\u003eDemo Examples\u003c/b\u003e\u003c/big\u003e\n\n\u003c/p\u003e \n\n\u003cp align=\"center\"\u003e\n  \u003cimg align=\"middle\" height=\"180\" src=\"assets/demo_example.png\"/\u003e\n  \u003cimg align=\"middle\" height=\"180\" src=\"assets/demo_example2.png\"/\u003e\n  \u003cimg align=\"middle\" height=\"180\" src=\"assets/demo_example3.png\"/\u003e\n\u003c/p\u003e --\u003e\n\n\n## Training\n\n### Pretraining (Data + scripts) -\u003e [TVLT Pretraining](PT.md)\nDownload the MAE checkpoint [here](https://github.com/facebookresearch/mae).\n```\n# Example\nbash scripts/pretrain_mae_vam.sh\n```\n\n### Finetuning on Downstream (Data + scripts) -\u003e [TVLT Finetuning](DS.md)\n\n```\n# Example\nbash scripts/finetune_mosei.sh\n```\n\n## Released Models\n\nThe model weights are hosted on the [Huggingface Hub](https://huggingface.co/TVLT/models/tree/main).  \nIf you have tried the demos, some models should have already been downloaded.\n\nThe details of each released TVLT model are described in the table below.  
\n\n| Training    | Input Format | Component | Link |\n| --- | --- | --- | --- |\n| Pre-trained on Howto100m + Yttemporal videos|Video 👁+ Audio👂|Encoder + Decoder|[[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT.ckpt)|\n| Pre-trained on Howto100m + Yttemporal videos, then finetuned on CMU-MOSEI sentiment analysis|Video 👁+ Audio👂|Encoder + Classification Head|[[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT-MOSEI-SA.ckpt)|\n| Pre-trained on Howto100m + Yttemporal videos, then finetuned on CMU-MOSEI emotional analysis|Video 👁+ Audio👂|Encoder + Classification Head|[[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT-MOSEI-EA.ckpt)|\n| Pre-trained on Howto100m + Yttemporal videos+ASR, then finetuned on CMU-MOSEI emotional analysis|Video 👁+ Text✍️|Encoder + Classification Head|[[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT-MOSEI-EA-text.ckpt)|\n\n**To be continued...** (Stay tuned; more pre-trained variants are coming soon.)\n\u003c!-- * A TVLT model pre-trained on Howto100m + Yttemporal videos, then finetuned on CMU-MOSEI sentiment analysis:  --\u003e\n\n\u003c!-- * A TVLT model on CMU-MOSEI emotional analysis \n\n* Finetuned (Text-based) on CMU-MOSEI emotional analysis [[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT-MOSEI-EA-text.ckpt) --\u003e\n\n\u003c!-- and specify with command \"load_local_path\".\n\n```\nload_local_path=\"path/to/the/checkpoint\"\n```\n\nOr use command \"load_hub_path\", which will automatically download model for training scripts.\n\n```\nload_hub_path=\"TVLT.ckpt\"\n``` --\u003e\n\n## Folder Structure\n\nSee [Folder Structure](CODE.md)\n\n## Updates\n- [x] Initial Code Release\n- [x] Notebook Demos\n- [x] Colab\n- [ ] Release TTS question audios for VQA (We convert all the textual questions of VQAv2 to audio using the Google TTS API.)   \n\n**...**\n\n\n## Recommended Usage\n\nIn our experiments, we pre-train TVLT on HowTo100M and YTtemporal videos. 
However, we recommend unlocking the power of TVLT by pre-training it on large-scale videos to obtain more generic Vision-Language representations.  \nThe resulting models can be used either to directly process video inputs (with the audio channel) for tasks such as audio-image/video retrieval, audio-VQA, and TTS-based VQA, or to extract visual-acoustic features for other tasks such as speech translation and multimodal content understanding.\n\n\n## Citation\n```\n@inproceedings{tang2022tvlt,\n  title     = {TVLT: Textless Vision-Language Transformer},\n  author    = {Zineng Tang and Jaemin Cho and Yixin Nie and Mohit Bansal},\n  booktitle = {NeurIPS},\n  year      = {2022}\n}\n```\n\n## Acknowledgement\n\nThe idea of this paper is heavily inspired by [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377).  \nOur codebase is based on [ViLT](https://github.com/dandelin/ViLT). \nWe thank the authors for their open-source contributions.\n\n## Contact\n\nZineng Tang (zn.tang.terran@gmail.com)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzinengtang%2Ftvlt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzinengtang%2Ftvlt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzinengtang%2Ftvlt/lists"}