https://github.com/zinengtang/tvlt
PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
- Host: GitHub
- URL: https://github.com/zinengtang/tvlt
- Owner: zinengtang
- License: mit
- Created: 2022-09-28T23:30:19.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-24T03:39:35.000Z (over 2 years ago)
- Last Synced: 2025-03-24T11:45:51.807Z (7 months ago)
- Topics: audio, pretraining, textless, transformers, tvlt, vision-and-audio, vision-and-language
- Language: Jupyter Notebook
- Homepage:
- Size: 5.09 MB
- Stars: 123
- Watchers: 1
- Forks: 13
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TVLT
### **[TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) [NeurIPS 2022 [bib](https://github.com/zinengtang/TVLT#citation)]**
[Zineng Tang*](https://zinengtang.github.io/), [Jaemin Cho*](https://j-min.io/), [Yixin Nie*](https://easonnie.github.io/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

Learning **compact** visual-linguistic Transformer representations from low-level continuous visual 👁 and audio 👂 perception signals, **without assuming the prior existence of written texts or tokens**.
## Introduction
Transformers for Vision-Language (VL) representation learning rely heavily on text-based inputs (some works use the audio channel only as an auxiliary channel).
TVLT takes audio and visual inputs for VL representation learning with **minimal modality-specific design** and **without text-specific modules such as tokenization and automatic speech recognition (ASR)**.
TVLT is pre-trained with vision-audio matching and masked autoencoding **(mask and then reconstruct the continuous inputs of video frames and audio spectrograms)**, following the previous idea of [training scalable vision learners with masked autoencoding on images (MAE)](https://arxiv.org/abs/2111.06377).
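As a rough illustration of the masked-autoencoding objective (a minimal sketch, not the code in this repository): both video frames and audio spectrograms are split into patches, a large fraction of the patches is dropped, and the model is trained to reconstruct the dropped patches with an MSE loss. The patch size, masking ratio, and tensor shapes below are illustrative assumptions only.

```
# Minimal sketch of the masked-autoencoding idea (illustrative only, not the
# implementation in this repo): patchify the continuous inputs, randomly drop
# patches, and compute an MSE loss on the dropped patches.
import torch

def patchify(x, patch=16):
    # x: (B, C, H, W) video frames, or (B, 1, F, T) audio spectrograms
    B, C, H, W = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H//p, W//p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

def mae_loss(patches, reconstruction, mask):
    # MSE averaged over the masked (dropped) patches only, as in MAE-style training
    return ((reconstruction - patches) ** 2).mean(-1)[mask].mean()

frames = torch.randn(2, 3, 224, 224)       # a batch of video frames (assumed size)
spec   = torch.randn(2, 1, 128, 1024)      # log-mel spectrograms: (freq bins, time steps)
for x in (frames, spec):
    p = patchify(x)
    mask = torch.rand(p.shape[:2]) < 0.75  # e.g. mask 75% of patches (assumed ratio)
    recon = torch.randn_like(p)            # stand-in for the decoder's output
    print(mae_loss(p, recon, mask))
```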
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks, such as visual question answering and multimodal sentiment analysis, **with 28x faster inference speed and only 1/3 of the parameters**.
## Install
### Setup `python` environment
```
conda create -n TVLT python=3.8  # You can also use another environment.
```

### Install `pytorch`, `torchvision`, and `torchaudio`
The following versions have been tested.
* `torch` 1.10.0 / 1.12.1
* `torchvision` 0.11.1 / 0.12.1
* `torchaudio` 0.10.0 / 0.13.1

You can try other versions of `pytorch`, but make sure they are compatible with your `cuda` and `cudnn`.
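If you want to sanity-check the installed versions and CUDA visibility before running anything, a minimal check looks like this (optional, not part of the repository's scripts):

```
# Optional sanity check: print the installed versions and whether CUDA is visible.
import torch
import torchvision
import torchaudio

print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio :", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```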
### Install other dependencies
```
pip install -r requirements.txt
```

## Demos
Get familiar with TVLT by trying the following demos.

* [Masked Autoencoding on Video Frames and Audio Spectrogram](Demo_Video_Audio_MAE.ipynb) [[Colab]](https://colab.research.google.com/github/zinengtang/TVLT/blob/main/Demo_Video_Audio_MAE.ipynb)
* [Sentiment Analysis on Video and Audio](Demo_Sentiment_Analysis.ipynb) [[Colab]](https://colab.research.google.com/github/zinengtang/TVLT/blob/main/Demo_Sentiment_Analysis.ipynb)
* [Emotional Analysis on Video and Audio](Demo_Emotional_Analysis.ipynb) [[Colab]](https://colab.research.google.com/github/zinengtang/TVLT/blob/main/Demo_Emotional_Analysis.ipynb)

## Training
### Pretraining (Data + scripts) -> [TVLT Pretraining](PT.md)
Download the MAE checkpoint [here](https://github.com/facebookresearch/mae).
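If you want to verify the downloaded checkpoint before launching pretraining, a quick inspection could look like the sketch below; the filename `mae_pretrain_vit_base.pth` is an assumption (one of the checkpoints offered on the MAE page), so point it at whatever file you actually downloaded.

```
# Hedged sketch: inspect the downloaded MAE checkpoint before pretraining.
import torch

ckpt = torch.load("mae_pretrain_vit_base.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # MAE checkpoints usually store weights under "model"
print(len(state_dict), "tensors; first keys:", list(state_dict)[:5])
```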
```
# Example
bash scripts/pretrain_mae_vam.sh
```

### Finetuning on Downstream Tasks (Data + scripts) -> [TVLT Finetuning](DS.md)
```
# Example
bash scripts/finetune_mosei.sh
```

## Released Models
The model weights are hosted on the [Huggingface Hub](https://huggingface.co/TVLT/models/tree/main).
If you have tried the demos, some models should already have been downloaded.

The details of each released TVLT model are described in the table below.
| Training | Input Format | Component | Link |
| --- | --- | --- | --- |
| Pre-trained on Howto100m + Yttemporal videos|Video 👁+ Audio👂|Encoder + Decoder|[[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT.ckpt)|
| Pre-trained on Howto100m + Yttemporal videos, then finetuned on CMU-MOSEI sentiment analysis|Video 👁+ Audio👂|Encoder + Classification Head|[[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT-MOSEI-SA.ckpt)|
| Pre-trained on Howto100m + Yttemporal videos, then finetuned on CMU-MOSEI emotional analysis|Video 👁+ Audio👂|Encoder + Classification Head|[[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT-MOSEI-EA.ckpt)|
| Pre-trained on Howto100m + Yttemporal videos + ASR, then finetuned on CMU-MOSEI emotional analysis|Video 👁+ Text✍️|Encoder + Classification Head|[[link]](https://huggingface.co/TVLT/models/resolve/main/TVLT-MOSEI-EA-text.ckpt)|

**To be continued...** (Stay tuned, more pre-trained variants are coming soon.)
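A released checkpoint can also be fetched programmatically from the Huggingface Hub, for example with the sketch below; how the weights are then loaded into a TVLT model depends on the classes in this repository and the demo notebooks, and is not shown here.

```
# Sketch: download a released checkpoint from the Huggingface Hub and inspect it.
# Loading the weights into a model is repo-specific and not shown here.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="TVLT/models", filename="TVLT.ckpt")
ckpt = torch.load(path, map_location="cpu")
print(type(ckpt), list(ckpt)[:5] if isinstance(ckpt, dict) else "")
```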
## Folder Structure
See [Folder Structure](CODE.md)
## Updates
- [x] Initial Code Release
- [x] Notebook Demos
- [x] Colab
- [ ] Release TTS question audio for VQA (we convert all the textual questions of VQAv2 to audio using the Google TTS API)

**...**
## Recommended Usage
In our experiments, we pre-train TVLT on HowTo100M and YTtemporal videos. However, we recommend unlocking the full power of TVLT by pre-training it on larger-scale video data for a more generic Vision-Language representation.
The resulting models can either be used to directly process video (with the audio channel) inputs for tasks such as audio-image/video retrieval, audio-VQA, and TTS-based VQA, or to extract visual-acoustic features for other tasks such as speech translation and multimodal content understanding.
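As a rough sketch of what "directly processing video with the audio channel" involves, the snippet below turns a (hypothetical) video file into the two continuous inputs TVLT consumes: a stack of RGB frames and a log-mel spectrogram. The file name, mel settings, and normalization are illustrative assumptions; the preprocessing actually used by the released checkpoints is defined in this repository and the demo notebooks.

```
# Illustrative preprocessing sketch (assumptions: file name, mel settings, shapes).
import torch
import torchaudio
from torchvision.io import read_video

# read_video returns frames as (T, H, W, C) uint8 and audio as (channels, samples)
frames, audio, info = read_video("example.mp4", pts_unit="sec")
frames = frames.permute(0, 3, 1, 2).float() / 255.0   # (T, C, H, W) in [0, 1]

waveform = audio.mean(dim=0, keepdim=True)             # mix down to mono
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=int(info["audio_fps"]),                 # present only if the file has audio
    n_mels=128,                                         # assumed number of mel bins
)(waveform)
log_mel = torch.log(mel + 1e-6)                         # (1, 128, time)

print(frames.shape, log_mel.shape)
```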
## Citation

```
@inproceedings{tang2022tvlt,
title = {TVLT: Textless Vision-Language Transformer},
author = {Zineng Tang and Jaemin Cho and Yixin Nie and Mohit Bansal},
booktitle = {NeurIPS},
year = {2022}
}
```

## Acknowledgement
The idea of this paper is heavily inspired by [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377).
Our codebase is based on [ViLT](https://github.com/dandelin/ViLT).
We thank the authors for their open-source contributions.

## Contact
Zineng Tang (zn.tang.terran@gmail.com)