https://github.com/lucidrains/mirasol-pytorch

Implementation of 🌻 Mirasol, SOTA Multimodal Autoregressive model out of Google Deepmind, in Pytorch

Host: GitHub
URL: https://github.com/lucidrains/mirasol-pytorch
Owner: lucidrains
License: mit
Created: 2023-11-18T17:16:16.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-12-22T13:45:49.000Z (almost 2 years ago)
Last Synced: 2025-08-18T08:57:17.961Z (3 months ago)
Topics: artificial-intelligence, attention-mechanism, deep-learning, multimodality, transformers
Language: Python
Homepage:
Size: 1.01 MB
Stars: 89
Watchers: 7
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


## 🌻 Mirasol - Pytorch

Implementation of Mirasol, a state-of-the-art multimodal autoregressive model from Google DeepMind, in PyTorch.

This repository implements only the Transformer Combiner and omits the other combiner variants.
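
For readers unfamiliar with the term, here is a minimal sketch of what a Transformer-based combiner could look like. It assumes the combiner concatenates the audio and video tokens of one time chunk, runs them through a small Transformer, and keeps the last `m` output tokens as the combined representation. The class name `ToyTransformerCombiner` and every hyperparameter below are hypothetical; this is an illustration, not the code in `mirasol_pytorch`.

```python
import torch
from torch import nn

class ToyTransformerCombiner(nn.Module):
    """Hypothetical combiner: fuse audio and video tokens, keep the last m outputs."""

    def __init__(self, dim = 64, depth = 1, heads = 4, num_combined_tokens = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model = dim,
            nhead = heads,
            dim_feedforward = dim * 4,
            batch_first = True  # inputs are (batch, tokens, dim)
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers = depth)
        self.num_combined_tokens = num_combined_tokens

    def forward(self, audio_tokens, video_tokens):
        # concatenate both modalities along the token dimension
        tokens = torch.cat((audio_tokens, video_tokens), dim = 1)
        out = self.transformer(tokens)
        # assumption: the last m output tokens serve as the combined features
        return out[:, -self.num_combined_tokens:]

combiner = ToyTransformerCombiner()

audio_tokens = torch.randn(1, 4, 64)  # (batch, audio tokens in one chunk, dim)
video_tokens = torch.randn(1, 4, 64)  # (batch, video tokens in one chunk, dim)

combined = combiner(audio_tokens, video_tokens)
print(combined.shape)
```

Keeping only a few output tokens per time chunk shortens the sequence that a downstream autoregressive model has to attend over.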

## Appreciation

- StabilityAI, A16Z Open Source AI Grant Program, and 🤗 Huggingface for the generous sponsorships, as well as my other sponsors, for affording me the independence to open source current artificial intelligence research

## Install

```bash
$ pip install mirasol-pytorch
```

## Usage

```python
import torch
from mirasol_pytorch import Mirasol

model = Mirasol(
    dim = 512,
    num_text_tokens = 256,
    video_image_size = 128,
    video_frames_per_timechunk = 2,
    audio_freq_dim = 64,
    audio_time_dim_per_timechunk = 32,
    audio_patch_size = (32, 16),
    video_patch_size = (64, 2),
    audio_encoder = dict(
        dim = 512,
        depth = 2
    ),
    video_encoder = dict(
        dim = 512,
        depth = 2
    )
)

audio = torch.randn(1, 64, 1024)
video = torch.randn(1, 3, 12, 128, 128)

text = torch.randint(0, 256, (1, 1024))

loss = model(
    audio = audio,
    video = video,
    text = text
)

loss.backward()

# after much training

sampled_text = model.generate(
    audio = audio,
    video = video,
    seq_len = 512
)
```
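
To make the shapes in the usage example easier to follow, the sketch below works out how many time chunks and patches the example tensors would produce. It uses plain Python arithmetic only. It assumes that `audio_patch_size` is ordered `(frequency, time)`, that `video_patch_size` is ordered `(spatial, temporal)`, and that inputs are split into non-overlapping chunks and patches; these readings are assumptions, not something the README states.

```python
# values copied from the usage example
audio_freq_dim = 64
audio_time_len = 1024                 # last dimension of `audio`
audio_time_dim_per_timechunk = 32
audio_patch_size = (32, 16)           # assumed order: (frequency, time)

video_frames = 12                     # frame dimension of `video`
video_image_size = 128
video_frames_per_timechunk = 2
video_patch_size = (64, 2)            # assumed order: (spatial, temporal)

# number of time chunks per modality
audio_chunks = audio_time_len // audio_time_dim_per_timechunk
video_chunks = video_frames // video_frames_per_timechunk

# number of patches inside a single time chunk
audio_patches_per_chunk = (
    (audio_freq_dim // audio_patch_size[0])
    * (audio_time_dim_per_timechunk // audio_patch_size[1])
)
video_patches_per_chunk = (
    (video_image_size // video_patch_size[0]) ** 2
    * (video_frames_per_timechunk // video_patch_size[1])
)

print(audio_chunks, video_chunks)                        # 32 6
print(audio_patches_per_chunk, video_patches_per_chunk)  # 4 4
```

Under these assumptions each modality contributes four patches per time chunk, while the example audio and video tensors span different numbers of chunks.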

## Todo

- [x] text generation code

- [x] auto-handle start token for decoder

- [x] positional embeddings for video and audio encoder

- [x] enable register tokens for both video and audio encoder, in line with new research

- [x] add audio and video reconstruction losses

- [x] add similarity regularization from TTS research

## Citations

```bibtex
@article{Piergiovanni2023Mirasol3BAM,
    title   = {Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities},
    author  = {A. J. Piergiovanni and Isaac Noble and Dahun Kim and Michael S. Ryoo and Victor Gomes and Anelia Angelova},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2311.05698},
    url     = {https://api.semanticscholar.org/CorpusID:265129010}
}
```

```bibtex
@inproceedings{Liu2022TowardsBF,
    title   = {Towards Better Few-Shot and Finetuning Performance with Forgetful Causal Language Models},
    author  = {Hao Liu and Xinyang Geng and Lisa Lee and Igor Mordatch and Sergey Levine and Sharan Narang and P. Abbeel},
    year    = {2022},
    url     = {https://api.semanticscholar.org/CorpusID:256416540}
}
```

```bibtex
@article{Darcet2023VisionTN,
    title   = {Vision Transformers Need Registers},
    author  = {Timoth{\'e}e Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2309.16588},
    url     = {https://api.semanticscholar.org/CorpusID:263134283}
}
```

```bibtex
@article{Bondarenko2023QuantizableTR,
    title   = {Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing},
    author  = {Yelysei Bondarenko and Markus Nagel and Tijmen Blankevoort},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2306.12929},
    url     = {https://api.semanticscholar.org/CorpusID:259224568}
}
```

```bibtex
@misc{shi2023enhance,
    title   = {Enhance audio generation controllability through representation similarity regularization},
    author  = {Yangyang Shi and Gael Le Lan and Varun Nagaraja and Zhaoheng Ni and Xinhao Mei and Ernie Chang and Forrest Iandola and Yang Liu and Vikas Chandra},
    year    = {2023},
    eprint  = {2309.08773},
    archivePrefix = {arXiv},
    primaryClass = {cs.SD}
}
```
