Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/lucidrains/zorro-pytorch

Implementation of Zorro, Masked Multimodal Transformer, in Pytorch
https://github.com/lucidrains/zorro-pytorch

artificial-intelligence attention-mechanisms deep-learning masking multimodal transformers

Last synced: 2 months ago
JSON representation

Implementation of Zorro, Masked Multimodal Transformer, in Pytorch

Host: GitHub
URL: https://github.com/lucidrains/zorro-pytorch
Owner: lucidrains
License: mit
Created: 2023-01-26T16:27:01.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-10-20T01:42:08.000Z (over 1 year ago)
Last Synced: 2024-12-15T18:08:20.074Z (2 months ago)
Topics: artificial-intelligence, attention-mechanisms, deep-learning, masking, multimodal, transformers
Language: Python
Homepage:
Size: 197 KB
Stars: 94
Watchers: 11
Forks: 6
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        

## Zorro - Pytorch

Implementation of Zorro, Masked Multimodal Transformer, in Pytorch. This is a Deepmind work that claims a special masking strategy within a transformer help them achieve SOTA on a few multimodal benchmarks.

## Appreciation

- Stability.ai for the generous sponsorship to work and open source cutting edge artificial intelligence research

## Install

```bash

$ pip install zorro-pytorch

```

## Usage

```python

import torch

from zorro_pytorch import Zorro, TokenTypes as T

model = Zorro(

    dim = 512,                        # model dimensions

    depth = 6,                        # depth

    dim_head = 64,                    # attention dimension heads

    heads = 8,                        # attention heads

    ff_mult = 4,                      # feedforward multiple

    num_fusion_tokens = 16,           # number of fusion tokens

    audio_patch_size = 16,            # audio patch size, can also be Tuple[int, int]

    video_patch_size = 16,            # video patch size, can also be Tuple[int, int]

    video_temporal_patch_size = 2,    # video temporal patch size

    video_channels = 3,               # video channels

    return_token_types = (

        T.AUDIO,

        T.AUDIO,

        T.FUSION,

        T.GLOBAL,

        T.VIDEO,

        T.VIDEO,

        T.VIDEO,

    ) # say you want to return 2 tokens for audio, 1 token for fusion, 3 for video - for whatever self-supervised learning, supervised learning, etc etc

)

video = torch.randn(2, 3, 8, 32, 32) # (batch, channels, time, height, width)

audio = torch.randn(2, 1024 * 10)    # (batch, time)

return_tokens = model(audio = audio, video = video) # (2, 6, 512) - all 6 tokes as indicated above is returned

# say you only want 1 audio and 1 video token, for contrastive learning

audio_token, video_token = model(audio = audio, video = video, return_token_indices = (0, 3)).unbind(dim = -2) # (2, 512), (2, 512)

```

## Citations

```bibtex

@inproceedings{Recasens2023ZorroTM,

  title  = {Zorro: the masked multimodal transformer},

  author = {Adri{\`a} Recasens and Jason Lin and Jo{\~a}o Carreira and Drew Jaegle and Luyu Wang and Jean-Baptiste Alayrac and Pauline Luc and Antoine Miech and Lucas Smaira and Ross Hemsley and Andrew Zisserman},

  year   = {2023}

}

```