[Join our Discord](https://discord.gg/qUtxnK2NMf)
# BRAVE or Swarms of Vision Transformers
Implementation of the paper: "BRAVE: Broadening the visual encoding of vision-language models". BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces common VLM failure modes such as visual hallucination, while requiring fewer trainable parameters than existing methods and producing a more compressed visual representation.
## Install

`pip3 install brave-torch`

## Usage
### `LLM`
- A fully ready-to-train LLM combining the swarm of ViTs with the MEQFormer.
```python
import torch

from brave_torch.llm import LLM

# Random token IDs in [0, 256) and a random RGB image
x = torch.randint(0, 256, (1, 1000))
img = torch.randn(1, 3, 256, 256)

model = LLM(
    dim=512,          # Model (embedding) dimension
    depth=1,          # Number of decoder layers
    num_tokens=256,   # Vocabulary size
    dim_head=64,      # Dimension per attention head
    heads=8,          # Number of attention heads
    ff_mult=4,        # Feed-forward expansion multiplier
    image_size=256,   # Input image resolution
    patch_size=32,    # ViT patch size
    encoder_dim=512,  # Vision encoder dimension
    encoder_depth=6,  # Vision encoder depth
    encoder_heads=8,  # Attention heads per vision encoder
    num_of_vits=4,    # Number of ViTs in the swarm
)

out = model(x, img)   # Forward pass over text tokens + image
print(out.shape)
```
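For orientation, below is a minimal, hypothetical training-step sketch. It assumes `model` is the `LLM` instance from the example above and that its forward pass returns next-token logits of shape `(batch, seq_len, num_tokens)`; check `brave_torch.llm` for the actual return shape before relying on this.

```python
import torch
import torch.nn.functional as F

# Hypothetical training step (assumes `model` is the LLM above and
# that it returns logits of shape (batch, seq_len, num_tokens))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, 256, (1, 1000))  # Random token IDs in [0, 256)
images = torch.randn(1, 3, 256, 256)       # Random RGB batch

logits = model(tokens, images)

# Standard next-token objective: predict token t+1 from tokens <= t
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    tokens[:, 1:].reshape(-1),
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```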
### `BraveMultiModalFusion`
- The swarm of ViTs coupled with the MEQFormer (multi-encoder querying transformer).
```python
import torch

from brave_torch.main import BraveMultiModalFusion

# Random text embeddings and a random RGB image
x = torch.randn(1, 1000, 512)
img = torch.randn(1, 3, 256, 256)

model = BraveMultiModalFusion(
    dim=512,          # Model (embedding) dimension
    mult=4,           # Feed-forward expansion multiplier
    depth=1,          # Number of fusion layers
    dropout=0.1,      # Dropout rate
    heads=8,          # Number of attention heads
    image_size=256,   # Input image resolution
    patch_size=32,    # ViT patch size
    encoder_dim=512,  # Vision encoder dimension
    encoder_depth=6,  # Vision encoder depth
    encoder_heads=8,  # Attention heads per vision encoder
    num_of_vits=4,    # Number of ViTs in the swarm
)

out = model(x, img)  # Forward pass over text embeddings + image
print(out)
```
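To make the fusion idea concrete, here is a conceptual, from-scratch sketch of the core mechanism described in the paper: features from several independent vision encoders are concatenated along the sequence dimension, and a small set of learnable queries cross-attends to them to produce a fixed-length visual prefix. This is an illustration of the idea only, not the `brave_torch` implementation; all names below are hypothetical.

```python
import torch
import torch.nn as nn

class NaiveMultiEncoderFusion(nn.Module):
    """Conceptual sketch: learnable queries cross-attend to the
    concatenated outputs of several vision encoders (hypothetical,
    not the brave_torch implementation)."""

    def __init__(self, dim: int, num_queries: int = 32, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, encoder_feats: list[torch.Tensor]) -> torch.Tensor:
        # Each element: (batch, tokens_i, dim); concatenate along tokens
        kv = torch.cat(encoder_feats, dim=1)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)  # (batch, num_queries, dim)
        return fused

# Four random feature maps stand in for the swarm of ViT encoders
feats = [torch.randn(1, 64, 512) for _ in range(4)]
fusion = NaiveMultiEncoderFusion(dim=512)
print(fusion(feats).shape)  # torch.Size([1, 32, 512])
```

In the paper, the MEQ-Former's queries are additionally conditioned on the text prompt before cross-attending to the concatenated encoder features; the sketch above omits that for brevity.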
## Citations

## Todo
- [ ] Citation link
- [ ] Citation BibTeX
- [ ] Diagram photo
- [ ] Implement Andromeda Base LLM architecture
- [ ] Provide multi-modal tokenizer
- [ ] Train and release the model