Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kyegomez/palm2-vadapter
Implementation of "PaLM2-VAdapter:" from the multi-modal model paper: "PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter"
- Host: GitHub
- URL: https://github.com/kyegomez/palm2-vadapter
- Owner: kyegomez
- License: MIT
- Created: 2024-02-19T18:32:10.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-09-09T14:50:48.000Z (2 months ago)
- Last Synced: 2024-09-16T06:18:11.213Z (2 months ago)
- Topics: ai, attention, attention-is-all-you-need, attention-mechanisms, deeplearning, ml, models, multi-modal, neural-nets, transformers
- Language: Python
- Homepage: https://discord.gg/GYbXvDGevY
- Size: 2.17 MB
- Stars: 17
- Watchers: 3
- Forks: 0
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
  - Funding: .github/FUNDING.yml
  - License: LICENSE
README
[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)
# Palm2 Adapter
Implementation of "PaLM2-VAdapter:" from the multi-modal model paper: "PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter".This model uses a perceiver resampler with a depth of 1 + a tiny palm to efficiently learn the features behind the images and then map them to the same space as the big model.
## Install

`$ pip install palm-vadapter`

## Usage
```python
import torch
from palm_vadapter.main import PaLM2VAdapter

# Random text tensor (token ids)
text = torch.randint(0, 1000, (1, 32), dtype=torch.long)

# Random image tensor
img = torch.randn(1, 3, 224, 224)

# Initialize PaLM2VAdapter model
model = PaLM2VAdapter(
tiny_dim=512,
dim=512,
num_tokens=10000,
seq_length=32,
depth=6,
heads=8,
image_size=224,
patch_size=16,
)

# Forward pass through the model
out = model(text, img)

# Print the shape of the output
print(out.shape)
```
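A note on the constructor arguments, reading from their names (the repository itself is the authority here): `tiny_dim` appears to set the width of the tiny PaLM used for alignment, `dim` the shared embedding width, `num_tokens` the vocabulary size, and `seq_length` the text length; with `image_size=224` and `patch_size=16`, the image is split ViT-style into (224/16)² = 196 patches.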
# License

MIT

## Citation
```bibtex
@misc{xiao2024palm2vadapter,
title={PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter},
author={Junfei Xiao and Zheng Xu and Alan Yuille and Shen Yan and Boyu Wang},
year={2024},
eprint={2402.10896},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

## Todo
- [ ] Add video processing for every frame
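
Until that lands, one hypothetical workaround is to push a clip through the existing image pathway one frame at a time; the sketch below assumes a decoded `frames` tensor and the same constructor shown in the usage example above.

```python
import torch
from palm_vadapter.main import PaLM2VAdapter

# Hypothetical sketch: process a video frame by frame through the
# existing image interface. `frames` stands in for a decoded clip.
model = PaLM2VAdapter(
    tiny_dim=512,
    dim=512,
    num_tokens=10000,
    seq_length=32,
    depth=6,
    heads=8,
    image_size=224,
    patch_size=16,
)

text = torch.randint(0, 1000, (1, 32), dtype=torch.long)
frames = torch.randn(8, 3, 224, 224)  # 8 frames of a dummy clip

# Run the model once per frame and stack the per-frame outputs
# along a new time dimension.
outs = [model(text, frame.unsqueeze(0)) for frame in frames]
video_out = torch.stack(outs, dim=1)  # (batch, frames, ...)
print(video_out.shape)
```

Stacking the per-frame outputs keeps them aligned along a time axis, at the cost of running the adapter once per frame.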