https://github.com/kyegomez/visionllama
Implementation of VisionLLaMA from the paper: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks" in PyTorch and Zeta
- Host: GitHub
- URL: https://github.com/kyegomez/visionllama
- Owner: kyegomez
- License: mit
- Created: 2024-03-04T05:29:05.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-11T20:18:35.000Z (11 months ago)
- Last Synced: 2025-06-10T05:03:19.432Z (4 months ago)
- Topics: ai, deep-learning, multi-modal, vision-models, vision-transformers, vit
- Language: Python
- Homepage: https://discord.gg/GYbXvDGevY
- Size: 2.19 MB
- Stars: 16
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
[Discord](https://discord.gg/qUtxnK2NMf)
# VisionLLaMA
Implementation of VisionLLaMA from the paper: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks" in PyTorch and Zeta. [PAPER LINK](https://arxiv.org/abs/2403.00522)

## install

`$ pip install vision-llama`

## usage
```python
import torch

from vision_llama.main import VisionLlama

# Input tensor
x = torch.randn(1, 3, 224, 224)

# Create an instance of the VisionLlama model with the specified parameters
model = VisionLlama(
    dim=768, depth=12, channels=3, heads=12, num_classes=1000
)

# Print the output tensor produced by a forward pass
print(model(x))
```
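With `num_classes=1000`, the forward pass produces class logits of shape `(batch, num_classes)`, from which a predicted label can be read off with `argmax`. A minimal sketch, using a random tensor as a stand-in for the (untrained) model's output:

```python
import torch

# Stand-in for `logits = model(x)` from the snippet above:
# a (batch, num_classes) tensor of raw class scores.
logits = torch.randn(1, 1000)

# Softmax turns logits into probabilities; argmax picks the top class.
probs = torch.softmax(logits, dim=-1)
pred = probs.argmax(dim=-1)

print(pred.shape)  # torch.Size([1])
```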
# License

MIT

## Citation
```bibtex
@misc{chu2024visionllama,
title={VisionLLaMA: A Unified LLaMA Interface for Vision Tasks},
author={Xiangxiang Chu and Jianlin Su and Bo Zhang and Chunhua Shen},
year={2024},
eprint={2403.00522},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

## todo
- [ ] Implement the AS2DRoPE rope; axial rotary embeddings may be used instead, as the current implementation needs work
- [x] Implement the GSA attention; implemented, but the current version needs improvement
- [ ] Add an ImageNet training script with distributed training
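For the axial-rotary-embeddings alternative mentioned above, a minimal sketch of 2D axial rotary position embeddings: half of each head dimension is rotated by the patch's row position, the other half by its column position. The function names here are hypothetical illustrations, not part of the vision-llama API.

```python
import torch


def axial_rotary_angles(height, width, dim_head):
    # Hypothetical helper: build per-position rotation angles where half of
    # the head dimension encodes the row index and half the column index.
    half = dim_head // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, 2).float() / half))
    ys = torch.arange(height).float()[:, None] * freqs[None, :]  # (H, half/2)
    xs = torch.arange(width).float()[:, None] * freqs[None, :]   # (W, half/2)
    # Broadcast each axis to the full (H, W) grid, then concatenate.
    ys = ys[:, None, :].expand(height, width, -1)
    xs = xs[None, :, :].expand(height, width, -1)
    return torch.cat([ys, xs], dim=-1)  # (H, W, half)


def apply_rotary(x, angles):
    # x: (..., H, W, dim_head). Rotate the two halves of the channel
    # dimension as (cos, sin) pairs; this preserves vector norms.
    x1, x2 = x.chunk(2, dim=-1)
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because the rotation is a pure position-dependent orthogonal transform of queries and keys, attention scores depend only on relative patch offsets along each axis.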