https://github.com/kyegomez/visionllama
Implementation of VisionLLaMA from the paper: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks" in PyTorch and Zeta
- Host: GitHub
- URL: https://github.com/kyegomez/visionllama
- Owner: kyegomez
- License: mit
- Created: 2024-03-04T05:29:05.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-11T20:18:35.000Z (11 months ago)
- Last Synced: 2025-06-10T05:03:19.432Z (4 months ago)
- Topics: ai, deep-learning, multi-modal, vision-models, vision-transformers, vit
- Language: Python
- Homepage: https://discord.gg/GYbXvDGevY
- Size: 2.19 MB
- Stars: 16
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
[Discord](https://discord.gg/qUtxnK2NMf)
# VisionLLaMA
Implementation of VisionLLaMA from the paper: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks" in PyTorch and Zeta. [PAPER LINK](https://arxiv.org/abs/2403.00522)

## install

`$ pip install vision-llama`

## usage
```python
import torch

from vision_llama.main import VisionLlama

# Input tensor
x = torch.randn(1, 3, 224, 224)

# Create an instance of the VisionLlama model with the specified parameters
model = VisionLlama(
    dim=768, depth=12, channels=3, heads=12, num_classes=1000
)

# Print the output tensor produced by a forward pass
print(model(x))
```
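With `num_classes=1000`, the forward pass produces class logits of shape `(batch, num_classes)`, from which a predicted label can be read off with `argmax`. A minimal sketch, using a random tensor as a stand-in for the (untrained) model's output:

```python
import torch

# Stand-in for `logits = model(x)` from the snippet above:
# a (batch, num_classes) tensor of raw class scores.
logits = torch.randn(1, 1000)

# Softmax turns logits into probabilities; argmax picks the top class.
probs = torch.softmax(logits, dim=-1)
pred = probs.argmax(dim=-1)

print(pred.shape)  # torch.Size([1])
```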
# License

MIT

## Citation
```bibtex
@misc{chu2024visionllama,
title={VisionLLaMA: A Unified LLaMA Interface for Vision Tasks},
author={Xiangxiang Chu and Jianlin Su and Bo Zhang and Chunhua Shen},
year={2024},
eprint={2403.00522},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

## todo
- [ ] Implement the AS2DRoPE rope; axial rotary embeddings may be used instead, as the current implementation needs work
- [x] Implement the GSA attention; implemented, but the current version needs improvement
- [ ] Add an ImageNet training script with distributed training
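For the axial-rotary-embeddings alternative mentioned above, a minimal sketch of 2D axial rotary position embeddings: half of each head dimension is rotated by the patch's row position, the other half by its column position. The function names here are hypothetical illustrations, not part of the vision-llama API.

```python
import torch


def axial_rotary_angles(height, width, dim_head):
    # Hypothetical helper: build per-position rotation angles where half of
    # the head dimension encodes the row index and half the column index.
    half = dim_head // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, 2).float() / half))
    ys = torch.arange(height).float()[:, None] * freqs[None, :]  # (H, half/2)
    xs = torch.arange(width).float()[:, None] * freqs[None, :]   # (W, half/2)
    # Broadcast each axis to the full (H, W) grid, then concatenate.
    ys = ys[:, None, :].expand(height, width, -1)
    xs = xs[None, :, :].expand(height, width, -1)
    return torch.cat([ys, xs], dim=-1)  # (H, W, half)


def apply_rotary(x, angles):
    # x: (..., H, W, dim_head). Rotate the two halves of the channel
    # dimension as (cos, sin) pairs; this preserves vector norms.
    x1, x2 = x.chunk(2, dim=-1)
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because the rotation is a pure position-dependent orthogonal transform of queries and keys, attention scores depend only on relative patch offsets along each axis.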