https://github.com/kyegomez/videovit
Open source implementation of a vision transformer that understands videos, using MaxViT as a foundation.
- Host: GitHub
- URL: https://github.com/kyegomez/videovit
- Owner: kyegomez
- License: MIT
- Created: 2023-10-08T02:49:33.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-11T21:33:53.000Z (over 1 year ago)
- Last Synced: 2025-04-14T05:18:06.767Z (7 months ago)
- Topics: attention, attention-is-all-you-need, attention-mechanism, gpt4, multimodal, vision-transformer
- Language: Python
- Homepage: https://discord.gg/qUtxnK2NMf
- Size: 2.18 MB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
[Discord](https://discord.gg/qUtxnK2NMf)
# Video Vit
Open source implementation of a vision transformer that understands videos, using MaxViT as a foundation. It uses MaxViT as the backbone ViT and packs the 5D video tensor into a 4D tensor, which becomes the input to the MaxViT model. I implemented this because the new McVit came out and I needed more practice. It is fully ready to train, and I believe it would perform well.
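One plausible way to pack a 5D video tensor into the 4D input a 2D backbone expects (this is a sketch of the general technique, not code taken from this repo) is to fold the frame axis into the batch axis. A minimal NumPy illustration with hypothetical shapes:

```python
import numpy as np

# Hypothetical shapes: batch of 2 videos, 3 channels, 10 frames, 32x32 pixels.
B, C, T, H, W = 2, 3, 10, 32, 32
video = np.random.randn(B, C, T, H, W)

# Move the frame axis next to the batch axis, then merge the two,
# so each frame becomes an independent "image" for a 2D backbone.
frames = video.transpose(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
print(frames.shape)  # (20, 3, 32, 32)
```

After the backbone runs, the per-frame features can be reshaped back to `(B, T, ...)` and aggregated over time.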
## Installation
`$ pip install video-vit`
## Usage
```python
import torch
from video_vit.main import VideoViT
# Instantiate the VideoViT model with the specified parameters
model = VideoViT(
num_classes=10, # Number of output classes
dim=64, # Dimension of the token embeddings
depth=(2, 2, 2), # Depth of each stage in the model
dim_head=32, # Dimension of the attention head
window_size=7, # Size of the attention window
    mbconv_expansion_rate=4,  # Expansion rate of the Mobile Inverted Bottleneck (MBConv) block
    mbconv_shrinkage_rate=0.25,  # Shrinkage rate of the MBConv squeeze-excitation
dropout=0.1, # Dropout rate
channels=3, # Number of input channels
)
# Create a random tensor with shape (batch_size, channels, frames, height, width)
x = torch.randn(1, 3, 10, 224, 224)
# Perform a forward pass through the model
output = model(x)
# Print the shape of the output tensor
print(output.shape)
```
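The `window_size=7` argument corresponds to MaxViT-style windowed attention, which splits the spatial feature map into non-overlapping `window x window` tiles and attends within each tile. A NumPy sketch of that partitioning step (shapes here are hypothetical, not taken from this repo's internals):

```python
import numpy as np

# Hypothetical 14x14 feature map with 4 channels and window_size=7.
H, W, C, window = 14, 14, 4, 7
feat = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)

# Partition into non-overlapping window x window tiles; attention is then
# computed independently inside each tile (MaxViT block attention).
tiles = (
    feat.reshape(H // window, window, W // window, window, C)
        .transpose(0, 2, 1, 3, 4)         # (nH, nW, window, window, C)
        .reshape(-1, window * window, C)  # (num_windows, tokens, C)
)
print(tiles.shape)  # (4, 49, 4)
```

MaxViT alternates this local block attention with a dilated "grid" attention that mixes information across tiles, which is what lets a fixed 7x7 window still cover the whole image.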
## License
MIT