https://github.com/shreydan/simplevlm
Building a simple VLM, combining LLaMA-SmolLM2 + SigLIP2
- Host: GitHub
- URL: https://github.com/shreydan/simplevlm
- Owner: shreydan
- Created: 2025-04-21T07:36:17.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-04-21T07:38:23.000Z (9 months ago)
- Last Synced: 2025-04-21T08:36:37.918Z (9 months ago)
- Topics: computer-vision, deep-learning, finetuning-llms, finetuning-vision-models, huggingface, llm, multimodal, nlp, pytorch, transformers, vision-language-model, vlm
- Language: Jupyter Notebook
- Homepage:
- Size: 7.33 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Vision Language Model
- Built by implementing LLaMA from scratch, loading the SFT weights from SmolLM2, and using SigLIP2 as the vision encoder.
- The vision projector is a LLaMA-style MLP projection block (~3M params); see the sketch below.
- Total model size: ~230M parameters
> See `eval.ipynb` for inference
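Assuming the standard LLaMA MLP shape (gate/up/down projections with SiLU) and the dimensions from the backbones and config listed below (768-dim SigLIP2-base features, 1536-dim intermediate, 576-dim LLM embeddings), a sketch of such a projector lands right around the quoted ~3M parameters. This is an illustrative module, not the repo's exact implementation:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """LLaMA-MLP-style projector: SigLIP2 patch features -> LLM embedding space."""

    def __init__(self, vision_dim: int = 768, intermediate_dim: int = 1536, embed_dim: int = 576):
        super().__init__()
        self.gate_proj = nn.Linear(vision_dim, intermediate_dim, bias=False)
        self.up_proj = nn.Linear(vision_dim, intermediate_dim, bias=False)
        self.down_proj = nn.Linear(intermediate_dim, embed_dim, bias=False)
        self.act = nn.SiLU()

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_image_tokens, vision_dim)
        return self.down_proj(self.act(self.gate_proj(image_features)) * self.up_proj(image_features))


proj = VisionProjector()
print(sum(p.numel() for p in proj.parameters()) / 1e6)  # ~3.2M, consistent with the ~3M above
x = torch.randn(1, 196, 768)  # 196 patch tokens from siglip2-base-patch16-224: (224 / 16) ** 2
print(proj(x).shape)          # torch.Size([1, 196, 576])
```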
# Backbone
```
text-backbone: HuggingFaceTB/SmolLM2-135M-Instruct
vision-backbone: google/siglip2-base-patch16-224
```
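For reference, the pretrained backbones can be pulled from the Hub as shown below. This is only a loading sketch, not the repo's code (the repo re-implements the LLaMA side from scratch and copies in the SmolLM2 weights), and SigLIP2 support requires a recent `transformers` release:

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoProcessor, AutoTokenizer

text_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
vision_id = "google/siglip2-base-patch16-224"

# text backbone: tokenizer + causal LM, loaded in bf16
tokenizer = AutoTokenizer.from_pretrained(text_id)
text_backbone = AutoModelForCausalLM.from_pretrained(text_id, torch_dtype=torch.bfloat16)

# vision backbone: SigLIP2 image preprocessing + model, loaded in bf16
processor = AutoProcessor.from_pretrained(vision_id)
vision_backbone = AutoModel.from_pretrained(vision_id, torch_dtype=torch.bfloat16)
```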
# Config
```python
import torch

class config:
    embed_dim = 576
    intermediate_dim = 1536
    max_position_embeddings = 8192
    base_theta = 100000
    num_q_heads = 9
    num_kv_heads = 3
    attn_dropout = 0.0
    num_layers = 30
    vocab_size = 49152
    eos_token_id = 2
    dtype = torch.bfloat16
    num_image_tokens = 196
```
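A few quantities implied by this config (a quick sanity-check sketch, not repo code): 9 query heads over a 576-dim embedding give 64-dim heads, and with 3 KV heads each KV head is shared by 3 query heads (grouped-query attention):

```python
embed_dim, num_q_heads, num_kv_heads = 576, 9, 3

head_dim = embed_dim // num_q_heads       # 64
q_per_kv = num_q_heads // num_kv_heads    # 3 query heads share each KV head (GQA)
kv_dim = num_kv_heads * head_dim          # 192-dim K/V projections

assert embed_dim % num_q_heads == 0 and num_q_heads % num_kv_heads == 0
print(head_dim, q_per_kv, kv_dim)         # 64 3 192
```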
# Datasets
```
stage1: theblackcat102/llava-instruct-mix
stage2: openbmb/RLAIF-V-Dataset
```
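Both datasets are on the Hugging Face Hub; a hypothetical loading sketch (the split name and columns are assumptions, check the dataset cards):

```python
from datasets import load_dataset

stage1 = load_dataset("theblackcat102/llava-instruct-mix", split="train")
stage2 = load_dataset("openbmb/RLAIF-V-Dataset", split="train")

print(stage1)  # inspect the columns before building chat/image preprocessing
print(stage2)
```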
- Super basic training, done entirely in bf16 (see the minimal sketch below)
- The results are decent
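A minimal illustration of the bf16 setup, where a toy linear head and a random batch stand in for the actual VLM and dataloader (this is not the repo's training loop):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# toy stand-in for the LM head: embed_dim -> vocab_size, kept in bf16 end to end
model = nn.Linear(576, 49152).to(device=device, dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

hidden = torch.randn(4, 576, device=device, dtype=torch.bfloat16)  # fake hidden states
labels = torch.randint(0, 49152, (4,), device=device)

optimizer.zero_grad()
logits = model(hidden).float()   # upcast logits so the loss is computed in fp32
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```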