https://github.com/shreydan/simplevlm
Building a simple VLM, combining LLaMA-SmolLM2 + SigLIP2
- Host: GitHub
- URL: https://github.com/shreydan/simplevlm
- Owner: shreydan
- Created: 2025-04-21T07:36:17.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-04-21T07:38:23.000Z (9 months ago)
- Last Synced: 2025-04-21T08:36:37.918Z (9 months ago)
- Topics: computer-vision, deep-learning, finetuning-llms, finetuning-vision-models, huggingface, llm, multimodal, nlp, pytorch, transformers, vision-language-model, vlm
- Language: Jupyter Notebook
- Homepage:
- Size: 7.33 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Vision Language Model
- Built by implementing LLaMA from scratch, loading the SFT weights from SmolLM2, and using SigLIP2 as the vision encoder.
- The vision projector is a LLaMA-style MLP projection block (~3M params); see the sketch below.
- Total model size: ~230M parameters
> See `eval.ipynb` for inference
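Assuming the standard LLaMA MLP shape (gate/up/down projections with SiLU) and the dimensions from the backbones and config listed below (768-dim SigLIP2-base features, 1536-dim intermediate, 576-dim LLM embeddings), a sketch of such a projector lands right around the quoted ~3M parameters. This is an illustrative module, not the repo's exact implementation:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """LLaMA-MLP-style projector: SigLIP2 patch features -> LLM embedding space."""

    def __init__(self, vision_dim: int = 768, intermediate_dim: int = 1536, embed_dim: int = 576):
        super().__init__()
        self.gate_proj = nn.Linear(vision_dim, intermediate_dim, bias=False)
        self.up_proj = nn.Linear(vision_dim, intermediate_dim, bias=False)
        self.down_proj = nn.Linear(intermediate_dim, embed_dim, bias=False)
        self.act = nn.SiLU()

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_image_tokens, vision_dim)
        return self.down_proj(self.act(self.gate_proj(image_features)) * self.up_proj(image_features))


proj = VisionProjector()
print(sum(p.numel() for p in proj.parameters()) / 1e6)  # ~3.2M, consistent with the ~3M above
x = torch.randn(1, 196, 768)  # 196 patch tokens from siglip2-base-patch16-224: (224 / 16) ** 2
print(proj(x).shape)          # torch.Size([1, 196, 576])
```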
# Backbone
```
text-backbone: HuggingFaceTB/SmolLM2-135M-Instruct
vision-backbone: google/siglip2-base-patch16-224
```
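For reference, the pretrained backbones can be pulled from the Hub as shown below. This is only a loading sketch, not the repo's code (the repo re-implements the LLaMA side from scratch and copies in the SmolLM2 weights), and SigLIP2 support requires a recent `transformers` release:

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoProcessor, AutoTokenizer

text_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
vision_id = "google/siglip2-base-patch16-224"

# text backbone: tokenizer + causal LM, loaded in bf16
tokenizer = AutoTokenizer.from_pretrained(text_id)
text_backbone = AutoModelForCausalLM.from_pretrained(text_id, torch_dtype=torch.bfloat16)

# vision backbone: SigLIP2 image preprocessing + model, loaded in bf16
processor = AutoProcessor.from_pretrained(vision_id)
vision_backbone = AutoModel.from_pretrained(vision_id, torch_dtype=torch.bfloat16)
```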
# Config
```python
import torch

class config:
    embed_dim = 576
    intermediate_dim = 1536
    max_position_embeddings = 8192
    base_theta = 100000
    num_q_heads = 9
    num_kv_heads = 3
    attn_dropout = 0.0
    num_layers = 30
    vocab_size = 49152
    eos_token_id = 2
    dtype = torch.bfloat16
    num_image_tokens = 196
```
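A few quantities implied by this config (a quick sanity-check sketch, not repo code): 9 query heads over a 576-dim embedding give 64-dim heads, and with 3 KV heads each KV head is shared by 3 query heads (grouped-query attention):

```python
embed_dim, num_q_heads, num_kv_heads = 576, 9, 3

head_dim = embed_dim // num_q_heads       # 64
q_per_kv = num_q_heads // num_kv_heads    # 3 query heads share each KV head (GQA)
kv_dim = num_kv_heads * head_dim          # 192-dim K/V projections

assert embed_dim % num_q_heads == 0 and num_q_heads % num_kv_heads == 0
print(head_dim, q_per_kv, kv_dim)         # 64 3 192
```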
# Datasets
```
stage1: theblackcat102/llava-instruct-mix
stage2: openbmb/RLAIF-V-Dataset
```
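Both datasets are on the Hugging Face Hub; a hypothetical loading sketch (the split name and columns are assumptions, check the dataset cards):

```python
from datasets import load_dataset

stage1 = load_dataset("theblackcat102/llava-instruct-mix", split="train")
stage2 = load_dataset("openbmb/RLAIF-V-Dataset", split="train")

print(stage1)  # inspect the columns before building chat/image preprocessing
print(stage2)
```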
- Super basic training, done entirely in bf16 (see the minimal sketch below)
- The results are decent
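A minimal illustration of the bf16 setup, where a toy linear head and a random batch stand in for the actual VLM and dataloader (this is not the repo's training loop):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# toy stand-in for the LM head: embed_dim -> vocab_size, kept in bf16 end to end
model = nn.Linear(576, 49152).to(device=device, dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

hidden = torch.randn(4, 576, device=device, dtype=torch.bfloat16)  # fake hidden states
labels = torch.randint(0, 49152, (4,), device=device)

optimizer.zero_grad()
logits = model(hidden).float()   # upcast logits so the loss is computed in fp32
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```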