VLMs zero-to-hero
coming: january 2025...
# hello
Welcome to VLMs Zero to Hero! This series will take you on a journey from the
fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

# tutorials
| **notebook** | **open in colab** | **video** | **paper** |
|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------:|:---------------------------------------:|
| 01.01. [Word2Vec: Distributed Representations of Words and Phrases and their Compositionality](https://github.com/SkalskiP/vlms-zero-to-hero/blob/master/01_natural_language_processing_fundamentals/01_01_word2vec_with_sub_sampling_and_negative_sampling_in_pytorch.ipynb) | [link](https://colab.research.google.com/github/SkalskiP/vlms-zero-to-hero/blob/master/01_natural_language_processing_fundamentals/01_01_word2vec_with_sub_sampling_and_negative_sampling_in_pytorch.ipynb) | soon | [link](https://arxiv.org/abs/1310.4546) |
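
As a rough sketch of what the first notebook covers, the snippet below implements skip-gram with negative sampling in PyTorch: each center word is pushed toward its observed context word and away from a handful of randomly sampled negatives. The class name, embedding size, and negative count are illustrative and not taken from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNegativeSampling(nn.Module):
    """Minimal skip-gram model trained with the negative-sampling objective."""

    def __init__(self, vocab_size: int, embedding_dim: int = 100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, embedding_dim)  # context-word vectors

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, K) word indices
        v = self.in_embed(center)                  # (B, D)
        u_pos = self.out_embed(context)            # (B, D)
        u_neg = self.out_embed(negatives)          # (B, K, D)

        pos_score = (v * u_pos).sum(dim=-1)                        # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)  # (B, K)

        # maximize log sigmoid(u_pos . v) + sum_k log sigmoid(-u_neg_k . v)
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=-1))
        return loss.mean()

model = SkipGramNegativeSampling(vocab_size=10_000)
loss = model(torch.randint(0, 10_000, (32,)),
             torch.randint(0, 10_000, (32,)),
             torch.randint(0, 10_000, (32, 5)))
```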

# roadmap

## natural language processing (NLP) fundamentals
- Word2Vec: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781) (2013) and [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/pdf/1310.4546) (2013)
- Seq2Seq: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215) (2014)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762) (2017) (see the attention sketch after this list)
- BERT: [Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805) (2018)
- GPT: [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) (2018)
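
To make the attention entry above concrete, here is a minimal sketch of scaled dot-product attention, the core operation introduced in "Attention Is All You Need"; the function name and tensor shapes are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                  # attention weights
    return weights @ v

q = k = v = torch.randn(2, 8, 10, 64)                # (batch, heads, seq, head_dim)
out = scaled_dot_product_attention(q, k, v)          # (2, 8, 10, 64)
```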

## computer vision (CV) fundamentals

- AlexNet: [ImageNet Classification with Deep Convolutional Neural Networks](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) (2012)
- VGG: [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556) (2014)
- ResNet: [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) (2015)
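
The ResNet entry comes down to the residual connection, so a minimal sketch of a basic residual block is included below; it assumes input and output channel counts match and omits the projection shortcut and striding used in the full architecture.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A ResNet-style basic block: output = ReLU(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)              # residual addition

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))                 # shape preserved: (1, 64, 56, 56)
```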

## early vision-language models

- [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555) (2014) and [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/pdf/1502.03044) (2015)
- ViT: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929) (2020)
- CLIP: [Learning Transferable Visual Models from Natural Language Supervision](https://arxiv.org/pdf/2103.00020) (2021)
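
CLIP is trained with a symmetric contrastive loss over a batch of matched image-text pairs. The sketch below is a simplified version: it uses a fixed temperature instead of the learned one from the paper, and the feature dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    # L2-normalize so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    logits = image_features @ text_features.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))                      # matched pairs sit on the diagonal

    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```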

## scale and efficiency

- [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361) (2020)
- LoRA: [Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685) (2021) (see the sketch after this list)
- QLoRA: [Efficient Fine-tuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314) (2023)
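
The LoRA entry above boils down to adding a trainable low-rank update to a frozen pretrained weight matrix. Below is a minimal single-layer sketch with illustrative rank and scaling values; QLoRA keeps the same adapter idea but stores the frozen base weights in 4-bit precision.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)        # pretrained weights stay frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.t() @ self.lora_b.t()) * self.scaling

layer = LoRALinear(768, 768, r=8)
out = layer(torch.randn(4, 768))
```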

## modern vision-language models

- Flamingo: [A Visual Language Model for Few-Shot Learning](https://arxiv.org/pdf/2204.14198) (2022)
- LLaVA: [Visual Instruction Tuning](https://arxiv.org/pdf/2304.08485) (2023)
- BLIP-2: [Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/pdf/2301.12597) (2023)
- PaliGemma: [A versatile 3B VLM for transfer](https://arxiv.org/pdf/2407.07726) (2024)
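
A recurring pattern in this section, made explicit in LLaVA, is to project features from a frozen vision encoder into the language model's embedding space and feed them to the LLM as extra tokens. The sketch below shows such a projector as a small MLP; the dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """Maps vision-encoder patch embeddings into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):
        # patch_embeddings: (B, num_patches, vision_dim) from a frozen image encoder
        return self.proj(patch_embeddings)             # (B, num_patches, llm_dim)

projector = VisionToLanguageProjector()
image_tokens = projector(torch.randn(1, 576, 1024))    # ready to prepend to text embeddings
# the LLM then attends over [image_tokens; text_token_embeddings] as one sequence
```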

## extra

- BLEU: [a Method for Automatic Evaluation of Machine Translation](https://aclanthology.org/P02-1040.pdf) (2002)
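
For completeness, BLEU can be computed with NLTK's reference implementation; this is a minimal example assuming the nltk package is installed, with made-up reference and hypothesis tokens.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "cat", "sits", "on", "the", "mat"]]   # list of tokenized reference sentences
hypothesis = ["a", "cat", "is", "on", "the", "mat"]      # tokenized candidate sentence

# 4-gram BLEU with uniform weights; smoothing avoids zero scores on short sentences
score = sentence_bleu(reference, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```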
# contribute and suggest more papers
Are there important papers, models, or techniques we missed? Do you have a favorite
breakthrough in vision-language research that isn't listed here? We’d love to hear
your suggestions!