awesome-foundation-and-multimodal-models
👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code + Examples + Tutorials]
https://github.com/SkalskiP/awesome-foundation-and-multimodal-models
🤖 models
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
  - [Hugging Face Space: haoheliu/audioldm2-text2audio-text2music](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  - [Hugging Face model: Salesforce/blip2-opt-6.7b](https://huggingface.co/Salesforce/blip2-opt-6.7b)
  - [Colab notebook](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb)
- OWL-ViT: Simple Open-Vocabulary Object Detection with Vision Transformers
  - [Hugging Face Space: adirik/OWL-ViT](https://huggingface.co/spaces/adirik/OWL-ViT)
  - [Hugging Face model: google/owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32)
- Depth Anything
  - [Hugging Face Space: LiheYoung/Depth-Anything](https://huggingface.co/spaces/LiheYoung/Depth-Anything)
  - [Hugging Face model: LiheYoung/depth_anything_vitl14](https://huggingface.co/LiheYoung/depth_anything_vitl14)
  - [Colab notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Depth%20Anything/Predicting_depth_in_an_image_with_Depth_Anything.ipynb)
- EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
  - [Hugging Face model: merve/EfficientSAM](https://huggingface.co/merve/EfficientSAM)
- ImageBind: One Embedding Space To Bind Them All
- LLaVA: Large Language and Vision Assistant
  - [Hugging Face Space: badayvedat/LLaVA](https://huggingface.co/spaces/badayvedat/LLaVA)
  - [Hugging Face model: liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b)
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
  - [Colab notebook](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-use-openai-clip-classification.ipynb)
- CogVLM: Visual Expert for Pretrained Language Models
  - [Hugging Face model: THUDM/CogVLM](https://huggingface.co/THUDM/CogVLM)
- Fuyu-8B: A Multimodal Architecture for AI Agents
  - [Hugging Face model: adept/fuyu-8b](https://huggingface.co/adept/fuyu-8b)
- Ferret: Refer and Ground Anything Anywhere at Any Granularity
  - [Hugging Face model: openflamingo/OpenFlamingo-9B-vitl-mpt7b](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b)
- Segment Anything
  - [YouTube video](https://youtu.be/D-D6ZmadzPE)
  - [Hugging Face Space: radames/candle-segment-anything-wasm](https://huggingface.co/spaces/radames/candle-segment-anything-wasm)
  - [Hugging Face model: facebook/sam-vit-base](https://huggingface.co/facebook/sam-vit-base)
  - [Colab notebook](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-segment-anything-with-sam.ipynb)
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
  - [Hugging Face model: openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
  - [Colab notebook](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb)
- SigLIP: Sigmoid Loss for Language Image Pre-Training
  - [Hugging Face Space: merve/compare_clip_siglip](https://huggingface.co/spaces/merve/compare_clip_siglip)
  - [Hugging Face model: openai/clip-vit-base-patch16](https://huggingface.co/openai/clip-vit-base-patch16)
  - [Colab notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/SigLIP/Inference_with_(multilingual)_SigLIP%2C_a_better_CLIP_model.ipynb)
- YOLO-World: Real-Time Open-Vocabulary Object Detection
  - [YouTube video](https://youtu.be/X7gKBGVz4vs)
  - [Hugging Face Space: SkalskiP/YOLO-World](https://huggingface.co/spaces/SkalskiP/YOLO-World)
  - [Colab notebook](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-yolo-world.ipynb)
- MetaCLIP: Demystifying CLIP Data
  - [Hugging Face model: facebook/metaclip-b32-400m](https://huggingface.co/facebook/metaclip-b32-400m)
  - [Colab notebook](https://colab.research.google.com/drive/1V0Rv1QQJkcolTjiwJuRsqWycROvYjOwg?usp=sharing)
- Nougat: Neural Optical Understanding for Academic Documents
  - [Hugging Face model: facebook/nougat-small](https://huggingface.co/facebook/nougat-small)
  - [Colab notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Nougat/Inference_with_Nougat_to_read_scientific_PDFs.ipynb)
- Kosmos-2: Grounding Multimodal Large Language Models to the World
  - [Hugging Face Space: ydshieh/Kosmos-2](https://huggingface.co/spaces/ydshieh/Kosmos-2)
  - [Hugging Face model: microsoft/kosmos-2-patch14-224](https://huggingface.co/microsoft/kosmos-2-patch14-224)
- OWLv2: Scaling Open-Vocabulary Object Detection
  - [Hugging Face Space: merve/owlv2](https://huggingface.co/spaces/merve/owlv2)
  - [Hugging Face model: google/owlv2-base-patch16-ensemble](https://huggingface.co/google/owlv2-base-patch16-ensemble)
  - [Colab notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/OWLv2/Zero_and_one_shot_object_detection_with_OWLv2.ipynb)
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
  - [YouTube video](https://youtu.be/cMa77r3YrDk)
  - [Hugging Face Space: ShilongLiu/Grounding_DINO_demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)
  - [Hugging Face Space: merve/Grounding_DINO_demo](https://huggingface.co/spaces/merve/Grounding_DINO_demo)
  - [Colab notebook](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  - [Hugging Face Space: Qwen/Qwen-VL-Max](https://huggingface.co/spaces/Qwen/Qwen-VL-Max)
  - [Hugging Face model: Qwen/Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)