# Awesome-Foundation-Models

A curated list of foundation models for vision and language tasks.

https://github.com/uncbiag/Awesome-Foundation-Models
- The Evolution of Multimodal Model Architectures
- Efficient Multimodal Large Language Models: A Survey
- Foundation Models for Video Understanding: A Survey
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
- Prospective Role of Foundation Models in Advancing Autonomous Vehicles
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
- A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
- Large Multimodal Agents: A Survey
- The Uncanny Valley: A Comprehensive Analysis of Diffusion Models
- Real-World Robot Applications of Foundation Models: A Review
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
- Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants
- Towards Generalist Foundation Model for Radiology
- Foundational Models Defining a New Era in Vision: A Survey and Outlook
- Towards Generalist Biomedical AI
- A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
- Large Multimodal Models: Notes on CVPR 2023 Tutorial
- A Survey on Multimodal Large Language Models
- Vision-Language Models for Vision Tasks: A Survey
- Foundation Models for Generalist Medical Artificial Intelligence
- A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
- Vision-language pre-training: Basics, recent advances, and future trends
- On the Opportunities and Risks of Foundation Models
- 05/20 - Octo [![Star](https://img.shields.io/github/stars/octo-models/octo.svg?style=social&label=Star)](https://github.com/octo-models/octo)
- 05/09 - Lumina-T2X [![Star](https://img.shields.io/github/stars/Alpha-VLLM/Lumina-T2X.svg?style=social&label=Star)](https://github.com/Alpha-VLLM/Lumina-T2X)
- 05/03 - Vibe-Eval [![Star](https://img.shields.io/github/stars/reka-ai/reka-vibe-eval.svg?style=social&label=Star)](https://github.com/reka-ai/reka-vibe-eval)
- 03/09 - uniGradICON (from UNC-Chapel Hill) [![Star](https://img.shields.io/github/stars/uncbiag/uniGradICON.svg?style=social&label=Star)](https://github.com/uncbiag/uniGradICON)
- 03/01 - VisionLLaMA [![Star](https://img.shields.io/github/stars/Meituan-AutoML/VisionLLaMA.svg?style=social&label=Star)](https://github.com/Meituan-AutoML/VisionLLaMA)
- 02/28 - Consistency LLM [![Star](https://img.shields.io/github/stars/hao-ai-lab/Consistency_LLM.svg?style=social&label=Star)](https://github.com/hao-ai-lab/Consistency_LLM)
- 02/20 - Neural Network Diffusion [![Star](https://img.shields.io/github/stars/NUS-HPC-AI-Lab/Neural-Network-Diffusion.svg?style=social&label=Star)](https://github.com/NUS-HPC-AI-Lab/Neural-Network-Diffusion)
- 02/06 - MobileVLM [![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star)](https://github.com/Meituan-AutoML/MobileVLM)
- 01/30 - YOLO-World [![Star](https://img.shields.io/github/stars/AILab-CVC/YOLO-World.svg?style=social&label=Star)](https://github.com/AILab-CVC/YOLO-World)
- 01/22 - CheXagent [![Star](https://img.shields.io/github/stars/Stanford-AIMI/CheXagent.svg?style=social&label=Star)](https://github.com/Stanford-AIMI/CheXagent)
- 01/19 - Depth Anything [![Star](https://img.shields.io/github/stars/LiheYoung/Depth-Anything.svg?style=social&label=Star)](https://github.com/LiheYoung/Depth-Anything)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (matches similarly-sized Transformers while scaling linearly with sequence length; from CMU)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
- Tracking Everything Everywhere All at Once
- Foundation Models for Generalist Geospatial Artificial Intelligence
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
- Meta-Transformer: A Unified Framework for Multimodal Learning
- Retentive Network: A Successor to Transformer for Large Language Models
- Neural World Models for Computer Vision
- Recognize Anything: A Strong Image Tagging Model
- Towards Visual Foundation Models of Physical Scenes (towards general-purpose visual representations of physical scenes)
- LIMA: Less Is More for Alignment
- PaLM 2 Technical Report
- IMAGEBIND: One Embedding Space To Bind Them All
- Visual Instruction Tuning (from UW-Madison and Microsoft) [![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social&label=Star)](https://github.com/haotian-liu/LLaVA)
- SEEM: Segment Everything Everywhere All at Once (from UW-Madison, HKUST, and Microsoft) [![Star](https://img.shields.io/github/stars/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.svg?style=social&label=Star)](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
- SAM: Segment Anything [![Star](https://img.shields.io/github/stars/facebookresearch/segment-anything.svg?style=social&label=Star)](https://github.com/facebookresearch/segment-anything)
- SegGPT: Segmenting Everything In Context
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning
- UniDetector: Detecting Everything in the Open World: Towards Universal Object Detection
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models
- Visual Prompt Multi-Modal Tracking
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- EVA-02: A Visual Representation for Neon Genesis
- EVA-01: Exploring the Limits of Masked Visual Representation Learning at Scale
- LLaMA: Open and Efficient Foundation Language Models
- The effectiveness of MAE pre-pretraining for billion-scale pretraining
- BloombergGPT: A Large Language Model for Finance
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- FLIP: Scaling Language-Image Pre-training via Masking
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- GPT-4 Technical Report
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
- BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
- Unified Vision and Language Prompt Learning
- BEVT: BERT Pretraining of Video Transformers
- Foundation Transformers
- A Generalist Agent (a multi-modal, multi-task, multi-embodiment generalist agent; from DeepMind)
- FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
- Flamingo: a Visual Language Model for Few-Shot Learning
- MetaLM: Language Models are General-Purpose Interfaces
- Point-E: A System for Generating 3D Point Clouds from Complex Prompts (leverages a text-to-image diffusion model; from OpenAI)
- Image Segmentation Using Text and Image Prompts
- Unifying Flow, Stereo and Depth Estimation
- PaLI: A Jointly-Scaled Multilingual Language-Image Model
- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- SLIP: Self-supervision meets Language-Image Pre-training
- GLIPv2: Unifying Localization and VL Understanding
- GLIP: Grounded Language-Image Pre-training
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
- PaLM: Scaling Language Modeling with Pathways
- CoCa: Contrastive Captioners are Image-Text Foundation Models
- Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
- A Unified Sequence Interface for Vision Tasks
- Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench: a 204-task extremely difficult and diverse benchmark for LLMs; 444 authors from 132 institutions)
- CRIS: CLIP-Driven Referring Image Segmentation
- Masked Autoencoders As Spatiotemporal Learners
- Masked Autoencoders Are Scalable Vision Learners
- InstructGPT: Training language models to follow instructions with human feedback
- DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents
- Robust and Efficient Medical Imaging with Self-Supervision
- Video Swin Transformer
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation
- FLAVA: A Foundational Language And Vision Alignment Model
- Towards artificial general intelligence via a multimodal foundation model
- FILIP: Fine-Grained Interactive Language-Image Pre-Training
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- Unifying Vision-and-Language Tasks via Text Generation (from UNC-Chapel Hill)
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- UniT: Multimodal Multitask Learning with a Unified Transformer
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training (a large-scale Chinese multimodal pre-training model called BriVL; from Renmin University of China)
- Codex: Evaluating Large Language Models Trained on Code
- Florence: A New Foundation Model for Computer Vision
- DALL-E: Zero-Shot Text-to-Image Generation
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- Multimodal Few-Shot Learning with Frozen Language Models
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT: images as sequences of patches processed by self-attention blocks; ICLR, from Google)
- GPT-3: Language Models are Few-Shot Learners (shows emergent in-context learning compared with GPT-2; from OpenAI)
- UNITER: UNiversal Image-TExt Representation Learning
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- GPT-2: Language Models are Unsupervised Multitask Learners
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers (from UNC-Chapel Hill)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT: Improving Language Understanding by Generative Pre-Training
- Attention Is All You Need
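Nearly every model in the list above, from the original Transformer through GPT and ViT, is built on scaled dot-product attention. A minimal NumPy sketch (array shapes are illustrative, not from any particular paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K, V: (seq_len, d_k) arrays; returns a (seq_len, d_k) array
    where each row is a weighted average of the rows of V.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Multi-head attention simply runs several such maps in parallel on learned projections of the input and concatenates the results.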
- GPT-4 Technical Report
- GPT-3: Language Models are Few-Shot Learners (shows emergent in-context learning compared with GPT-2; from OpenAI)
- GPT-2: Language Models are Unsupervised Multitask Learners
- GPT: Improving Language Understanding by Generative Pre-Training
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [![Star](https://img.shields.io/github/stars/google-research/text-to-text-transfer-transformer.svg?style=social&label=Star)](https://github.com/google-research/text-to-text-transfer-transformer)
- BLINK: Multimodal Large Language Models Can See but Not Perceive
- CAD-Estate: Large-scale CAD Model Annotation in RGB Videos
- ImageNet: A Large-Scale Hierarchical Image Database
- FLIP: Scaling Language-Image Pre-training via Masking
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (leverages off-the-shelf frozen vision and language models; from Salesforce Research)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- SLIP: Self-supervision meets Language-Image Pre-training
- GLIP: Grounded Language-Image Pre-training
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- RegionCLIP: Region-Based Language-Image Pretraining
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
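The CLIP/ALIGN family above is trained with a symmetric contrastive (InfoNCE-style) objective over paired image and text embeddings. A minimal NumPy sketch, with batch size, embedding dimension, and temperature chosen only for illustration:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a positive pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # i-th image matches i-th text

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average of image-to-text and text-to-image classification losses
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
print(float(loss))
```

Minimizing this loss pulls matching image/text pairs together and pushes mismatched pairs apart, which is what enables the zero-shot transfer these models are known for.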
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
- SEEM: Segment Everything Everywhere All at Once (from UW-Madison, HKUST, and Microsoft)
- SAM: Segment Anything [![Star](https://img.shields.io/github/stars/facebookresearch/segment-anything.svg?style=social&label=Star)](https://github.com/facebookresearch/segment-anything)
- SegGPT: Segmenting Everything In Context
- Green AI
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models
- Managing Extreme AI Risks amid Rapid Progress
- Awesome-Diffusion-Models [![Star](https://img.shields.io/github/stars/diff-usion/Awesome-Diffusion-Models.svg?style=social&label=Star)](https://github.com/diff-usion/Awesome-Diffusion-Models)
- Awesome-Video-Diffusion-Models [![Star](https://img.shields.io/github/stars/ChenHsing/Awesome-Video-Diffusion-Models.svg?style=social&label=Star)](https://github.com/ChenHsing/Awesome-Video-Diffusion-Models)
- Awesome-Diffusion-Model-Based-Image-Editing-Methods [![Star](https://img.shields.io/github/stars/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods.svg?style=social&label=Star)](https://github.com/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods)
- Awesome-CV-Foundational-Models [![Star](https://img.shields.io/github/stars/awaisrauf/Awesome-CV-Foundational-Models.svg?style=social&label=Star)](https://github.com/awaisrauf/Awesome-CV-Foundational-Models)
- Awesome-Healthcare-Foundation-Models [![Star](https://img.shields.io/github/stars/Jianing-Qiu/Awesome-Healthcare-Foundation-Models.svg?style=social&label=Star)](https://github.com/Jianing-Qiu/Awesome-Healthcare-Foundation-Models)
- awesome-large-multimodal-agents [![Star](https://img.shields.io/github/stars/jun0wanan/awesome-large-multimodal-agents.svg?style=social&label=Star)](https://github.com/jun0wanan/awesome-large-multimodal-agents)
- Computer Vision in the Wild (CVinW) [![Star](https://img.shields.io/github/stars/Computer-Vision-in-the-Wild/CVinW_Readings.svg?style=social&label=Star)](https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings)