# Awesome-Foundation-Models

A curated list of foundation models for vision and language tasks.

https://github.com/uncbiag/Awesome-Foundation-Models
- The Evolution of Multimodal Model Architectures
- Efficient Multimodal Large Language Models: A Survey
- Foundation Models for Video Understanding: A Survey
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
- Prospective Role of Foundation Models in Advancing Autonomous Vehicles
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
- A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
- Large Multimodal Agents: A Survey
- The Uncanny Valley: A Comprehensive Analysis of Diffusion Models
- Real-World Robot Applications of Foundation Models: A Review
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
- Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants
- Towards Generalist Foundation Model for Radiology
- Foundational Models Defining a New Era in Vision: A Survey and Outlook
- Towards Generalist Biomedical AI
- A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
- Large Multimodal Models: Notes on CVPR 2023 Tutorial
- A Survey on Multimodal Large Language Models
- Vision-Language Models for Vision Tasks: A Survey
- Foundation Models for Generalist Medical Artificial Intelligence
- A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
- Vision-language pre-training: Basics, recent advances, and future trends
- On the Opportunities and Risks of Foundation Models
- 05/20 - Octo [![Star](https://img.shields.io/github/stars/octo-models/octo.svg?style=social&label=Star)](https://github.com/octo-models/octo)
- 05/09 - Lumina-T2X [![Star](https://img.shields.io/github/stars/Alpha-VLLM/Lumina-T2X.svg?style=social&label=Star)](https://github.com/Alpha-VLLM/Lumina-T2X)
- 05/03 - Vibe-Eval [![Star](https://img.shields.io/github/stars/reka-ai/reka-vibe-eval.svg?style=social&label=Star)](https://github.com/reka-ai/reka-vibe-eval)
- 03/09 - uniGradICON (from UNC-Chapel Hill) [![Star](https://img.shields.io/github/stars/uncbiag/uniGradICON.svg?style=social&label=Star)](https://github.com/uncbiag/uniGradICON)
- 03/01 - VisionLLaMA [![Star](https://img.shields.io/github/stars/Meituan-AutoML/VisionLLaMA.svg?style=social&label=Star)](https://github.com/Meituan-AutoML/VisionLLaMA)
- 02/28 - Consistency LLM [![Star](https://img.shields.io/github/stars/hao-ai-lab/Consistency_LLM.svg?style=social&label=Star)](https://github.com/hao-ai-lab/Consistency_LLM)
- 02/20 - Neural Network Diffusion [![Star](https://img.shields.io/github/stars/NUS-HPC-AI-Lab/Neural-Network-Diffusion.svg?style=social&label=Star)](https://github.com/NUS-HPC-AI-Lab/Neural-Network-Diffusion)
- 02/06 - MobileVLM [![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star)](https://github.com/Meituan-AutoML/MobileVLM)
- 01/30 - YOLO-World [![Star](https://img.shields.io/github/stars/AILab-CVC/YOLO-World.svg?style=social&label=Star)](https://github.com/AILab-CVC/YOLO-World)
- 01/22 - CheXagent [![Star](https://img.shields.io/github/stars/Stanford-AIMI/CheXagent.svg?style=social&label=Star)](https://github.com/Stanford-AIMI/CheXagent)
- 01/19 - Depth Anything [![Star](https://img.shields.io/github/stars/LiheYoung/Depth-Anything.svg?style=social&label=Star)](https://github.com/LiheYoung/Depth-Anything)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (matches similarly-sized Transformers while scaling linearly with sequence length; from CMU)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
- Tracking Everything Everywhere All at Once
- Foundation Models for Generalist Geospatial Artificial Intelligence
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
- Meta-Transformer: A Unified Framework for Multimodal Learning
- Retentive Network: A Successor to Transformer for Large Language Models
- Neural World Models for Computer Vision
- Recognize Anything: A Strong Image Tagging Model
- Towards Visual Foundation Models of Physical Scenes (towards general-purpose visual representations of physical scenes)
- LIMA: Less Is More for Alignment
- PaLM 2 Technical Report
- IMAGEBIND: One Embedding Space To Bind Them All
- Visual Instruction Tuning (from UW-Madison and Microsoft) [![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social&label=Star)](https://github.com/haotian-liu/LLaVA)
- SEEM: Segment Everything Everywhere All at Once (from UW-Madison, HKUST, and Microsoft) [![Star](https://img.shields.io/github/stars/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.svg?style=social&label=Star)](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
- SAM: Segment Anything [![Star](https://img.shields.io/github/stars/facebookresearch/segment-anything.svg?style=social&label=Star)](https://github.com/facebookresearch/segment-anything)
- SegGPT: Segmenting Everything In Context
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning
- UniDetector: Detecting Everything in the Open World: Towards Universal Object Detection
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models
- Visual Prompt Multi-Modal Tracking
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- EVA-02: A Visual Representation for Neon Genesis
- EVA-01: Exploring the Limits of Masked Visual Representation Learning at Scale
- LLaMA: Open and Efficient Foundation Language Models
- The effectiveness of MAE pre-pretraining for billion-scale pretraining
- BloombergGPT: A Large Language Model for Finance
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- FLIP: Scaling Language-Image Pre-training via Masking
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- GPT-4 Technical Report
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
- BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
- Unified Vision and Language Prompt Learning
- BEVT: BERT Pretraining of Video Transformers
- Foundation Transformers
- A Generalist Agent (a multi-modal, multi-task, multi-embodiment generalist agent; from DeepMind)
- FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
- Flamingo: a Visual Language Model for Few-Shot Learning
- MetaLM: Language Models are General-Purpose Interfaces
- Point-E: A System for Generating 3D Point Clouds from Complex Prompts (leverages a text-to-image diffusion model; from OpenAI)
- Image Segmentation Using Text and Image Prompts
- Unifying Flow, Stereo and Depth Estimation
- PaLI: A Jointly-Scaled Multilingual Language-Image Model
- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- SLIP: Self-supervision meets Language-Image Pre-training
- GLIPv2: Unifying Localization and VL Understanding
- GLIP: Grounded Language-Image Pre-training
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
- PaLM: Scaling Language Modeling with Pathways
- CoCa: Contrastive Captioners are Image-Text Foundation Models
- Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
- A Unified Sequence Interface for Vision Tasks
- Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench: a 204-task extremely difficult and diverse benchmark for LLMs; 444 authors from 132 institutions)
- CRIS: CLIP-Driven Referring Image Segmentation
- Masked Autoencoders As Spatiotemporal Learners
- Masked Autoencoders Are Scalable Vision Learners
- InstructGPT: Training language models to follow instructions with human feedback
- DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents
- Robust and Efficient Medical Imaging with Self-Supervision
- Video Swin Transformer
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation
- FLAVA: A Foundational Language And Vision Alignment Model
- Towards artificial general intelligence via a multimodal foundation model
- FILIP: Fine-Grained Interactive Language-Image Pre-Training
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- Unifying Vision-and-Language Tasks via Text Generation (from UNC-Chapel Hill)
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- UniT: Multimodal Multitask Learning with a Unified Transformer
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training (a large-scale Chinese multimodal pre-training model called BriVL; from Renmin University of China)
- Codex: Evaluating Large Language Models Trained on Code
- Florence: A New Foundation Model for Computer Vision
- DALL-E: Zero-Shot Text-to-Image Generation
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- Multimodal Few-Shot Learning with Frozen Language Models
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT: images as sequences of patches processed by self-attention blocks; ICLR, from Google)
- GPT-3: Language Models are Few-Shot Learners (shows emergent in-context learning compared with GPT-2; from OpenAI)
- UNITER: UNiversal Image-TExt Representation Learning
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- GPT-2: Language Models are Unsupervised Multitask Learners
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers (from UNC-Chapel Hill)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT: Improving Language Understanding by Generative Pre-Training
- Attention Is All You Need
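Nearly every model in the list above, from the original Transformer through GPT and ViT, is built on scaled dot-product attention. A minimal NumPy sketch (array shapes are illustrative, not from any particular paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K, V: (seq_len, d_k) arrays; returns a (seq_len, d_k) array
    where each row is a weighted average of the rows of V.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Multi-head attention simply runs several such maps in parallel on learned projections of the input and concatenates the results.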
- GPT-4 Technical Report
- GPT-3: Language Models are Few-Shot Learners (shows emergent in-context learning compared with GPT-2; from OpenAI)
- GPT-2: Language Models are Unsupervised Multitask Learners
- GPT: Improving Language Understanding by Generative Pre-Training
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [![Star](https://img.shields.io/github/stars/google-research/text-to-text-transfer-transformer.svg?style=social&label=Star)](https://github.com/google-research/text-to-text-transfer-transformer)
- BLINK: Multimodal Large Language Models Can See but Not Perceive
- CAD-Estate: Large-scale CAD Model Annotation in RGB Videos
- ImageNet: A Large-Scale Hierarchical Image Database
- FLIP: Scaling Language-Image Pre-training via Masking
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (leverages off-the-shelf frozen vision and language models; from Salesforce Research)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- SLIP: Self-supervision meets Language-Image Pre-training
- GLIP: Grounded Language-Image Pre-training
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- RegionCLIP: Region-Based Language-Image Pretraining
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
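The CLIP/ALIGN family above is trained with a symmetric contrastive (InfoNCE-style) objective over paired image and text embeddings. A minimal NumPy sketch, with batch size, embedding dimension, and temperature chosen only for illustration:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a positive pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # i-th image matches i-th text

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average of image-to-text and text-to-image classification losses
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
print(float(loss))
```

Minimizing this loss pulls matching image/text pairs together and pushes mismatched pairs apart, which is what enables the zero-shot transfer these models are known for.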
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
- SEEM: Segment Everything Everywhere All at Once (from UW-Madison, HKUST, and Microsoft)
- SAM: Segment Anything [![Star](https://img.shields.io/github/stars/facebookresearch/segment-anything.svg?style=social&label=Star)](https://github.com/facebookresearch/segment-anything)
- SegGPT: Segmenting Everything In Context
- Green AI
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models
- Managing Extreme AI Risks amid Rapid Progress
- Awesome-Diffusion-Models [![Star](https://img.shields.io/github/stars/diff-usion/Awesome-Diffusion-Models.svg?style=social&label=Star)](https://github.com/diff-usion/Awesome-Diffusion-Models)
- Awesome-Video-Diffusion-Models [![Star](https://img.shields.io/github/stars/ChenHsing/Awesome-Video-Diffusion-Models.svg?style=social&label=Star)](https://github.com/ChenHsing/Awesome-Video-Diffusion-Models)
- Awesome-Diffusion-Model-Based-Image-Editing-Methods [![Star](https://img.shields.io/github/stars/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods.svg?style=social&label=Star)](https://github.com/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods)
- Awesome-CV-Foundational-Models [![Star](https://img.shields.io/github/stars/awaisrauf/Awesome-CV-Foundational-Models.svg?style=social&label=Star)](https://github.com/awaisrauf/Awesome-CV-Foundational-Models)
- Awesome-Healthcare-Foundation-Models [![Star](https://img.shields.io/github/stars/Jianing-Qiu/Awesome-Healthcare-Foundation-Models.svg?style=social&label=Star)](https://github.com/Jianing-Qiu/Awesome-Healthcare-Foundation-Models)
- awesome-large-multimodal-agents [![Star](https://img.shields.io/github/stars/jun0wanan/awesome-large-multimodal-agents.svg?style=social&label=Star)](https://github.com/jun0wanan/awesome-large-multimodal-agents)
- Computer Vision in the Wild (CVinW) [![Star](https://img.shields.io/github/stars/Computer-Vision-in-the-Wild/CVinW_Readings.svg?style=social&label=Star)](https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings)