awesome-vlms-strawberry
A collection of VLM papers, blogs, and projects, with a focus on VLMs for autonomous driving and related reasoning techniques.
https://github.com/thomasvonwu/awesome-vlms-strawberry
Papers
2024
- ***EMMA: End-to-End Multimodal Model for Autonomous Driving*** [arXiv]
    - *Blog*
- ***VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions*** [ECCV 2024]
    - *Datasets*
- ***Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving*** [CoRL 2024]
    - *Datasets-GoogleDisk*
    - *Datasets-BaiduDisk*
- ***Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving*** [arXiv]
    - *Code*
- ***HE-Drive: Human-Like End-to-End Driving with Vision Language Models*** [arXiv]
    - *Code*
    - *NuScenes-Datasets-Website*
    - *OpenScene-Datasets-Tutorials*
- ***Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving*** [ECCV 2024]
    - *Code*
- ***XGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs*** [arXiv]
    - *Code*
- ***MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving*** [arXiv]
    - *Code*
    - *Drive-LM-Datasets-Tutorials*
    - *CODA-LM-Datasets-Website*
    - *CODA-LM-Datasets-Tutorials*
- ***LMDrive: Closed-Loop End-to-End Driving with Large Language Models*** [CVPR 2024]
    - *Code*
    - *LMDrive-Datasets*
- ***VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks*** [arXiv]
    - *Code*
- ***Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training*** [arXiv]
    - *HuggingFace*
    - *Datasets*
2023
- ***BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models*** (see the usage sketch after this list)
    - *Code*
    - *Datasets*
- ***DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving*** [arXiv]
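As a quick orientation for readers new to BLIP-2, here is a minimal image-captioning sketch using the Hugging Face Transformers API. The checkpoint name (`Salesforce/blip2-opt-2.7b`) and the sample image URL are illustrative assumptions, not resources from this list.

```python
# Minimal BLIP-2 captioning sketch via Hugging Face Transformers.
# Checkpoint and image URL below are illustrative assumptions.
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Any RGB image works; this COCO image is only an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# The frozen image encoder yields visual tokens; the frozen LLM decodes a caption.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```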
2022
- ***CLIP: Learning Transferable Visual Models From Natural Language Supervision*** (see the zero-shot sketch after this list)
    - *Code*
- ***BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation***
    - *Code*
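Similarly, here is a minimal zero-shot classification sketch with CLIP through Hugging Face Transformers. The checkpoint name, candidate labels, and image URL are illustrative assumptions rather than anything prescribed by the paper.

```python
# Minimal CLIP zero-shot classification sketch via Hugging Face Transformers.
# Checkpoint, labels, and image URL below are illustrative assumptions.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode the image and candidate captions jointly, then rank captions by similarity.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```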