https://github.com/thomasvonwu/awesome-vlms-strawberry
A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques.
List: awesome-vlms-strawberry
- Host: GitHub
- URL: https://github.com/thomasvonwu/awesome-vlms-strawberry
- Owner: ThomasVonWu
- Created: 2024-09-25T14:52:04.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-11-14T13:38:26.000Z (6 months ago)
- Last Synced: 2024-11-14T14:37:14.066Z (6 months ago)
- Topics: llm, multimodal-learning, vision-language-transformer, vlms
- Homepage:
- Size: 756 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- ultimate-awesome - awesome-vlms-strawberry - A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques. (Other Lists / Julia Lists)
README
# Awesome-VLMs-Strawberry
A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques.

## OpenAI Docs
- [https://platform.openai.com/docs/guides/reasoning](https://platform.openai.com/docs/guides/reasoning)
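A minimal sketch of calling a reasoning model through the official `openai` Python SDK, in the spirit of the guide linked above. The model name `o1-mini` and the driving prompt are illustrative placeholders, not part of this list.

```python
# Sketch: querying an OpenAI reasoning model via the official SDK.
# Assumptions: `pip install openai`, OPENAI_API_KEY set in the environment,
# and "o1-mini" used purely as an illustrative reasoning-capable model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-mini",
    messages=[
        {
            "role": "user",
            "content": (
                "The car ahead brakes hard in light rain. "
                "Reason through a safe response for an autonomous vehicle."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```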
## Papers
### 2024
- [***Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving [arxiv]***](https://github.com/hustvl/Senna)
  - [*Code*](https://github.com/hustvl/Senna)
- [***EMMA: End-to-End Multimodal Model for Autonomous Driving [arxiv]***](https://storage.googleapis.com/waymo-uploads/files/research/EMMA-paper.pdf)
  - [*Code*]() ⚠️
  - [*Blog*](https://waymo.com/blog/2024/10/introducing-emma/)
- [***VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions [ECCV2024]***](https://arxiv.org/abs/2407.12345)
  - [*Code*]() ⚠️
  - [*Datasets*](https://drive.google.com/file/d/1v_M_OuLnDzRo2uXyOrDfHNHbtoIcR3RA/edit)
- [***HE-Drive: Human-Like End-to-End Driving with Vision Language Models [arxiv]***](https://arxiv.org/abs/2410.05051)
  - [*Code*](https://github.com/jmwang0117/HE-Drive)
  - [*NuScenes-Datasets-Website*](https://www.nuscenes.org/nuscenes)
  - [*OpenScene-Datasets-Tutorials*](https://github.com/OpenDriveLab/OpenScene)
- [***Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [ECCV2024]***](https://arxiv.org/abs/2312.03661)
  - [*Code*](https://github.com/fudan-zvg/reason2drive)
  - [*Datasets-GoogleDisk*](https://drive.google.com/file/d/16IInbGqEzg4UcNhTlxVA9tS6tOTi4wet/view?usp=sharing)
  - [*Datasets-BaiduDisk*](https://pan.baidu.com/s/1tzAuaB42RkguYM863zo6Jw?pwd=6g94)
- [***Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving [CoRL 2024]***](https://arxiv.org/pdf/2409.06702)
  - [*Code*](https://air-discover.github.io/Hint-AD/) *coming soon*
  - [*Datasets*](https://air-discover.github.io/Hint-AD/) *coming soon*
- [***MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving [arxiv]***](https://arxiv.org/pdf/2409.07267)
  - [*Code*](https://github.com/EMZucas/minidrive) *coming soon*
  - [*Drive-LM-Datasets-Tutorials*](https://github.com/OpenDriveLab/DriveLM/tree/main/challenge)
  - [*CODA-LM-Datasets-Website*](https://coda-dataset.github.io/coda-lm/)
  - [*CODA-LM-Datasets-Tutorials*](https://github.com/DLUT-LYZ/CODA-LM)
- [***LMDrive: Closed-Loop End-to-End Driving with Large Language Models [CVPR2024]***](https://arxiv.org/abs/2312.07488)
  - [*Code*](https://github.com/opendilab/LMDrive)
  - [*LMDrive-Datasets*](https://openxlab.org.cn/datasets/deepcs233/LMDrive)
- [***CoDrivingLLM: Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-making Framework [arxiv]***](https://arxiv.org/pdf/2409.12812)
  - [*Code*](https://github.com/FanGShiYuu/CoDrivingLLM)
  - [*Datasets*]() ⚠️
- [**xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs [arxiv]**](https://arxiv.org/abs/2410.16267)
  - [*Code*]() ⚠️
  - [*Datasets*]() ⚠️
- [**VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [arxiv]**](https://arxiv.org/abs/2406.08394)
  - [*Code*](https://github.com/OpenGVLab/VisionLLM/tree/main/VisionLLMv2)
- [**Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [arxiv]**](https://arxiv.org/abs/2410.08202)
  - [*Code*]() ⚠️
  - [*HuggingFace*](https://huggingface.co/OpenGVLab/Mono-InternVL-2B) (loading sketch below)
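A minimal loading sketch for the Mono-InternVL-2B checkpoint linked above, assuming it follows the usual InternVL-family `trust_remote_code` pattern on Hugging Face; the exact inference/chat interface is defined by the model card, not here.

```python
# Sketch: loading Mono-InternVL-2B from Hugging Face.
# Assumption: the checkpoint ships its own modeling code (trust_remote_code),
# as InternVL-family releases typically do; see the model card for the
# exact generation/chat API.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mono-InternVL-2B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
# Inference then goes through the interface documented on the model card
# linked in the entry above.
```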
### 2023
- [***DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [arxiv]***](https://arxiv.org/abs/2312.09245)
  - [*Code*](https://github.com/OpenGVLab/DriveMLM)
  - [*Datasets*](https://github.com/OpenGVLab/DriveMLM) *coming soon*
- [***BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models***](https://arxiv.org/abs/2301.12597)
  - [*Code*](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) (usage sketch below)
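A usage sketch for BLIP-2 through the LAVIS library linked above, following its documented `load_model_and_preprocess` pattern. The `blip2_opt` / `pretrain_opt2.7b` names are one of the configurations LAVIS ships; the image path and prompt are placeholders.

```python
# Sketch: BLIP-2 prompted image-to-text via Salesforce LAVIS.
# Assumptions: `pip install salesforce-lavis`; "street.jpg" is a placeholder image.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("street.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Prompted generation, following the LAVIS BLIP-2 examples.
answer = model.generate({"image": image, "prompt": "Question: what is in the scene? Answer:"})
print(answer)
```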
### 2022
- [***CLIP: Learning Transferable Visual Models From Natural Language Supervision***](https://arxiv.org/pdf/2103.00020)
  - [*Code*](https://github.com/openai/CLIP) (zero-shot sketch below)
- [***BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation***](https://arxiv.org/pdf/2201.12086)
  - [*Code*](https://github.com/salesforce/BLIP)
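Finally, a zero-shot classification sketch using the openai/CLIP repo linked in the CLIP entry above (install with `pip install git+https://github.com/openai/CLIP.git`); the image path and candidate labels are placeholders.

```python
# Zero-shot image classification with OpenAI's CLIP, per the repo's README usage.
# Assumption: "scene.jpg" and the candidate labels are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a highway scene", "a parking lot", "an intersection"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate caption.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
```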