# Awesome-VLMs-Strawberry
A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques.

## OpenAI Docs
- [Reasoning guide](https://platform.openai.com/docs/guides/reasoning)
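The guide above covers OpenAI's reasoning models. A minimal sketch of calling one through the official `openai` Python SDK, assuming `OPENAI_API_KEY` is set in the environment; the model name and prompt below are illustrative placeholders, not prescriptions from the guide:

```python
# Minimal sketch: one call to a reasoning model via the official OpenAI SDK.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-mini",  # placeholder reasoning model; see the guide for current options
    messages=[
        {
            "role": "user",
            "content": (
                "A car ahead brakes suddenly on a wet road. "
                "List the safest sequence of driving actions and explain why."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```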

## Papers

### 2024

- [***Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving [arxiv]***](https://github.com/hustvl/Senna)
- [*Code*](https://github.com/hustvl/Senna)

- [***EMMA: End-to-End Multimodal Model for Autonomous Driving [arxiv]***](https://storage.googleapis.com/waymo-uploads/files/research/EMMA-paper.pdf)
- [*Code*]() ⚠️
- [*Blog*](https://waymo.com/blog/2024/10/introducing-emma/)

- [***VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions [ECCV2024]***](https://arxiv.org/abs/2407.12345)
- [*Code*]() ⚠️
- [*Datasets*](https://drive.google.com/file/d/1v_M_OuLnDzRo2uXyOrDfHNHbtoIcR3RA/edit)

- [***HE-Drive: Human-Like End-to-End Driving with Vision Language Models [arxiv]***](https://arxiv.org/abs/2410.05051)
- [*Code*](https://github.com/jmwang0117/HE-Drive)
- [*NuScenes-Datasets-Website*](https://www.nuscenes.org/nuscenes)
- [*OpenScene-Datasets-Tutorials*](https://github.com/OpenDriveLab/OpenScene)
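The nuScenes dataset linked above can be browsed programmatically with the official devkit. A minimal sketch, assuming `pip install nuscenes-devkit` and the v1.0-mini split extracted locally; the data root path is a placeholder:

```python
# Minimal sketch: iterate the keyframes of one nuScenes scene with the devkit.
# Assumes the v1.0-mini split is extracted under the (placeholder) dataroot below.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version="v1.0-mini", dataroot="/data/sets/nuscenes", verbose=True)

# Walk the first scene and print the front-camera image path for each keyframe.
scene = nusc.scene[0]
sample_token = scene["first_sample_token"]
while sample_token:
    sample = nusc.get("sample", sample_token)
    cam = nusc.get("sample_data", sample["data"]["CAM_FRONT"])
    print(cam["filename"])
    sample_token = sample["next"]  # empty string at the end of the scene
```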

- [***Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [ECCV2024]***](https://arxiv.org/abs/2312.03661)
- [*Code*](https://github.com/fudan-zvg/reason2drive)
- [*Datasets-GoogleDrive*](https://drive.google.com/file/d/16IInbGqEzg4UcNhTlxVA9tS6tOTi4wet/view?usp=sharing)
- [*Datasets-BaiduDisk*](https://pan.baidu.com/s/1tzAuaB42RkguYM863zo6Jw?pwd=6g94)

- [***Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving [CoRL 2024]***](https://arxiv.org/pdf/2409.06702)
- [*Code*](https://air-discover.github.io/Hint-AD/) *coming soon*
- [*Datasets*](https://air-discover.github.io/Hint-AD/) *coming soon*

- [***MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving [arxiv]***](https://arxiv.org/pdf/2409.07267)
- [*Code*](https://github.com/EMZucas/minidrive) *coming soon*
- [*Drive-LM-Datasets-Tutorials*](https://github.com/OpenDriveLab/DriveLM/tree/main/challenge)
- [*CODA-LM-Datasets-Website*](https://coda-dataset.github.io/coda-lm/)
- [*CODA-LM-Datasets-Tutorials*](https://github.com/DLUT-LYZ/CODA-LM)

- [***LMDrive: Closed-Loop End-to-End Driving with Large Language Models [CVPR2024]***](https://arxiv.org/abs/2312.07488)
- [*Code*](https://github.com/opendilab/LMDrive)
- [*LMDrive-Datasets*](https://openxlab.org.cn/datasets/deepcs233/LMDrive)

- [***CoDrivingLLM: Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-making Framework [arxiv]***](https://arxiv.org/pdf/2409.12812)
- [*Code*](https://github.com/FanGShiYuu/CoDrivingLLM)
- [*Datasets*]() ⚠️

- [***xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs [arxiv]***](https://arxiv.org/abs/2410.16267)
- [*Code*]() ⚠️
- [*Datasets*]() ⚠️

- [***VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [arxiv]***](https://arxiv.org/abs/2406.08394)
- [*Code*](https://github.com/OpenGVLab/VisionLLM/tree/main/VisionLLMv2)

- [***Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [arxiv]***](https://arxiv.org/abs/2410.08202)
- [*Code*]() ⚠️
- [*HuggingFace*](https://huggingface.co/OpenGVLab/Mono-InternVL-2B)

### 2023

- [***DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [arxiv]***](https://arxiv.org/pdf/2409.12812)
- [*Code*](https://github.com/OpenGVLab/DriveMLM)
- [*Datasets*](https://github.com/OpenGVLab/DriveMLM) *coming soon*

- [***BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models***](https://arxiv.org/abs/2301.12597)
- [*Code*](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
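BLIP-2 is also available through the Hugging Face port, which is often the quickest way to try it; the LAVIS repo linked above ships its own loaders as well. A minimal captioning sketch, assuming `pip install transformers torch pillow`; the image path is a placeholder:

```python
# Minimal sketch: image captioning with the Hugging Face port of BLIP-2.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("driving_scene.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt").to(device)

generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```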

### 2022

- [***CLIP: Learning Transferable Visual Models From Natural Language Supervision***](https://arxiv.org/pdf/2103.00020)
- [*Code*](https://github.com/openai/CLIP)
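A minimal zero-shot classification sketch with the openai/CLIP repo linked above, assuming `pip install git+https://github.com/openai/CLIP.git` plus torch and pillow; the image path and candidate labels are placeholders:

```python
# Minimal sketch: zero-shot classification of one image with openai/CLIP.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("driving_scene.jpg")).unsqueeze(0).to(device)
labels = ["a pedestrian crossing the road", "an empty highway", "a traffic jam"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)          # image-text similarity logits
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```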

- [***BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation***](https://arxiv.org/pdf/2201.12086)
- [*Code*](https://github.com/salesforce/BLIP)