https://github.com/thomasvonwu/awesome-vlms-strawberry
A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques.
List: awesome-vlms-strawberry
- Host: GitHub
- URL: https://github.com/thomasvonwu/awesome-vlms-strawberry
- Owner: ThomasVonWu
- Created: 2024-09-25T14:52:04.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-11-14T13:38:26.000Z (6 months ago)
- Last Synced: 2024-11-14T14:37:14.066Z (6 months ago)
- Topics: llm, multimodal-learning, vision-language-transformer, vlms
- Homepage:
- Size: 756 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- ultimate-awesome - awesome-vlms-strawberry - A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques. (Other Lists / Julia Lists)
README
# Awesome-VLMs-Strawberry
A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques.

## OpenAI Docs
- [https://platform.openai.com/docs/guides/reasoning](https://platform.openai.com/docs/guides/reasoning)
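A minimal sketch of calling a reasoning model through the official `openai` Python SDK, in the spirit of the guide linked above. The model name `o1-mini` and the driving prompt are illustrative placeholders, not part of this list.

```python
# Sketch: querying an OpenAI reasoning model via the official SDK.
# Assumptions: `pip install openai`, OPENAI_API_KEY set in the environment,
# and "o1-mini" used purely as an illustrative reasoning-capable model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-mini",
    messages=[
        {
            "role": "user",
            "content": (
                "The car ahead brakes hard in light rain. "
                "Reason through a safe response for an autonomous vehicle."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```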
## Papers
### 2024
- [***Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving [arxiv]***](https://github.com/hustvl/Senna)
  - [*Code*](https://github.com/hustvl/Senna)
- [***EMMA: End-to-End Multimodal Model for Autonomous Driving [arxiv]***](https://storage.googleapis.com/waymo-uploads/files/research/EMMA-paper.pdf)
  - [*Code*]() ⚠️
  - [*Blog*](https://waymo.com/blog/2024/10/introducing-emma/)
- [***VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions [ECCV2024]***](https://arxiv.org/abs/2407.12345)
  - [*Code*]() ⚠️
  - [*Datasets*](https://drive.google.com/file/d/1v_M_OuLnDzRo2uXyOrDfHNHbtoIcR3RA/edit)
- [***HE-Drive: Human-Like End-to-End Driving with Vision Language Models [arxiv]***](https://arxiv.org/abs/2410.05051)
  - [*Code*](https://github.com/jmwang0117/HE-Drive)
  - [*NuScenes-Datasets-Website*](https://www.nuscenes.org/nuscenes)
  - [*OpenScene-Datasets-Tutorials*](https://github.com/OpenDriveLab/OpenScene)
- [***Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [ECCV2024]***](https://arxiv.org/abs/2312.03661)
  - [*Code*](https://github.com/fudan-zvg/reason2drive)
  - [*Datasets-GoogleDisk*](https://drive.google.com/file/d/16IInbGqEzg4UcNhTlxVA9tS6tOTi4wet/view?usp=sharing)
  - [*Datasets-BaiduDisk*](https://pan.baidu.com/s/1tzAuaB42RkguYM863zo6Jw?pwd=6g94)
- [***Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving [CoRL 2024]***](https://arxiv.org/pdf/2409.06702)
  - [*Code*](https://air-discover.github.io/Hint-AD/) *coming soon*
  - [*Datasets*](https://air-discover.github.io/Hint-AD/) *coming soon*
- [***MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving [arxiv]***](https://arxiv.org/pdf/2409.07267)
  - [*Code*](https://github.com/EMZucas/minidrive) *coming soon*
  - [*Drive-LM-Datasets-Tutorials*](https://github.com/OpenDriveLab/DriveLM/tree/main/challenge)
  - [*CODA-LM-Datasets-Website*](https://coda-dataset.github.io/coda-lm/)
  - [*CODA-LM-Datasets-Tutorials*](https://github.com/DLUT-LYZ/CODA-LM)
- [***LMDrive: Closed-Loop End-to-End Driving with Large Language Models [CVPR2024]***](https://arxiv.org/abs/2312.07488)
  - [*Code*](https://github.com/opendilab/LMDrive)
  - [*LMDrive-Datasets*](https://openxlab.org.cn/datasets/deepcs233/LMDrive)
- [***CoDrivingLLM: Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-making Framework [arxiv]***](https://arxiv.org/pdf/2409.12812)
  - [*Code*](https://github.com/FanGShiYuu/CoDrivingLLM)
  - [*Datasets*]() ⚠️
- [**xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs [arxiv]**](https://arxiv.org/abs/2410.16267)
  - [*Code*]() ⚠️
  - [*Datasets*]() ⚠️
- [**VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [arxiv]**](https://arxiv.org/abs/2406.08394)
  - [*Code*](https://github.com/OpenGVLab/VisionLLM/tree/main/VisionLLMv2)
- [**Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [arxiv]**](https://arxiv.org/abs/2410.08202)
  - [*Code*]() ⚠️
  - [*HuggingFace*](https://huggingface.co/OpenGVLab/Mono-InternVL-2B) (loading sketch below)
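A minimal loading sketch for the Mono-InternVL-2B checkpoint linked above, assuming it follows the usual InternVL-family `trust_remote_code` pattern on Hugging Face; the exact inference/chat interface is defined by the model card, not here.

```python
# Sketch: loading Mono-InternVL-2B from Hugging Face.
# Assumption: the checkpoint ships its own modeling code (trust_remote_code),
# as InternVL-family releases typically do; see the model card for the
# exact generation/chat API.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mono-InternVL-2B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
# Inference then goes through the interface documented on the model card
# linked in the entry above.
```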
### 2023
- [***DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [arxiv]***](https://arxiv.org/abs/2312.09245)
  - [*Code*](https://github.com/OpenGVLab/DriveMLM)
  - [*Datasets*](https://github.com/OpenGVLab/DriveMLM) *coming soon*
- [***BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models***](https://arxiv.org/abs/2301.12597)
  - [*Code*](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) (usage sketch below)
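A usage sketch for BLIP-2 through the LAVIS library linked above, following its documented `load_model_and_preprocess` pattern. The `blip2_opt` / `pretrain_opt2.7b` names are one of the configurations LAVIS ships; the image path and prompt are placeholders.

```python
# Sketch: BLIP-2 prompted image-to-text via Salesforce LAVIS.
# Assumptions: `pip install salesforce-lavis`; "street.jpg" is a placeholder image.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("street.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Prompted generation, following the LAVIS BLIP-2 examples.
answer = model.generate({"image": image, "prompt": "Question: what is in the scene? Answer:"})
print(answer)
```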
### 2022
- [***CLIP: Learning Transferable Visual Models From Natural Language Supervision***](https://arxiv.org/pdf/2103.00020)
  - [*Code*](https://github.com/openai/CLIP) (zero-shot sketch below)
- [***BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation***](https://arxiv.org/pdf/2201.12086)
  - [*Code*](https://github.com/salesforce/BLIP)
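Finally, a zero-shot classification sketch using the openai/CLIP repo linked in the CLIP entry above (install with `pip install git+https://github.com/openai/CLIP.git`); the image path and candidate labels are placeholders.

```python
# Zero-shot image classification with OpenAI's CLIP, per the repo's README usage.
# Assumption: "scene.jpg" and the candidate labels are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a highway scene", "a parking lot", "an intersection"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate caption.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
```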