Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ai4ce/llm4vpr
Can multimodal LLM help visual place recognition?
- Host: GitHub
- URL: https://github.com/ai4ce/llm4vpr
- Owner: ai4ce
- Created: 2024-06-21T03:48:53.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-06-26T07:37:04.000Z (7 months ago)
- Last Synced: 2024-07-06T16:31:40.873Z (7 months ago)
- Topics: llm, robotics, vision-and-language, vision-language-model, visual-place-recognition, vpr
- Language: Python
- Homepage: https://ai4ce.github.io/LLM4VPR/
- Size: 7.92 MB
- Stars: 18
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
## *Tell Me Where You Are*: Multimodal LLMs Meet Place Recognition
[Zonglin Lyu](https://zonglinl.github.io/), [Juexiao Zhang](https://juexzz.github.io/), [Mingxuan Lu](https://scholar.google.com/citations?user=m4ChlREAAAAJ&hl=en), [Yiming Li](https://yimingli-page.github.io/), [Chen Feng](https://ai4ce.github.io/)

![image](./misc/images/Teaser.jpg)
### Abstract
Large language models (LLMs) exhibit a variety of promising capabilities in robotics,
including long-horizon planning and commonsense reasoning.
However, their performance in place recognition is still underexplored.
In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR),
where a robot must localize itself using visual observations.
Our key design is to use *vision-based retrieval* to propose several candidates and then leverage *language-based reasoning*
to carefully inspect each candidate for a final decision.
Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations.
We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner,
and reason about the best candidate based on these descriptions. Our method is termed **LLM-VPR**.
Results on three datasets demonstrate that integrating the *general-purpose visual features* from VFMs with the *reasoning capabilities* of MLLMs
already provides an effective place recognition solution, *without any VPR-specific supervised training*.
We believe LLM-VPR can inspire new possibilities for applying and designing foundation models, i.e. VFMs, LLMs, and MLLMs,
to enhance the localization and navigation of mobile robots.

![image](./misc/images/LLM-VPR.jpg)
**🔍 Please check out the [project website](https://ai4ce.github.io/LLM4VPR/) for more details.**
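For intuition, here is a minimal, self-contained sketch of the retrieve-then-rerank idea described above. All function names are hypothetical and the MLLM steps are stubbed out, so this is not the repo's implementation (see `main.py` for that).

```python
# Toy sketch of the LLM-VPR idea: VFM features propose candidates, an MLLM reranks them.
# All names here are illustrative; the real pipeline lives in this repo's main.py.
import numpy as np

def coarse_retrieval(query_feat: np.ndarray, db_feats: np.ndarray, k: int = 3) -> list[int]:
    """Rank database images by cosine similarity of their VFM descriptors and return top-k indices."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    return np.argsort(-(db @ q))[:k].tolist()

def describe_pair(query_img: str, candidate_img: str) -> str:
    """Stub: in the real pipeline an MLLM describes the differences between the two images."""
    return f"(MLLM description of {query_img} vs. {candidate_img})"

def rerank(query_img: str, candidate_imgs: list[str]) -> int:
    """Stub: in the real pipeline an (M)LLM reasons over the pairwise descriptions and picks one."""
    descriptions = [describe_pair(query_img, c) for c in candidate_imgs]
    # A real system would feed `descriptions` to the language model; here we just pick the first.
    return 0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db_feats = rng.normal(size=(100, 256))                    # stand-in for database VFM descriptors
    query_feat = db_feats[42] + 0.01 * rng.normal(size=256)   # a query close to database entry 42
    candidates = coarse_retrieval(query_feat, db_feats, k=3)
    print("coarse candidates:", candidates)                   # entry 42 should rank first
    best = rerank("query.png", [f"db_{i}.png" for i in candidates])
    print("reranked best candidate:", candidates[best])
```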
### Datasets
Please refer to [AnyLoc](https://github.com/AnyLoc/AnyLoc) for downloading the datasets. We include Baidu Mall, Pittsburgh30K, and Tokyo247.
### Vision Foundation Model
Please refer to [AnyLoc](https://github.com/AnyLoc/AnyLoc) for the vision foundation model. We use the DINO-v2-GeM descriptors from their setup.
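As a rough illustration of the coarse retrieval stage, the sketch below embeds images with an off-the-shelf DINOv2 model from `torch.hub` (CLS-token embedding) and ranks database images by cosine similarity. This is a simplification for illustration only, not AnyLoc's DINO-v2-GeM pipeline (which GeM-pools intermediate patch features); follow their repository for the exact descriptors.

```python
# Simplified coarse retrieval with an off-the-shelf VFM (DINOv2 CLS embeddings).
# Not AnyLoc's DINO-v2-GeM setup; for illustration of the retrieval step only.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),        # 224 is a multiple of the ViT patch size (14)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return a unit-norm global descriptor for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    return F.normalize(model(img), dim=-1).squeeze(0)

def topk_candidates(query_path: str, db_paths: list[str], k: int = 3) -> list[str]:
    """Rank database images by cosine similarity to the query and return the top-k paths."""
    q = embed(query_path)
    db = torch.stack([embed(p) for p in db_paths])
    idx = (db @ q).topk(min(k, len(db_paths))).indices.tolist()
    return [db_paths[i] for i in idx]
```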
Save your coarse retrieval results using the following structure (the directory names below are placeholders):
```
└──── <results_root>/
      ├──── <query_name>/
      |     ├──── Query.png
      |     ├──── Top1_True/False.png
      |     ├──── ...
```
If a retrieved candidate is correct, name its file with `True`; otherwise use `False`. This is **not** meant to tell the MLLM the answer: the suffix only makes it easier for you to compute whether the MLLM improves performance, and it is removed before the images are fed to the MLLM. A minimal helper for this naming convention is sketched below.
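The helper below is hypothetical (not part of this repo); adapt the names and image format to your own data layout.

```python
# Hypothetical helper for writing coarse-retrieval results into the layout above.
import shutil
from pathlib import Path

def save_retrieval_result(out_root: str, query_name: str, query_path: str,
                          candidate_paths: list[str], ground_truth_path: str) -> None:
    """Copy the query and its ranked candidates, encoding correctness in the file name."""
    out_dir = Path(out_root) / query_name
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(query_path, out_dir / "Query.png")
    for rank, cand in enumerate(candidate_paths, start=1):
        label = "True" if cand == ground_truth_path else "False"  # bookkeeping only
        shutil.copy(cand, out_dir / f"Top{rank}_{label}.png")     # suffix stripped before the MLLM sees it
```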
### Try Vision-Language Refiner
Add your own API keys and set the directory of the saved data in `main.py`, then run:
```
python main.py
```

This will generate `.txt` files with the descriptions and reasoning, and print the reranked Top-K to the terminal.
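If you want a sense of what the pairwise description step might look like with a generic MLLM API, the sketch below sends the query and one candidate to a vision-capable model via the OpenAI Python SDK. The model choice, prompt wording, and provider are assumptions for illustration; the prompts actually used by `main.py` may differ.

```python
# Sketch of a pairwise query-vs-candidate description call with a generic MLLM API.
# Model, prompt, and provider are illustrative; see main.py for what the repo actually does.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL so it can be attached to the prompt."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def describe_difference(query_path: str, candidate_path: str) -> str:
    """Ask the MLLM to compare the query image against one retrieved candidate."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the differences between these two images and state "
                         "whether they could show the same place."},
                {"type": "image_url", "image_url": {"url": to_data_url(query_path)}},
                {"type": "image_url", "image_url": {"url": to_data_url(candidate_path)}},
            ],
        }],
    )
    return response.choices[0].message.content
```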