https://github.com/oaklight/mango

repo for paper: MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models

Host: GitHub
URL: https://github.com/oaklight/mango
Owner: Oaklight
Created: 2023-04-22T15:48:32.000Z (about 3 years ago)
Default Branch: camera-ready
Last Pushed: 2024-06-03T00:23:48.000Z (about 2 years ago)
Last Synced: 2024-06-05T09:13:12.689Z (about 2 years ago)
Language: Python
Homepage: https://mango.ttic.edu
Size: 3.84 MB
Stars: 1
Watchers: 4
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# MANGO

Repository for the paper: *[MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models](https://arxiv.org/abs/2403.19913)*

More details can be found on our [official website](https://mango.ttic.edu).

For questions or issues, please open an [issue on GitHub](https://github.com/Oaklight/mango/issues).

## Abstract

Large language models (LLMs), such as ChatGPT and GPT-4, have shown remarkable performance in various natural language processing tasks. In this paper, we introduce **MANGO**, a benchmark to assess the ability of LLMs to perform text-based mapping and navigation.

MANGO comprises 53 mazes from a suite of text-based games. Each maze is paired with a walkthrough that covers key locations but not all paths. The benchmark involves question-answering tasks where the LLM reads the walkthrough and answers hundreds of mapping and navigation questions, such as:

- *"How should you go to the Attic from West of House?"*
- *"Where would you be if you go north and east from Cellar?"*

While these questions are simple for humans, even the state-of-the-art model GPT-4 struggles with them. Our findings indicate that strong mapping and navigation capabilities are crucial for LLMs to perform downstream tasks, such as playing text-based games.

We host the **leaderboard**, **data**, **code**, and **evaluation** tools for MANGO [here](https://mango.ttic.edu), facilitating future research in this area.
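
The two question types above can be made concrete with a minimal sketch. It uses a hypothetical three-room map that is not taken from the MANGO data; `TOY_MAP`, `find_destination`, and `find_route` are illustrative names, not part of this repository. Destination finding follows a list of actions from a start location, and route finding searches for a sequence of steps between two locations (breadth-first search here, as one simple choice).

```python
from collections import deque

# Hypothetical toy map, NOT taken from the MANGO data:
# (location, action) -> location reached by taking that action.
TOY_MAP = {
    ("driveway", "north"): "living room",
    ("living room", "south"): "driveway",
    ("living room", "east"): "kitchen",
    ("kitchen", "west"): "living room",
}


def find_destination(start, actions):
    """Follow the actions from start; return None if a move is not on the map."""
    location = start
    for action in actions:
        location = TOY_MAP.get((location, action))
        if location is None:
            return None
    return location


def find_route(start, goal):
    """Breadth-first search for a shortest list of steps from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        location, path = queue.popleft()
        if location == goal:
            return path
        for (src, action), dst in TOY_MAP.items():
            if src == location and dst not in visited:
                visited.add(dst)
                step = {"location_before": src, "action": action, "location_after": dst}
                queue.append((dst, path + [step]))
    return None  # goal is unreachable from start


print(find_destination("driveway", ["north", "east"]))  # kitchen
print([step["action"] for step in find_route("kitchen", "driveway")])  # ['west', 'south']
```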

## Setup

To set up the environment for MANGO, follow these steps:

```bash
git clone https://github.com/Oaklight/mango.git
cd mango

conda create -n mango python=3.11 -y
conda activate mango

# For evaluation
pip install -e .

# For evaluation and inference
# (quoted so that shells such as zsh do not expand the brackets)
pip install -e ".[infer]"
```

## Dataset

Our data is hosted on [Hugging Face](https://huggingface.co/mango-ttic). More information is available [here](https://oaklight.github.io/mgwb/data/).

To download the dataset for the first 70 moves of each game:

```bash
cd mango
wget https://huggingface.co/datasets/mango-ttic/data/resolve/main/data-70steps.tar.zst
zstd -d -c data-70steps.tar.zst | tar -xvf -
rm data-70steps.tar.zst
mv data-70steps data
```

Alternatively, the dataset is available in the `data` folder within this repository.

## Inference

The inference code is located in the `mango/inference/` directory. You can find additional details in the README file in that folder.

To query the `claude-instant-1` model for inference:

```bash
# Set your Anthropic API key (value intentionally left blank here)
export ANTHROPIC_API_KEY=

python mango/inference/main.py --exp_tag debug --data_folder ./data --save_folder ./results --game_name '905' --task_type 'route_finding' --model_name 'claude-instant-1'
```
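
To run the same command over several games or task types, the invocation can be assembled in Python. The following is a minimal sketch, not part of the repository: `build_inference_command` is a hypothetical helper, the flags are copied from the command above, and the task type `desti_finding` is an assumption inferred from the example folder names, so check it against the README in `mango/inference/` before relying on it.

```python
import shlex


def build_inference_command(game_name, task_type, model_name="claude-instant-1",
                            exp_tag="debug", data_folder="./data",
                            save_folder="./results"):
    """Return the argument list for one inference run (hypothetical helper)."""
    return [
        "python", "mango/inference/main.py",
        "--exp_tag", exp_tag,
        "--data_folder", data_folder,
        "--save_folder", save_folder,
        "--game_name", game_name,
        "--task_type", task_type,
        "--model_name", model_name,
    ]


games = ["905"]  # extend with other game names from your data folder
# "desti_finding" is an assumption inferred from the example folder names.
task_types = ["route_finding", "desti_finding"]

for game in games:
    for task in task_types:
        command = build_inference_command(game, task)
        # Print the command; to execute it, pass `command` to
        # subprocess.run(command, check=True) from the repository root.
        print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell-quoting problems when a game name contains unusual characters.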

## Evaluation

The evaluation script currently supports the 70-step data and the full data, except for the game `curse` (it would be a curse on your compute).

Evaluation can be performed using the script located at `mango/evaluation/scripts/evaluate.py`.

For the required output format for destination-finding evaluation, refer to the following sample:

```
/mango/examples/llm_output_example/claude-instant-1_desti_finding_debug/905/result_sample_id_1f51a779e76851bcc0bd9a9ce26ab9145349ea63f0810d7e5357b46b45c01f82.json
```

For route-finding evaluation, refer to:

```
/mango/examples/llm_output_example/claude-instant-1_route_finding_debug/905/result_sample_id_4ac913314591fb251c6b13678324b508e5cd383638938482322bd02be1718de0.json
```

Make sure the `response` field is a list of dictionaries with the required keys, such as:

```json
[{"location_before": "driveway", "action": "north", "location_after": "living room"}, ...]
```
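
Before running the evaluation, it can help to check the shape of a `response` yourself. The sketch below assumes the three keys in the sample above are the required ones; the text says "such as", so the list may not be exhaustive. `check_response` is a hypothetical helper, not part of the repository.

```python
# Keys taken from the sample above; assumed (not confirmed) to be the full set.
REQUIRED_KEYS = ("location_before", "action", "location_after")


def check_response(response):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(response, list):
        return ["response is not a list"]
    problems = []
    for index, step in enumerate(response):
        if not isinstance(step, dict):
            problems.append(f"step {index} is not a dictionary")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in step]
        if missing:
            problems.append(f"step {index} is missing keys: {missing}")
    return problems


sample = [{"location_before": "driveway", "action": "north",
           "location_after": "living room"}]
print(check_response(sample))                # []
print(check_response([{"action": "north"}]))
# ["step 0 is missing keys: ['location_before', 'location_after']"]
```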

You can customize the key names in `mango/evaluation/config.py`. For example:

```python
"location_before": "location_before"  # default entry
"location_before": "prev_location"    # customized entry
```
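
If you would rather leave the config untouched, another option is to rename the keys in your own output before evaluation. This is a minimal sketch of that alternative, not how `mango/evaluation/config.py` works; `CUSTOM_TO_DEFAULT` and `rename_keys` are hypothetical names, and `prev_location` is the custom key from the example above.

```python
# Hypothetical mapping from the key names in your own output
# to the default names expected in the `response` field.
CUSTOM_TO_DEFAULT = {"prev_location": "location_before"}


def rename_keys(response, mapping=CUSTOM_TO_DEFAULT):
    """Return a copy of response with custom key names replaced by defaults."""
    return [
        {mapping.get(key, key): value for key, value in step.items()}
        for step in response
    ]


custom = [{"prev_location": "driveway", "action": "north",
           "location_after": "living room"}]
print(rename_keys(custom))
```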

### Evaluation Examples

To check the available arguments:

```bash
mango-eval --help
```

For destination-finding:

```bash
mango-eval --mode df --rst-dir ./examples/llm_output_example/claude-instant-1_desti_finding_debug --map-dir ./data
```

For route-finding:

```bash
mango-eval --mode rf --rst-dir ./examples/llm_output_example/claude-instant-1_route_finding_debug --map-dir ./data
```
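
Both evaluations can also be driven from one Python script. The sketch below only reuses the flags and example paths shown above; `build_eval_command` and `run_all` are hypothetical helpers, and treating `df` and `rf` as the only valid modes is an assumption, so `mango-eval --help` remains the authoritative reference.

```python
import shutil
import subprocess

# Mode -> example result directory, copied from the commands above.
EXAMPLE_RUNS = {
    "df": "./examples/llm_output_example/claude-instant-1_desti_finding_debug",
    "rf": "./examples/llm_output_example/claude-instant-1_route_finding_debug",
}


def build_eval_command(mode, rst_dir, map_dir="./data"):
    """Return the mango-eval argument list for one mode (hypothetical helper)."""
    if mode not in ("df", "rf"):  # assumed set of modes; see `mango-eval --help`
        raise ValueError(f"unknown mode: {mode!r}")
    return ["mango-eval", "--mode", mode, "--rst-dir", rst_dir, "--map-dir", map_dir]


def run_all(runs=EXAMPLE_RUNS):
    """Run mango-eval once per mode, stopping at the first failure."""
    if shutil.which("mango-eval") is None:
        raise RuntimeError("mango-eval not found; run `pip install -e .` first")
    for mode, rst_dir in runs.items():
        subprocess.run(build_eval_command(mode, rst_dir), check=True)


if __name__ == "__main__":
    for mode, rst_dir in EXAMPLE_RUNS.items():
        print(" ".join(build_eval_command(mode, rst_dir)))
    # Call run_all() from the repository root to execute the evaluations.
```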

## Citation

If you use MANGO in your research, please cite our paper:

```bibtex
@misc{ding2024mango,
title={MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models},
author={Peng Ding and Jiading Fang and Peng Li and Kangrui Wang and Xiaochen Zhou and Mo Yu and Jing Li and Matthew R. Walter and Hongyuan Mei},
year={2024},
eprint={2403.19913},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
