*BALROG Agent (logo)*

---

# BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
BALROG is a novel benchmark that evaluates the agentic capabilities of LLMs and VLMs on long-horizon interactive tasks built on reinforcement learning environments. Check out how current models fare on our [leaderboard](https://balrogai.com). You can read more about BALROG in our [paper](https://arxiv.org/abs/2411.13543).

## Features
- Comprehensive evaluation of agentic abilities
- Support for both language and vision-language models
- Integration with popular AI APIs and local deployment
- Easy integration of custom agents, new environments, and new models

## Installation
We advise using conda for the installation:
```bash
# Create and activate a fresh conda environment
conda create -n balrog python=3.10 -y
conda activate balrog

# Install BALROG in editable mode, then run its post-install setup
git clone https://github.com/balrog-ai/BALROG.git
cd BALROG
pip install -e .
balrog-post-install
```
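
To confirm the installation succeeded, a quick import check can be run (a minimal sanity check; the `balrog` package name is assumed from the repository layout):

```bash
# The package should import without errors after `pip install -e .`
python -c "import balrog; print('BALROG installed OK')"
```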

## Docker
We provide prebuilt Docker images; please see the [Docker README](docker/README.md) for details.
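
As a rough sketch only (the image name and tag below are placeholders, not the published ones; consult [docker/README.md](docker/README.md) for the real values), a containerized session might look like:

```bash
# Placeholder image name; see docker/README.md for the actual images
docker pull balrog-ai/balrog:latest
# GPU passthrough requires the NVIDIA Container Toolkit
docker run -it --gpus all balrog-ai/balrog:latest bash
```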

## ⚡️ Evaluate using vLLM locally
We support running LLMs/VLMs locally using [vLLM](https://github.com/vllm-project/vllm). You can spin up a vLLM server and evaluate your agent on BALROG in the following way:

```bash
# Install vLLM (numpy pinned as per the BALROG setup instructions)
pip install vllm numpy==1.23

# Serve the model locally on port 8080 (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8080

# Point the BALROG vLLM client at the local server
python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  agent.max_history=16 \
  eval.num_workers=32 \
  client.client_name=vllm \
  client.model_id=meta-llama/Llama-3.2-1B-Instruct \
  client.base_url=http://0.0.0.0:8080/v1
```
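
Since vLLM exposes an OpenAI-compatible API, you can confirm the server is up before launching the evaluation:

```bash
# Lists the models served by the local vLLM instance
curl http://0.0.0.0:8080/v1/models
```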

Check out [vLLM](https://github.com/vllm-project/vllm) for more options on how to serve your models fast and efficiently.

## 🛜 Evaluate using popular APIs
We support out-of-the-box clients for the OpenAI, Anthropic, and Google Gemini APIs. First, set the API key for the provider you want to use:

```bash
export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
export GEMINI_API_KEY=
```

Then run the evaluation with:

```bash
python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  eval.num_workers=64 \
  client.client_name=openai \
  client.model_id=gpt-4o-mini-2024-07-18
```
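
Evaluation options are passed as dot-path `key=value` overrides on the command line, so switching providers is a matter of changing the client options. For example, an Anthropic run might look like the following (the `claude` client name and the model id here are assumptions; check BALROG's client configuration for the exact values):

```bash
# Assumes ANTHROPIC_API_KEY is set; client_name/model_id are illustrative
python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  eval.num_workers=16 \
  client.client_name=claude \
  client.model_id=claude-3-5-sonnet-20241022
```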

## Documentation
- [Evaluation Guide](https://github.com/balrog-ai/BALROG/blob/main/docs/evaluation.md) - Detailed instructions for various evaluation scenarios
- [Agent Development](https://github.com/balrog-ai/BALROG/blob/main/docs/agents.md) - Tutorial on creating custom agents
- [Few-Shot Learning](https://github.com/balrog-ai/BALROG/blob/main/docs/few_shot_learning.md) - Instructions for running few-shot learning evaluations

We welcome contributions! Please see our [Contributing Guidelines](https://github.com/balrog-ai/BALROG/blob/main/docs/contribution.md) for details.

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation
If you use BALROG in any of your work, please cite:

```bibtex
@article{paglieri2024balrog,
  title={Benchmarking Agentic LLM and VLM Reasoning On Games},
  author={Paglieri, Davide and Cupia{\l}, Bart{\l}omiej and Coward, Sam and Piterbarg, Ulyana and Wo{\l}czyk, Maciej and Khan, Akbir and Pignatelli, Eduardo and Kuci{\'n}ski, {\L}ukasz and Pinto, Lerrel and Fergus, Rob and Foerster, Jakob Nicolaus and Parker-Holder, Jack and Rockt{\"a}schel, Tim},
  journal={arXiv preprint arXiv:2411.13543},
  year={2024}
}
```