ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
https://github.com/mbzuai-oryx/arb





[Sara Ghaboura](https://huggingface.co/SLMLAH) *  
[Ketan More](https://github.com/ketanmore2002) *  
[Wafa Alghallabi](https://huggingface.co/SLMLAH)  
[Omkar Thawakar](https://omkarthawakar.github.io)  


[Jorma Laaksonen](https://scholar.google.com/citations?user=qQP6WXIAAAAJ&hl=en)  
[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)  
[Salman Khan](https://scholar.google.com/citations?hl=en&user=M59O9lkAAAAJ)  
[Rao M. Anwer](https://scholar.google.com/citations?hl=en&user=_KlvMVoAAAAJ)

*Equal Contribution




[![arXiv](https://img.shields.io/badge/arXiv-2505.17021-C4EAE5)](https://arxiv.org/abs/2505.17021)
[![Our Page](https://img.shields.io/badge/Visit-Our%20Page-C5D9D9?style=flat)](https://mbzuai-oryx.github.io/ARB/)
[![GitHub issues](https://img.shields.io/github/issues/mbzuai-oryx/ARB?color=D8EADC&label=issues&style=flat)](https://github.com/mbzuai-oryx/ARB/issues)
[![GitHub stars](https://img.shields.io/github/stars/mbzuai-oryx/ARB?color=C7D7E3&style=flat)](https://github.com/mbzuai-oryx/ARB/stargazers)
[![GitHub license](https://img.shields.io/github/license/mbzuai-oryx/ARB?color=C8B9A7)](https://github.com/mbzuai-oryx/ARB/blob/main/LICENSE)











If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.











## Latest Updates
🔥 **[22 May 2025]** ARB, the **first** Arabic multimodal benchmark focused on step-by-step reasoning, is released.

🤗 **[22 May 2025]** The ARB dataset is available on [HuggingFace](https://huggingface.co/datasets/MBZUAI/ARB).







## ARB Scope and Diversity

ARB is the first benchmark focused on step-by-step reasoning in Arabic across both textual and visual modalities, covering 11 diverse domains spanning science, culture, OCR, and historical interpretation.


Figure: ARB Dataset Coverage


## 🌟 Key Features

- **1,356** multimodal samples, each with an image, an Arabic question, and a reasoning-based answer.
- **5,119** curated reasoning steps reflecting human logic.
- **11 diverse domains**, from visual reasoning to historical and scientific analysis.
- Verified by **native Arabic speakers** and **domain experts**.
- **Hybrid sources**: original Arabic data, high-quality translations, and synthetic samples.
- **Robust evaluation framework** for final answer accuracy and reasoning quality.
- Fully **open-source dataset** and toolkit to support research in **Arabic reasoning and multimodal AI**.


## 🏗️ ARB Construction Pipeline


Figure: ARB Pipeline Overview


## ARB Collection


Figure: ARB Collection



## ARB Data Distribution over Domains


Figure: ARB Data Distribution over Domains



### Source Types Across Domains

| **Domain** | **English Bench** | **Arabic Bench** | **Human-Created** | **Synthetic** |
|---------------------------|:-----------------:|:----------------:|:-----------------:|:-------------:|
| Visual Reasoning | ✅ | – | – | – |
| OCR & Document Analysis | – | – | ✅ | ✅ |
| Chart & Data Table (CDT) | ✅ | ✅ | ✅ | ✅ |
| Math & Logic | ✅ | – | – | – |
| Social & Cultural | ✅ | – | – | – |
| Computer Vision Perception| ✅ | – | – | – |
| Medical Image Analysis | ✅ | ✅ | – | – |
| Scientific Reasoning | ✅ | – | – | – |
| Agricultural Interpretation | ✅ | – | ✅ | ✅ |
| Remote Sensing Understanding | – | ✅ | – | – |
| Historical & Anthropological | ✅ | – | ✅ | ✅ |



## Download

```python
from datasets import load_dataset

# Log in first (e.g. with `huggingface-cli login`) to access this dataset
ds = load_dataset("MBZUAI/ARB")
```

## Evaluation Protocol




We evaluated 12 open- and closed-source LMMs using:

- **Lexical and semantic similarity scores**: BLEU, ROUGE, BERTScore.
- **Cross-lingual semantic alignment**: LaBSE.
- **Custom rubric (Arabic)**: Our curated metric rubric includes 10 factors such as faithfulness, interpretive depth, coherence, and hallucination.
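
To give a feel for what a lexical-overlap score measures, here is a minimal sketch of a clipped unigram F1, the idea behind ROUGE-1. It is an illustration only: the function name `unigram_f1`, the whitespace tokenization, and the example sentences are assumptions, and this is not the metric implementation used to produce the ARB results.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Clipped unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reasoning steps, not taken from ARB.
reference = "الخطوة الأولى هي قراءة الرسم البياني"
prediction = "الخطوة الأولى هي تحليل الرسم البياني"
score = unigram_f1(prediction, reference)
print(f"unigram F1: {score:.3f}")
```

Whitespace splitting is a simplification for Arabic, where clitics attach to words; a real evaluation would more plausibly rely on an established ROUGE implementation with Arabic-aware tokenization.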

### LLM-as-Judge (Arabic prompt-based)

We evaluate models using:

- Step-by-step reasoning quality (coherence, informativeness, commonsense)
- Final answer accuracy
- Agreement with human raters (Krippendorff’s Alpha > 87%)



## Stepwise Evaluation Results
For Closed-Source Models:

| | GPT-4o | GPT-4o-mini | GPT-4.1 | o4-mini | Gemini 1.5 Pro | Gemini 2.0 Flash |
|:--------------------|---------:|--------------:|----------:|----------:|-----------------:|-------------------:|
| Final Answer (%) | 60.22 | 52.22 | 59.43 | 58.93 | 56.7 | 57.8 |
| Reasoning Steps (%) | 64.29 | 61.02 | 80.41 | 80.75 | 64.34 | 64.09 |

For Open-Source Models:

| | Qwen2.5-VL-7B | Llama-3.2-11B | AIN | Llama-4 Scout | Aya-Vision-8B | InternVL3-8B |
|:--------------------|----------------:|----------------:|------:|----------------:|----------------:|---------------:|
| Final Answer (%) | 37.02 | 25.58 | 27.35 | 48.52 | 28.81 | 31.04 |
| Reasoning Steps (%) | 64.03 | 53.2 | 52.77 | 77.7 | 63.64 | 54.5 |
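
The two tables can be compared side by side with a few lines of Python. The sketch below copies the scores from the tables above into a dictionary (the variable names are illustrative) and reports the top model per metric and the gap between reasoning-step and final-answer scores.

```python
# Scores copied from the two tables above: (final answer %, reasoning steps %).
results = {
    "GPT-4o": (60.22, 64.29),
    "GPT-4o-mini": (52.22, 61.02),
    "GPT-4.1": (59.43, 80.41),
    "o4-mini": (58.93, 80.75),
    "Gemini 1.5 Pro": (56.7, 64.34),
    "Gemini 2.0 Flash": (57.8, 64.09),
    "Qwen2.5-VL-7B": (37.02, 64.03),
    "Llama-3.2-11B": (25.58, 53.2),
    "AIN": (27.35, 52.77),
    "Llama-4 Scout": (48.52, 77.7),
    "Aya-Vision-8B": (28.81, 63.64),
    "InternVL3-8B": (31.04, 54.5),
}

# Best model on each metric.
best_final = max(results, key=lambda m: results[m][0])
best_steps = max(results, key=lambda m: results[m][1])

# Gap between reasoning-step score and final-answer score, per model.
gaps = {m: round(steps - final, 2) for m, (final, steps) in results.items()}
widest_gap = max(gaps, key=gaps.get)

print("Best final answer:", best_final)
print("Best reasoning steps:", best_steps)
print("Widest steps-minus-answer gap:", widest_gap, gaps[widest_gap])
```

On the reported numbers, every model scores higher on reasoning steps than on final answers, and the gap is generally wider for the open-source models.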


## 📂 Dataset Structure


Each sample includes:
- `image`: Visual input
- `question`: Arabic reasoning prompt
- `choices`: The choices for MCQ
- `steps`: Ordered reasoning chain
- `answer`: Final solution (Arabic)
- `domain`: One of 11 categories (e.g., OCR, Scientific, Visual, Math)
- `curriculum`: One of the 4 curricula that the step-generation prompt follows (Computational, Sci/Med, Textual/Partial, and General)
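
As a rough illustration of how these fields fit together, the sketch below builds two hypothetical records with the field names listed above and summarizes them. The value types (for example, `steps` as a list of strings), the domain label strings, and the sample contents are assumptions, not taken from the dataset itself.

```python
from collections import Counter

# Hypothetical records mirroring the field names above. In the real dataset
# the value types (e.g. whether `steps` is a list or one string) may differ.
samples = [
    {
        "image": None,  # placeholder; the real field holds the visual input
        "question": "ما ناتج ٢ + ٣؟",
        "choices": ["٤", "٥", "٦"],
        "steps": ["نحدد العددين ٢ و ٣", "نجمع العددين"],
        "answer": "٥",
        "domain": "Math & Logic",
        "curriculum": "Computational",
    },
    {
        "image": None,
        "question": "ما العنوان المكتوب في الوثيقة؟",
        "choices": [],
        "steps": ["نحدد موضع العنوان", "نقرأ النص", "نتحقق من الإملاء"],
        "answer": "تقرير سنوي",
        "domain": "OCR & Document Analysis",
        "curriculum": "Textual/Partial",
    },
]


def summarize(records):
    """Count samples per domain and average the number of reasoning steps."""
    per_domain = Counter(r["domain"] for r in records)
    avg_steps = sum(len(r["steps"]) for r in records) / len(records)
    return per_domain, avg_steps


per_domain, avg_steps = summarize(samples)
print(dict(per_domain), avg_steps)
```

For reference, the counts under Key Features (5,119 steps over 1,356 samples) imply about 3.8 reasoning steps per sample on average.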



## Citation
If you use the ARB dataset in your research, please consider citing:

```bibtex
@misc{ghaboura2025arbcomprehensivearabicmultimodal,
title={ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark},
author={Sara Ghaboura and Ketan More and Wafa Alghallabi and Omkar Thawakar and Jorma Laaksonen and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
year={2025},
eprint={2505.17021},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.17021},
}
```

---