https://github.com/mbzuai-oryx/arb
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
- Host: GitHub
- URL: https://github.com/mbzuai-oryx/arb
- Owner: mbzuai-oryx
- License: apache-2.0
- Created: 2025-05-21T07:38:29.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-22T13:59:36.000Z (about 1 year ago)
- Last Synced: 2025-05-22T15:31:24.311Z (about 1 year ago)
- Topics: arabic, benchmark, cot, lmm, reasoning
- Language: Python
- Homepage: https://slmlah.github.io/ARB/
- Size: 28.9 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
[Sara Ghaboura](https://huggingface.co/SLMLAH) *
[Ketan More](https://github.com/ketanmore2002) *
[Wafa Alghallabi](https://huggingface.co/SLMLAH)
[Omkar Thawakar](https://omkarthawakar.github.io)
[Jorma Laaksonen](https://scholar.google.com/citations?user=qQP6WXIAAAAJ&hl=en)
[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)
[Salman Khan](https://scholar.google.com/citations?hl=en&user=M59O9lkAAAAJ)
[Rao M. Anwer](https://scholar.google.com/citations?hl=en&user=_KlvMVoAAAAJ)
*Equal Contribution
[arXiv](https://arxiv.org/abs/2505.17021)
[Project Page](https://mbzuai-oryx.github.io/ARB/)
[Issues](https://github.com/mbzuai-oryx/ARB/issues)
[Stars](https://github.com/mbzuai-oryx/ARB/stargazers)
[License](https://github.com/mbzuai-oryx/ARB/blob/main/LICENSE)
If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.
## Latest Updates
🔥 **[22 May 2025]** ARB, the **first** Arabic multimodal benchmark focused on step-by-step reasoning, is released.
🤗 **[22 May 2025]** ARB dataset available on [HuggingFace](https://huggingface.co/datasets/MBZUAI/ARB).
## ARB Scope and Diversity
ARB is the first benchmark focused on step-by-step reasoning in Arabic across both textual and visual modalities, covering 11 diverse domains spanning science, culture, OCR, and historical interpretation.
## 🌟 Key Features
- **1,356** multimodal samples, each with an image, Arabic question, and reasoning-based answer.
- **5,119** curated reasoning steps reflecting human logic.
- **11 diverse domains**, from visual reasoning to historical and scientific analysis.
- Verified by **native Arabic speakers** and **domain experts**.
- **Hybrid sources**: original Arabic data, high-quality translations, and synthetic samples.
- **Robust evaluation framework** for final answer accuracy and reasoning quality.
- Fully **open-source dataset** and toolkit to support research in **Arabic reasoning and multimodal AI**.
## 🏗️ ARB Construction Pipeline
## ARB Collection

## ARB Data Distribution over Domains
### Source Types Across Domains
| **Domain** | **English Bench** | **Arabic Bench** | **Human-Created** | **Synthetic** |
|---------------------------|:-----------------:|:----------------:|:-----------------:|:-------------:|
| Visual Reasoning | ✅ | – | – | – |
| OCR & Document Analysis | – | – | ✅ | ✅ |
| Chart & Data Table (CDT) | ✅ | ✅ | ✅ | ✅ |
| Math & Logic | ✅ | – | – | – |
| Social & Cultural | ✅ | – | – | – |
| Computer Vision Perception| ✅ | – | – | – |
| Medical Image Analysis | ✅ | ✅ | – | – |
| Scientific Reasoning | ✅ | – | – | – |
| Agricultural Interpretation | ✅ | – | ✅ | ✅ |
| Remote Sensing Understanding | – | ✅ | – | – |
| Historical & Anthropological | ✅ | – | ✅ | ✅ |
## Download
```python
from datasets import load_dataset

# Log in first (e.g. `huggingface-cli login`) to access this dataset
ds = load_dataset("MBZUAI/ARB")
```
## Evaluation Protocol
We evaluated 12 open- and closed-source LMMs using:
- **Lexical and Semantic Similarity Scores**: BLEU, ROUGE, BERTScore.
- **Cross-lingual semantic alignment**: LaBSE.
- **Custom Rubric (Arabic)**: Our curated metric rubric includes 10 factors such as faithfulness, interpretive depth, coherence, and hallucination.
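To make the lexical-similarity idea concrete, here is a minimal sketch of a ROUGE-1-style unigram F1 between a model's answer and a reference. It is only an illustration and not the scoring code used for ARB: the function name `unigram_f1` is hypothetical, tokenization is plain whitespace splitting, and no stemming or Arabic-specific normalization is applied.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """ROUGE-1-style F1 over whitespace tokens (illustrative only)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Published numbers should come from established metric implementations; this sketch only shows what "lexical overlap" measures and why it can miss paraphrases that semantic metrics such as BERTScore or LaBSE are meant to catch.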
### LLM-as-Judge (Arabic prompt-based)
We evaluate models using:
- Step-by-step reasoning quality (coherence, informativeness, commonsense)
- Final answer accuracy
- Agreement with human raters (Krippendorff’s Alpha > 87%)
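ARB scores final answers with an LLM judge. As a rough sanity check alongside that, a normalized exact-match accuracy can be sketched as below. This is a simplified proxy under stated assumptions, not the benchmark's judging procedure: the helper names `normalize_arabic` and `final_answer_accuracy` are hypothetical, and the normalization rules (drop diacritics and tatweel, unify alef variants, collapse whitespace) are a common convention chosen here for illustration.

```python
import re

# Arabic diacritics (U+064B-U+0652), superscript alef (U+0670), tatweel (U+0640).
_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Alef with hamza above/below and alef with madda, mapped to bare alef (U+0627).
_ALEF_VARIANTS = re.compile("[\u0623\u0625\u0622]")


def normalize_arabic(text: str) -> str:
    """Light normalization so that spelling variants compare equal (assumed rules)."""
    text = _MARKS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return " ".join(text.split())


def final_answer_accuracy(predictions, references) -> float:
    """Percentage of predictions matching the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize_arabic(p) == normalize_arabic(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```

Exact match undercounts correct free-form answers, which is one reason a judge model is used for the reported scores.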
## Stepwise Evaluation Results
For Closed-Source Models:
| | GPT-4o | GPT-4o-mini | GPT-4.1 | o4-mini | Gemini 1.5 Pro | Gemini 2.0 Flash |
|:--------------------|---------:|--------------:|----------:|----------:|-----------------:|-------------------:|
| Final Answer (%) | 60.22 | 52.22 | 59.43 | 58.93 | 56.7 | 57.8 |
| Reasoning Steps (%) | 64.29 | 61.02 | 80.41 | 80.75 | 64.34 | 64.09 |
For Open-Source Models:
| | Qwen2.5-VL-7B | Llama-3.2-11B | AIN | Llama-4 Scout | Aya-Vision-8B | InternVL3-8B |
|:--------------------|----------------:|----------------:|------:|----------------:|----------------:|---------------:|
| Final Answer (%) | 37.02 | 25.58 | 27.35 | 48.52 | 28.81 | 31.04 |
| Reasoning Steps (%) | 64.03 | 53.2 | 52.77 | 77.7 | 63.64 | 54.5 |
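The two tables can be compared programmatically. The sketch below copies the reported scores into a dictionary and computes, for each model, the gap between reasoning-step quality and final-answer accuracy. The names `SCORES` and `reasoning_gap` are introduced here for illustration only; the numbers are taken verbatim from the tables above.

```python
# (final answer %, reasoning steps %) as reported in the tables above.
SCORES = {
    "GPT-4o": (60.22, 64.29),
    "GPT-4o-mini": (52.22, 61.02),
    "GPT-4.1": (59.43, 80.41),
    "o4-mini": (58.93, 80.75),
    "Gemini 1.5 Pro": (56.7, 64.34),
    "Gemini 2.0 Flash": (57.8, 64.09),
    "Qwen2.5-VL-7B": (37.02, 64.03),
    "Llama-3.2-11B": (25.58, 53.2),
    "AIN": (27.35, 52.77),
    "Llama-4 Scout": (48.52, 77.7),
    "Aya-Vision-8B": (28.81, 63.64),
    "InternVL3-8B": (31.04, 54.5),
}


def reasoning_gap(scores):
    """Return {model: reasoning_steps - final_answer}, rounded to 2 decimals."""
    return {model: round(steps - final, 2) for model, (final, steps) in scores.items()}


best_final = max(SCORES, key=lambda m: SCORES[m][0])  # highest final-answer accuracy
best_steps = max(SCORES, key=lambda m: SCORES[m][1])  # highest reasoning-step score
gaps = reasoning_gap(SCORES)
```

In these tables every model scores higher on reasoning steps than on final answers, and the gap is generally wider for the open-source models.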
## 📂 Dataset Structure
Each sample includes:
- `image`: Visual input
- `question`: Arabic reasoning prompt
- `choices`: The choices for MCQ
- `steps`: Ordered reasoning chain
- `answer`: Final solution (Arabic)
- `domain`: One of 11 categories (e.g., OCR, Scientific, Visual, Math)
- `curriculum`: One of the 4 curricula that the prompt follows when generating the steps (Computational, Sci/Med, Textual/Partial, and General)
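As a rough illustration of working with these fields, the sketch below counts samples per domain and averages the length of the reasoning chain. It rests on assumptions: `summarize` is a hypothetical helper, the toy records are placeholders rather than real ARB samples, and `steps` is assumed to be a list of strings (if the released dataset stores steps as a single string, it would need splitting first).

```python
from collections import Counter


def summarize(samples):
    """Return (samples per domain, mean number of reasoning steps)."""
    domain_counts = Counter(sample["domain"] for sample in samples)
    total_steps = sum(len(sample["steps"]) for sample in samples)
    mean_steps = total_steps / len(samples) if samples else 0.0
    return domain_counts, mean_steps


# Placeholder records that mimic the documented fields (not real ARB data).
toy_samples = [
    {"domain": "Math & Logic", "steps": ["s1", "s2", "s3"], "answer": "a1"},
    {"domain": "Math & Logic", "steps": ["s1", "s2"], "answer": "a2"},
    {"domain": "OCR & Document Analysis", "steps": ["s1", "s2", "s3", "s4"], "answer": "a3"},
]

counts, mean_steps = summarize(toy_samples)
```

The same function should accept any iterable of records that has been materialized into a list. For reference, the headline figures above (5,119 steps over 1,356 samples) imply roughly 3.8 steps per sample on average.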
## Citation
If you use the ARB dataset in your research, please consider citing:
```bibtex
@misc{ghaboura2025arbcomprehensivearabicmultimodal,
title={ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark},
author={Sara Ghaboura and Ketan More and Wafa Alghallabi and Omkar Thawakar and Jorma Laaksonen and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
year={2025},
eprint={2505.17021},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.17021},
}
```
---