https://github.com/mbzuai-oryx/arb

ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

arabic benchmark cot lmm reasoning


Host: GitHub
URL: https://github.com/mbzuai-oryx/arb
Owner: mbzuai-oryx
License: apache-2.0
Created: 2025-05-21T07:38:29.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-22T13:59:36.000Z (about 1 year ago)
Last Synced: 2025-05-22T15:31:24.311Z (about 1 year ago)
Topics: arabic, benchmark, cot, lmm, reasoning
Language: Python
Homepage: https://slmlah.github.io/ARB/
Size: 28.9 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


          



  





    ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark



 
 

    

  [Sara Ghaboura](https://huggingface.co/SLMLAH) ^*  

  [Ketan More](https://github.com/ketanmore2002) ^*  

  [Wafa Alghallabi](https://huggingface.co/SLMLAH)  

  [Omkar Thawakar](https://omkarthawakar.github.io)   

  


  [Jorma Laaksonen](https://scholar.google.com/citations?user=qQP6WXIAAAAJ&hl=en)  

  [Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)  

  [Salman Khan](https://scholar.google.com/citations?hl=en&user=M59O9lkAAAAJ)  

  [Rao M. Anwer](https://scholar.google.com/citations?hl=en&user=_KlvMVoAAAAJ)


   ^* Equal Contribution

  


  
  

  [![arXiv](https://img.shields.io/badge/arXiv-2505.17021-C4EAE5)](https://arxiv.org/abs/2505.17021)

  [![Our Page](https://img.shields.io/badge/Visit-Our%20Page-C5D9D9?style=flat)](https://mbzuai-oryx.github.io/ARB/)

  [![GitHub issues](https://img.shields.io/github/issues/mbzuai-oryx/ARB?color=D8EADC&label=issues&style=flat)](https://github.com/mbzuai-oryx/ARB/issues)

  [![GitHub stars](https://img.shields.io/github/stars/mbzuai-oryx/ARB?color=C7D7E3&style=flat)](https://github.com/mbzuai-oryx/ARB/stargazers)

  [![GitHub license](https://img.shields.io/github/license/mbzuai-oryx/ARB?color=C8B9A7)](https://github.com/mbzuai-oryx/ARB/blob/main/LICENSE)

  



  


  


  

  



    

 





  If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.









    

 







 

##  Latest Updates

 🔥  **[22 May 2025]** ARB, the **first** Arabic multimodal benchmark focused on step-by-step reasoning, is released.


 🤗  **[22 May 2025]** The ARB dataset is available on [HuggingFace](https://huggingface.co/datasets/MBZUAI/ARB).








  



  

##  ARB Scope and Diversity

 

ARB is the first benchmark focused on step-by-step reasoning in Arabic across both textual and visual modalities, covering 11 diverse domains spanning science, culture, OCR, and historical interpretation.






   





## 🌟 Key Features

- **1,356** multimodal samples, each with an image, Arabic question, and reasoning-based answer.

- **5,119** curated reasoning steps reflecting human logic.

- **11 diverse domains**, from visual reasoning to historical and scientific analysis.

- Verified by **native Arabic speakers** and **domain experts**.

- **Hybrid sources**: original Arabic data, high-quality translations, and synthetic samples.

- **Robust evaluation framework** for final answer accuracy and reasoning quality.

- Fully **open-source dataset** and toolkit to support research in **Arabic reasoning and multimodal AI**.




## 🏗️ ARB Construction Pipeline



   






##  ARB Collection



   






##  ARB Data Distribution over Domains



   





  

### Source Types Across Domains

| **Domain**                   | **English Bench** | **Arabic Bench** | **Human-Created** | **Synthetic** |
|------------------------------|:-----------------:|:----------------:|:-----------------:|:-------------:|
| Visual Reasoning             | ✅                | –                | –                 | –             |
| OCR & Document Analysis      | –                 | –                | ✅                | ✅            |
| Chart & Data Table (CDT)     | ✅                | ✅               | ✅                | ✅            |
| Math & Logic                 | ✅                | –                | –                 | –             |
| Social & Cultural            | ✅                | –                | –                 | –             |
| Computer Vision Perception   | ✅                | –                | –                 | –             |
| Medical Image Analysis       | ✅                | ✅               | –                 | –             |
| Scientific Reasoning         | ✅                | –                | –                 | –             |
| Agricultural Interpretation  | ✅                | –                | ✅                | ✅            |
| Remote Sensing Understanding | –                 | ✅               | –                 | –             |
| Historical & Anthropological | ✅                | –                | ✅                | ✅            |






##   Download

```python
from datasets import load_dataset

# Log in first (e.g. with `huggingface-cli login`) to access this dataset
ds = load_dataset("MBZUAI/ARB")
```
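
Once the dataset is loaded, a quick sanity check is to count samples per domain. The sketch below is only an illustration: `count_by_domain` is a hypothetical helper, not part of the ARB toolkit, and it assumes each split exposes the `domain` column listed under Dataset Structure. Split names are not stated in this README, so the usage comment simply iterates over whatever splits are returned.

```python
from collections import Counter
from typing import Iterable


def count_by_domain(domains: Iterable[str]) -> Counter:
    """Count how many samples carry each domain label (hypothetical helper)."""
    return Counter(domains)


# Usage with the `ds` object loaded above (assumes a `domain` column exists):
# for split_name, split in ds.items():
#     print(split_name, count_by_domain(split["domain"]).most_common())
```

Reading only the `domain` column avoids decoding every image just to count labels.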




##   Evaluation Protocol





  

We evaluated 12 open- and closed-source LMMs using:

- **Lexical and Semantic Similarity Scores**: BLEU, ROUGE, BERTScore.

- **Cross-lingual semantic alignment**: LaBSE.

- **Custom Rubric (Arabic):** Our curated metric rubric includes 10 factors such as faithfulness, interpretive depth, coherence, and hallucination.
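
To give a feel for what a lexical similarity score measures, here is a minimal sketch of a ROUGE-L style F1 between a reference reasoning chain and a model output. It is a simplified stand-in, not the evaluation code used for ARB: it assumes plain whitespace tokenization (no Arabic-specific normalization), and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L style F1 over whitespace tokens (simplified sketch)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Identical strings score 1.0; strings with no shared tokens score 0.0.
print(rouge_l_f1("الصورة تظهر قطة", "الصورة تظهر قطة"))
```

Scores like this reward surface overlap only, which is presumably why the protocol pairs them with semantic metrics and a rubric.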

###  LLM-as-Judge (Arabic prompt-based)

We evaluate models using:

- Step-by-step reasoning quality (coherence, informativeness, commonsense)

- Final answer accuracy

- Agreement with human raters (Krippendorff’s Alpha > 87%)
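
As a rough illustration of the agreement statistic named above, the sketch below computes Krippendorff's Alpha for nominal labels with no missing ratings. It is not the authors' evaluation code: the function name is hypothetical, and the nominal distance is an assumption, since the measurement level used for ARB is not stated here. Alpha is conventionally reported on a scale where 1 means perfect agreement, so the 87% figure presumably corresponds to 0.87.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Alpha for nominal data; `units` holds one list of ratings per item."""
    units = [u for u in units if len(u) >= 2]  # items need at least 2 ratings
    totals = Counter()
    disagreement = 0.0
    for ratings in units:
        totals.update(ratings)
        # Ordered pairs of ratings within one item that disagree
        mismatched = sum(1 for a, b in permutations(ratings, 2) if a != b)
        disagreement += mismatched / (len(ratings) - 1)
    n = sum(totals.values())
    if n < 2:
        return float("nan")
    observed = disagreement / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1 - observed / expected


# Toy example: a judge and a human rater labelling 10 answers as 0 or 1
judge = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
human = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(judge, human)])
print(round(alpha, 3))  # ≈ 0.095
```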






##  Stepwise Evaluation Results 

For Closed-Source Models:


| Metric              |   GPT-4o |   GPT-4o-mini |   GPT-4.1 |   o4-mini |   Gemini 1.5 Pro |   Gemini 2.0 Flash |
|:--------------------|---------:|--------------:|----------:|----------:|-----------------:|-------------------:|
| Final Answer (%)    |    60.22 |         52.22 |     59.43 |     58.93 |            56.7  |              57.8  |
| Reasoning Steps (%) |    64.29 |         61.02 |     80.41 |     80.75 |            64.34 |              64.09 |




For Open-Source Models:


| Metric              |   Qwen2.5-VL-7B |   Llama-3.2-11B |   AIN |   Llama-4 Scout |   Aya-Vision-8B |   InternVL3-8B |
|:--------------------|----------------:|----------------:|------:|----------------:|----------------:|---------------:|
| Final Answer (%)    |           37.02 |           25.58 | 27.35 |           48.52 |           28.81 |          31.04 |
| Reasoning Steps (%) |           64.03 |           53.2  | 52.77 |           77.7  |           63.64 |          54.5  |
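
The two tables can be compared directly with a little arithmetic. The snippet below copies the reported numbers and computes, for each model, the gap between its reasoning-step score and its final-answer accuracy. It adds no new results; the variable names are illustrative only.

```python
# (final answer %, reasoning steps %) copied from the two tables above
RESULTS = {
    "GPT-4o": (60.22, 64.29),
    "GPT-4o-mini": (52.22, 61.02),
    "GPT-4.1": (59.43, 80.41),
    "o4-mini": (58.93, 80.75),
    "Gemini 1.5 Pro": (56.70, 64.34),
    "Gemini 2.0 Flash": (57.80, 64.09),
    "Qwen2.5-VL-7B": (37.02, 64.03),
    "Llama-3.2-11B": (25.58, 53.20),
    "AIN": (27.35, 52.77),
    "Llama-4 Scout": (48.52, 77.70),
    "Aya-Vision-8B": (28.81, 63.64),
    "InternVL3-8B": (31.04, 54.50),
}

# Gap in percentage points: reasoning-step score minus final-answer accuracy
gaps = {m: round(steps - final, 2) for m, (final, steps) in RESULTS.items()}

best_final = max(RESULTS, key=lambda m: RESULTS[m][0])
best_steps = max(RESULTS, key=lambda m: RESULTS[m][1])
widest_gap = max(gaps, key=gaps.get)

print("Best final answer:", best_final, RESULTS[best_final][0])
print("Best reasoning steps:", best_steps, RESULTS[best_steps][1])
print("Widest gap:", widest_gap, gaps[widest_gap])
```

Every model in both tables scores higher on reasoning steps than on final answers, and the gap is largest for the open-source models.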




## 📂 Dataset Structure





Each sample includes:

- `image`: Visual input

- `question`: Arabic reasoning prompt

- `choices`: The choices for MCQ

- `steps`: Ordered reasoning chain

- `answer`: Final solution (Arabic)

- `domain`: One of 11 categories (e.g., OCR, Scientific, Visual, Math)

- `curriculum`: One of the 4 curricula that the step-generation prompt follows (Computational, Sci/Med, Textual/Partial, and General)
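
The field list above can be mirrored in a small typed container, which makes missing fields fail early when working with individual records. This is a sketch under assumptions: `ARBSample` and `parse_sample` are hypothetical names, and the stored types of `image`, `choices`, and `steps` are not specified in this README, so they are left untyped.

```python
from dataclasses import dataclass
from typing import Any

# Field names taken from the list above
REQUIRED_FIELDS = (
    "image", "question", "choices", "steps", "answer", "domain", "curriculum",
)


@dataclass
class ARBSample:
    image: Any        # visual input; concrete type depends on the loader
    question: str     # Arabic reasoning prompt
    choices: Any      # MCQ choices; storage format is an assumption
    steps: Any        # ordered reasoning chain; storage format is an assumption
    answer: str       # final solution (Arabic)
    domain: str       # one of the 11 categories
    curriculum: str   # one of the 4 curricula


def parse_sample(record: dict) -> ARBSample:
    """Build an ARBSample, raising KeyError if any listed field is absent."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return ARBSample(**{name: record[name] for name in REQUIRED_FIELDS})
```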










##  Citation

If you use the ARB dataset in your research, please consider citing:

```bibtex
@misc{ghaboura2025arbcomprehensivearabicmultimodal,
      title={ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark},
      author={Sara Ghaboura and Ketan More and Wafa Alghallabi and Omkar Thawakar and Jorma Laaksonen and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
      year={2025},
      eprint={2505.17021},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.17021},
}
```


