https://github.com/declare-lab/offtopiceval
- Host: GitHub
- URL: https://github.com/declare-lab/offtopiceval
- Owner: declare-lab
- License: mit
- Created: 2025-09-27T16:02:38.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-10-08T07:12:58.000Z (8 months ago)
- Last Synced: 2025-10-08T09:11:29.334Z (8 months ago)
- Language: Python
- Size: 233 MB
- Stars: 4
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
---
# OffTopicEval
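[🤗 Hugging Face](https://huggingface.co/datasets/declare-lab/OffTopicEval) | [💻 Code](https://github.com/declare-lab/offtopiceval) | [📄 Paper](https://arxiv.org/abs/2509.26495)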
This repo contains the evaluation code and dataset for the paper:
**"OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!"**
---
## News
* **[2025-09]** OffTopicEval released on [Hugging Face](https://huggingface.co/datasets/declare-lab/OffTopicEval) and GitHub.
---
## Leaderboard
### Open-weight models (English)
| Family | Model | ARID | RROODD | RROODA | OS |
| ------- | ----- | --------------- | ---------------------------- | ---------------------------- | ----- |
| Qwen-3 | Qwen3-235B-A22B-Instruct-2507 | 99.05 | 99.32 | 28.70 | 77.77 |
| Mistral | Mistral-Small-3.2-24B-Instruct-2506 | 73.14 | 99.91 | 76.44 | 79.96 |
| GPT-OSS | gpt-oss-120b | 99.32 | 80.42 | 35.82 | 73.33 |
| Phi-4 | phi-4 | 95.14 | 83.74 | 27.75 | 70.30 |
| Gemma-3 | gemma-3-27b-it | 73.71 | 94.22 | 18.21 | 63.78 |
| Llama-3 | Llama-3.3-70B-Instruct | 99.62 | 69.73 | 4.21 | 53.93 |
### Closed-weight models (English)
| Family | Model | ARID | RROODD | RROODA | OS |
| ------ | ---------- | --------------- | ---------------------------- | ---------------------------- | --------- |
| Claude | Opus 4.1 | 99.81 | 95.14 | 95.24 | **97.45** |
| Gemini | 2.5 Pro | 94.76 | 99.90 | 99.19 | **97.09** |
| GPT | GPT-5 | 99.05 | 98.38 | 63.35 | 89.04 |
| GPT | 4o-mini | 64.76 | 97.62 | 92.68 | 77.07 |
| Gemini | Flash-Lite | 96.67 | 98.86 | 37.32 | 79.90 |
| Claude | 3.5 Haiku | 99.90 | 7.90 | 77.96 | 60.05 |
---
## Overview
We introduce **OffTopicEval**, a multilingual benchmark for evaluating **operational safety** of LLM-based agents.
* **Operational Safety** = ability to accept **in-domain (ID)** queries and refuse **out-of-domain (OOD)** queries.
* **Challenge:** Even top-performing LLMs fail on adaptive OOD queries (queries rewritten to look in-domain).
* **Scale:** 21 agents and about 220K test samples in total (ID + direct OOD + adaptive OOD).
* **Languages:** English, Chinese, Hindi.

---
## Data
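* **ID queries:** 50 per agent × 3 languages = 150 per agent; across 21 agents, 3,150 in total.
* **Direct OOD queries:** ~3,351 from MMLU × 3 languages = 10,053.
* **Adaptive OOD queries:** direct OOD queries adversarially transformed to look in-domain, giving 211,113 samples (this figure equals 10,053 × 21).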
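
The counts above can be cross-checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and the per-agent multiplication for adaptive OOD queries is an assumption inferred from the fact that 10,053 × 21 equals the reported 211,113.

```python
# Figures quoted in the Data section of this README.
ID_PER_AGENT_PER_LANGUAGE = 50
NUM_LANGUAGES = 3
NUM_AGENTS = 21
DIRECT_OOD_PER_LANGUAGE = 3_351

# ID queries: 50 per agent in each of the 3 languages.
id_total = ID_PER_AGENT_PER_LANGUAGE * NUM_LANGUAGES * NUM_AGENTS

# Direct OOD queries: the same MMLU-derived pool in each language.
direct_ood_total = DIRECT_OOD_PER_LANGUAGE * NUM_LANGUAGES

# Assumption: each direct OOD query is rewritten once per agent.
adaptive_ood_total = direct_ood_total * NUM_AGENTS

total = id_total + direct_ood_total + adaptive_ood_total

print(f"ID: {id_total:,}")
print(f"Direct OOD: {direct_ood_total:,}")
print(f"Adaptive OOD: {adaptive_ood_total:,}")
print(f"Total: {total:,}")
```

Under these assumptions the three parts sum to 224,316 samples, in line with the roughly 220K figure given in the Overview.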
Data includes:
* **Direct OODs:** From filtered MMLU (factual MCQs).
* **Adaptive OODs:** Prompt-laundered using Llama-70B.
* **ID queries:** Generated by ChatGPT-5, manually verified.
* **Multilingual:** Translations from Global-MMLU (Zh, Hi).

---
## Experiments
* **20 open-weight LLMs** from six families: GPT-OSS, Llama-3, Gemma-3, Qwen-3, Mistral, Phi.
* **6 closed-weight LLMs**: GPT-5, GPT-4o-mini, Claude Opus 4.1, Claude 3.5 Haiku, Gemini 2.5 Pro, Gemini Flash-Lite.
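* **Metrics** (all reported as percentages):
  * **ARID**: Acceptance rate on ID queries.
  * **RROODD**: Refusal rate on direct OOD queries.
  * **RROODA**: Refusal rate on adaptive OOD queries.
  * **OS**: Operational safety score, the harmonic mean of ARID and RROOD, where RROOD is the average of RROODD and RROODA (this reading reproduces the OS values in the leaderboard).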
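
As an illustration of how the metrics combine, here is a minimal sketch in Python. It is not the repository's evaluation code: the function names `rate` and `operational_safety` are hypothetical, and averaging the direct and adaptive refusal rates before taking the harmonic mean is an assumption that happens to reproduce the leaderboard rows to within rounding.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Percentage of True entries, e.g. accepted ID or refused OOD queries."""
    if not flags:
        raise ValueError("need at least one outcome")
    return 100.0 * sum(flags) / len(flags)


def operational_safety(ar_id: float, rr_ood_direct: float, rr_ood_adaptive: float) -> float:
    """Harmonic mean of the ID acceptance rate and the averaged OOD refusal rate.

    Assumption: the OOD refusal rate is the plain average of the direct and
    adaptive refusal rates. All inputs and the result are percentages.
    """
    rr_ood = (rr_ood_direct + rr_ood_adaptive) / 2
    if ar_id + rr_ood == 0:
        return 0.0
    return 2 * ar_id * rr_ood / (ar_id + rr_ood)


# Toy per-query outcomes (hypothetical, for illustration only).
id_accepted = [True, True, True, False]      # 3 of 4 ID queries accepted
direct_refused = [True, True, True, True]    # all direct OOD queries refused
adaptive_refused = [True, False, False, False]  # 1 of 4 adaptive OOD queries refused

toy_os = operational_safety(
    rate(id_accepted), rate(direct_refused), rate(adaptive_refused)
)
print(f"Toy OS: {toy_os:.2f}")

# Leaderboard row for Qwen3-235B-A22B-Instruct-2507 (ARID, RROODD, RROODA).
print(f"Qwen3 OS: {operational_safety(99.05, 99.32, 28.70):.2f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot reach a high OS by refusing everything or by accepting everything.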
## Citation
If you find our work useful, please cite:
```bibtex
@article{lei2025offtopiceval,
title={OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!},
author={Lei, Jingdi and Gumma, Varun and Bhardwaj, Rishabh and Lim, Seok Min and Li, Chuan and Zadeh, Amir and Poria, Soujanya},
year={2025},
journal={arXiv preprint arXiv:2509.26495},
url={https://arxiv.org/abs/2509.26495}
}
```