{"id":31956765,"url":"https://github.com/declare-lab/offtopiceval","last_synced_at":"2025-10-14T14:53:17.176Z","repository":{"id":317470934,"uuid":"1065382802","full_name":"declare-lab/OffTopicEval","owner":"declare-lab","description":null,"archived":false,"fork":false,"pushed_at":"2025-10-08T07:12:58.000Z","size":244588,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-08T09:11:29.334Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/declare-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-27T16:02:38.000Z","updated_at":"2025-10-08T07:13:01.000Z","dependencies_parsed_at":"2025-10-01T05:30:25.866Z","dependency_job_id":"8ccc7417-96d4-462b-bd45-21cba33af992","html_url":"https://github.com/declare-lab/OffTopicEval","commit_stats":null,"previous_names":["declare-lab/offtopiceval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/declare-lab/OffTopicEval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FOffTopicEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FOffTopicEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FOffTopicEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/declare-lab%2FOffTopicEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/declare-lab","download_url":"https://codeload.github.com/declare-lab/OffTopicEval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FOffTopicEval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279019140,"owners_count":26086685,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-14T14:53:08.102Z","updated_at":"2025-10-14T14:53:17.170Z","avatar_url":"https://github.com/declare-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n---\n\n# OffTopicEval\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"resources/title_v2.png\" style=\"width: 95%;\" id=\"title-icon\"\u003e \u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  🤗 \u003ca href=\"https://huggingface.co/datasets/declare-lab/OffTopicEval\" target=\"_blank\"\u003eHugging Face\u003c/a\u003e \u0026nbsp; | \u0026nbsp;\n  💻 \u003ca href=\"https://github.com/declare-lab/OffTopicEval\" target=\"_blank\"\u003eCode\u003c/a\u003e \u0026nbsp; | \u0026nbsp;\n  📄 \u003ca href=\"https://arxiv.org/abs/2509.26495\" target=\"_blank\"\u003ePaper\u003c/a\u003e \n\u003c/p\u003e\n\nThis 
repo contains the evaluation code and dataset for the paper:\n**\"OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!\"**\n\n---\n\n## News\n\n* **[2025-09]** OffTopicEval released on [Hugging Face](https://huggingface.co/datasets/declare-lab/OffTopicEval) and GitHub.\n\n---\n\n## Leaderboard\n\n### Open-weight models (English)\n\n| Family  | Model | AR\u003csub\u003eID\u003c/sub\u003e | RR\u003csub\u003eOOD\u003c/sub\u003e\u003csup\u003eD\u003c/sup\u003e | RR\u003csub\u003eOOD\u003c/sub\u003e\u003csup\u003eA\u003c/sup\u003e | OS    |\n| ------- | ----- | --------------- | ---------------------------- | ---------------------------- | ----- |\n| Qwen-3  | Qwen3-235B-A22B-Instruct-2507  | 99.05           | 99.32                        | 28.70                        | 77.77 |\n| Mistral | Mistral-Small-3.2-24B-Instruct-2506   | 73.14           | 99.91                        | 76.44                        | 79.96 |\n| GPT-OSS | gpt-oss-120b  | 99.32           | 80.42                        | 35.82                        | 73.33 |\n| Phi-4   | phi-4   | 95.14           | 83.74                        | 27.75                        | 70.30 |\n| Gemma-3 | gemma-3-27b-it   | 73.71           | 94.22                        | 18.21                        | 63.78 |\n| Llama-3 | Llama-3.3-70B-Instruct   | 99.62           | 69.73                        | 4.21                         | 53.93 |\n\n### Closed-weight models (English)\n\n| Family | Model      | AR\u003csub\u003eID\u003c/sub\u003e | RR\u003csub\u003eOOD\u003c/sub\u003e\u003csup\u003eD\u003c/sup\u003e | RR\u003csub\u003eOOD\u003c/sub\u003e\u003csup\u003eA\u003c/sup\u003e | OS        |\n| ------ | ---------- | --------------- | ---------------------------- | ---------------------------- | --------- |\n| Claude | Opus 4.1   | 99.81           | 95.14                        | 95.24                        | **97.45** |\n| Gemini | 2.5 Pro    | 94.76           | 99.90                        | 99.19   
                     | **97.09** |\n| GPT    | GPT-5      | 99.05           | 98.38                        | 63.35                        | 89.04     |\n| GPT    | 4o-mini    | 64.76           | 97.62                        | 92.68                        | 77.07     |\n| Gemini | Flash-Lite | 96.67           | 98.86                        | 37.32                        | 79.90     |\n| Claude | 3.5 Haiku  | 99.90           | 7.90                         | 77.96                        | 60.05     |\n\n\n---\n\n## Overview\n\nWe introduce **OffTopicEval**, a multilingual benchmark for evaluating **operational safety** of LLM-based agents.\n\n* **Operational Safety** = ability to accept **in-domain (ID)** queries and refuse **out-of-domain (OOD)** queries.\n* **Challenge:** Even top-performing LLMs fail on adaptive OOD queries (queries rewritten to look in-domain).\n* **Scale:** 21 agents × 220K test samples (ID + direct OOD + adaptive OOD).\n* **Languages:** English, Chinese, Hindi.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"resources/overview.png\" style=\"width: 85%;\"\u003e\u003c/p\u003e  \n\n---\n\n## Data\n\n* **ID queries:** 50 per agent × 3 languages = 150 × 21 = 3,150.\n* **Direct OOD queries:** ~3,351 from MMLU × 3 languages = 10,053.\n* **Adaptive OOD queries:** adversarially transformed → 211,113 samples.\n\nData includes:\n\n* **Direct OODs:** From filtered MMLU (factual MCQs).\n* **Adaptive OODs:** Prompt-laundered using Llama-70B.\n* **ID queries:** Generated by ChatGPT-5, manually verified.\n* **Multilingual:** Translations from Global-MMLU (Zh, Hi).\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"resources/tsne.png\" style=\"width: 85%;\"\u003e\u003c/p\u003e  \n\n\n---\n\n## Experiments\n\n* **20 open-weight LLMs**: GPT-OSS, Llama-3, Gemma-3, Qwen-3, Mistral, Phi.\n* **6 closed-weight LLMs**: GPT-5, GPT-4o-mini, Claude 4.1, Claude 3.5 Haiku, Gemini Pro, Gemini Flash-Lite.\n* **Metrics:**\n\n  * **AR\u003csub\u003eID\u003c/sub\u003e**: 
Acceptance Rate on ID.\n  * **RR\u003csub\u003eOOD\u003c/sub\u003e\u003csup\u003eD\u003c/sup\u003e**: Refusal Rate on direct OOD.\n  * **RR\u003csub\u003eOOD\u003c/sub\u003e\u003csup\u003eA\u003c/sup\u003e**: Refusal Rate on adaptive OOD.\n  * **OS**: Harmonic mean of AR\u003csub\u003eID\u003c/sub\u003e and RR\u003csub\u003eOOD\u003c/sub\u003e, where RR\u003csub\u003eOOD\u003c/sub\u003e is the average of the direct and adaptive refusal rates.\n\n\n## Citation\n\nIf you find our work useful, please cite:\n\n```bibtex\n@article{lei2025offtopiceval,\n  title={OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!},\n  author={Lei, Jingdi and Gumma, Varun and Bhardwaj, Rishabh and Lim, Seok Min and Li, Chuan and Zadeh, Amir and Poria, Soujanya},\n  year={2025},\n  journal={arXiv preprint arXiv:2509.26495},\n  url={https://arxiv.org/abs/2509.26495}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeclare-lab%2Fofftopiceval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeclare-lab%2Fofftopiceval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeclare-lab%2Fofftopiceval/lists"}