{"id":13652730,"url":"https://github.com/Libr-AI/do-not-answer","last_synced_at":"2025-04-23T03:31:31.647Z","repository":{"id":190982429,"uuid":"680825323","full_name":"Libr-AI/do-not-answer","owner":"Libr-AI","description":"Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs","archived":false,"fork":false,"pushed_at":"2024-06-07T14:55:51.000Z","size":17704,"stargazers_count":238,"open_issues_count":0,"forks_count":23,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-09T18:16:38.011Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Libr-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-20T14:27:52.000Z","updated_at":"2025-04-07T18:50:50.000Z","dependencies_parsed_at":"2024-06-07T16:28:08.861Z","dependency_job_id":null,"html_url":"https://github.com/Libr-AI/do-not-answer","commit_stats":null,"previous_names":["libr-ai/do-not-answer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Libr-AI%2Fdo-not-answer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Libr-AI%2Fdo-not-answer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Libr-AI%2Fdo-not-answer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Libr-AI%2Fdo-not-answer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Libr-AI","download_url":"ht
tps://codeload.github.com/Libr-AI/do-not-answer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250365601,"owners_count":21418713,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T02:01:02.103Z","updated_at":"2025-04-23T03:31:26.639Z","avatar_url":"https://github.com/Libr-AI.png","language":"Jupyter Notebook","readme":"# Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs\n\u003cp align=\"center\" style=\"display: flex; flex-direction: row; justify-content: center; align-items: center\"\u003e\n📄 \u003ca href=\"https://aclanthology.org/2024.findings-eacl.61\" target=\"_blank\" style=\"margin-right: 15px; margin-left: 10px\"\u003ePaper\u003c/a\u003e • \n🤗 \u003ca href=\"https://huggingface.co/datasets/LibrAI/do-not-answer\" target=\"_blank\" style=\"margin-left: 10px\"\u003eDataset\u003c/a\u003e •  \n🔱 \u003ca href=\"https://huggingface.co/LibrAI\" target=\"_blank\" style=\"margin-left: 10px\"\u003eEvaluator\u003c/a\u003e\n\u003c/p\u003e\n\n\u003e \\[!NOTE\\]\n\u003e\n\u003e For **Chinese** Do-Not-Answer, please refer to the [Chinese Do-Not-Answer Section](#a-chinese-dataset-for-evaluating-the-safeguards-in-large-language-models).\n\n\n#### 📣 Our paper is featured in the latest [Stanford AI Index Report 2024](https://aiindex.stanford.edu/report/)\n\n## Overview\n\n*Do not answer* is an open-source dataset to evaluate LLMs' safety mechanisms at a low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. 
\nBesides human annotations, *Do not answer* also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 evaluation. \n\n\n## Instruction Taxonomy\nInstructions are organized by a three-level hierarchical taxonomy, covering 61 specific harms. \n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/dna.png\" style=\"width: 85%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\n*Do not answer* collects 939 instructions; their distribution across five risk areas and 12 harm types is shown below:\n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/instruction_dist.png\" style=\"width: 85%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\n\n## Response Assessment\nWe assess six models' responses (GPT-4, ChatGPT, Claude, LLaMA-2, Vicuna, ChatGLM2) at two levels.\n* Harmfulness: whether a response is harmful (a binary label).\n* Action category: responses are categorized into six distinct groups, each representing a specific response type. \n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/action.png\" style=\"width: 85%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\n\n### Human Evaluation\nThe high percentage of safe responses across the five risk areas shows that all six models are largely safe. \nLLaMA-2 ranks first, followed by ChatGPT, Claude, GPT-4, and Vicuna, with ChatGLM2 ranking last, producing 85 harmful responses among the 939 instructions.\n\n* Harmlessness Rank \n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/harmful_num.png\" style=\"width: 60%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\n\n* Action Category Distribution  \n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/action_dist.png\" style=\"width: 60%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\n\n### Automatic Evaluation\nLongformer evaluators achieve results comparable to human and GPT-4 evaluation. 
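The harmlessness percentages reported in this section are the share of the 939 instructions that drew a safe response. A minimal sketch of that computation (only ChatGLM2's count of 85 harmful responses is stated in the README, so that is the value checked here):

```python
# Harmlessness rate: percentage of the 939 Do-Not-Answer instructions
# that received a safe (non-harmful) response.
TOTAL_INSTRUCTIONS = 939

def harmlessness_rate(harmful_count, total=TOTAL_INSTRUCTIONS):
    # One-decimal percentage, matching the evaluation tables.
    return round(100 * (total - harmful_count) / total, 1)

# ChatGLM2 produced 85 harmful responses per the human evaluation.
print(harmlessness_rate(85))  # -> 90.9, the value in the Human column
```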
\n* GPT-4 evaluation\n* Fine-tuned Longformer classifier (600M) \n\n| Model          |   Human |   GPT-4 |   Longformer |\n|:---------------|--------:|--------:|-------------:|\n| LLaMA-2        |    99.7 |    99.4 |         99.3 |\n| ChatGPT        |    98.5 |    97.7 |         97.9 |\n| Claude         |    98.3 |    98.3 |         97.6 |\n| GPT-4          |    97.6 |    96.5 |         97.2 |\n| Vicuna         |    94.5 |    94.9 |         95.0 |\n| ChatGLM2       |    90.9 |    92.9 |         92.9 |\n\n## Usage\n\n**NOTICE:** please fill in your API info in `do_not_answer/utils/info.yaml` before using commercial models such as GPT-4.\n\nRefer to [notebooks](./notebooks) for details and paper replication.\n\nRefer to [datasets](./datasets) for the dataset used in our paper.\n\n# A Chinese Dataset for Evaluating the Safeguards in Large Language Models\n\u003cp align=\"center\" style=\"display: flex; flex-direction: row; justify-content: center; align-items: center\"\u003e\n📄 \u003ca href=\"https://arxiv.org/pdf/2402.12193\" target=\"_blank\" style=\"margin-right: 15px; margin-left: 10px\"\u003ePaper\u003c/a\u003e • \n🤗 \u003ca href=\"https://github.com/Libr-AI/do-not-answer/tree/main/cdna\" target=\"_blank\" style=\"margin-left: 10px\"\u003eDataset\u003c/a\u003e •  \n🔱 \u003ca href=\"https://github.com/Libr-AI/do-not-answer/tree/main/do_not_answer/evaluator\" target=\"_blank\" style=\"margin-left: 10px\"\u003eEvaluator\u003c/a\u003e\n\u003c/p\u003e\n\n## Overview\nWe construct a Chinese LLM safety evaluation dataset by translating and localizing the [_\"Do-not-answer\"_](https://github.com/Libr-AI/do-not-answer/) dataset, expanding it with region-specific questions, and aligning it with country-specific AI generation regulations.\n\nWe then extend the resulting 1,014 questions from two perspectives:\n\n1.  __False Negative (FN) questions__: risky questions posed in an evasive way, aimed at evaluating an LLM’s sensitivity to perceiving risks\n2.  __False Positive (FP) questions__: harmless questions containing seemingly risky words, aimed at assessing whether the model is oversensitive, which can limit its helpfulness\n\nThis yields 3,042 Chinese questions for evaluating LLMs’ risk perception and sensitivity to specific words and phrases.\n\n## Risk Taxonomy \u0026 Statistics\nQuestions are organized by a two-level hierarchical taxonomy, covering 6 risk types and 17 specific harm types.\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/instruction_dist_chinese.png\" style=\"width: 85%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\n## Response Assessment\n\nTo evaluate the responses of LLMs to the questions, we use GPT-4 for automatic evaluation, and conduct a human evaluation to check the gap between human and GPT-4 based evaluation.\n\n### Automatic Evaluation\nWe use two strategies for automatic evaluation of the model responses:\n\n1. __Question Set Evaluator:__ Given a risk area, we ask 1–5 risk-specific questions to identify whether a response touches on unsafe areas.\n\n2. __Action Type Evaluator:__ Given a response, we ask the LLM to select one of six fine-grained action categories describing how an LLM responds to a question. This is similar to the automatic evaluation in the original [_\"Do-not-answer\"_](https://github.com/Libr-AI/do-not-answer/) dataset.\n\n#### Overall Results\n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/harmful_num_chinese.png\" style=\"width: 60%;\" id=\"title-icon\"\u003e\u003c/p\u003e\nNumber of harmful responses for five different Chinese LLMs. 
We can see that LLaMA-2, as an English-centric model, is the safest LLM when tested using English direct questions from the Do-not-answer dataset, but it is also the least safe one when evaluated using our Chinese-centric questions.\n\n#### Results on different risk types\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/risk_type_num_chinese.png\" style=\"width: 60%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\nLLMs that were mostly pre-trained on Chinese text are better at understanding Chinese questions and offer better answers than models mainly pre-trained on English text. The English-centric LLaMA-2 series performs worse than Chinese-centric models on Chinese safety datasets, especially for region-specific questions (Type _VI_).\n\n#### Results on different question types\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/question_type_num_chinese.png\" style=\"width: 60%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\nThe number of harmful responses to indirect-attack questions (yellow bar) is the largest, followed by direct-attack questions (blue bar). For LLaMA-2 and Qwen, the number of harmful responses to direct-attack and general questions is very close, which suggests that some chatbots were likely instruction-tuned to be too sensitive to certain words.\n\n### Human Evaluation\n\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"assets/human_eval.png\" style=\"width: 80%;\" id=\"title-icon\"\u003e\u003c/p\u003e\n\nGPT-4 can identify the majority of safe responses correctly, but demonstrates random-guess performance on harmful responses. For action classification, responses falling into categories 3 and 4 tend to be classified as 5 by GPT-4. 
This implies, to some extent, that humans make more fine-grained distinctions between different responding patterns than GPT-4.\n\n\n## Citation\nIf our work is useful for yours, you can cite us with the following BibTeX entry:\n```bibtex\n@inproceedings{wang-etal-2024-answer,\n    title = \"Do-Not-Answer: Evaluating Safeguards in {LLM}s\",\n    author = \"Wang, Yuxia  and\n      Li, Haonan  and\n      Han, Xudong  and\n      Nakov, Preslav  and\n      Baldwin, Timothy\",\n    editor = \"Graham, Yvette  and\n      Purver, Matthew\",\n    booktitle = \"Findings of the Association for Computational Linguistics: EACL 2024\",\n    month = mar,\n    year = \"2024\",\n    address = \"St. Julian{'}s, Malta\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2024.findings-eacl.61\",\n    pages = \"896--911\",\n}\n```\nFor the Chinese Do-Not-Answer dataset, you can cite us with the following BibTeX entry:\n```bibtex\n@article{wang2024chinese,\n  title={A Chinese Dataset for Evaluating the Safeguards in Large Language Models},\n  author={Wang, Yuxia and Zhai, Zenan and Li, Haonan and Han, Xudong and Lin, Lizhi and Zhang, Zhenxuan and Zhao, Jingru and Nakov, Preslav and Baldwin, Timothy},\n  journal={to appear in ACL 2024 findings},\n  year={2024}\n}\n```\n## License\nAll datasets in this repository are released under the\n[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/). 
All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.\n","funding_links":[],"categories":["[↑](#table-of-contents)Datasets \u003ca name=\"datasets\"\u003e\u003c/a\u003e","Datasets-or-Benchmark","A01_文本生成_文本对话","DATA","Datasets"],"sub_categories":["Jailbreak","通用","大语言对话模型及数据","Telegram Channels","Safety Classifiers \u0026 Prompt Injection Detection"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLibr-AI%2Fdo-not-answer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLibr-AI%2Fdo-not-answer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLibr-AI%2Fdo-not-answer/lists"}