{"id":23934059,"url":"https://github.com/allenai/wildguard","last_synced_at":"2025-10-13T15:56:19.657Z","repository":{"id":246713456,"uuid":"814437286","full_name":"allenai/wildguard","owner":"allenai","description":"Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs","archived":false,"fork":false,"pushed_at":"2024-12-02T17:32:12.000Z","size":101,"stargazers_count":91,"open_issues_count":3,"forks_count":11,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-22T01:24:31.080Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/allenai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-13T02:45:34.000Z","updated_at":"2025-09-14T19:47:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"afeb39d1-f4ed-44b5-90b7-62eb040b0461","html_url":"https://github.com/allenai/wildguard","commit_stats":null,"previous_names":["allenai/wildguard"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/allenai/wildguard","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fwildguard","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fwildguard/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fwildguard/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fwildguard/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/a
llenai","download_url":"https://codeload.github.com/allenai/wildguard/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fwildguard/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279015939,"owners_count":26085777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-06T00:30:05.258Z","updated_at":"2025-10-13T15:56:19.629Z","avatar_url":"https://github.com/allenai.png","language":"Python","funding_links":[],"categories":["Tools","Building","A01_文本生成_文本对话"],"sub_categories":["Safety","Tools","大语言对话模型及数据"],"readme":"# WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2406.18495\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/📝-Paper-blue\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/datasets/allenai/wildguardmix\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/🤗-Data-orange\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/allenai/wildguard\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/🤗-Model-green\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n**Authors:**\n[Seungju Han](https://seungjuhan.me) 
⭐,\n[Kavel Rao](https://kavelrao.dev) ⭐,\n[Allyson Ettinger](https://aetting.github.io) ☀️,\n[Liwei Jiang](https://liweijiang.me) ☀️,\n[Yuchen Lin](https://yuchenlin.xyz/),\n[Nathan Lambert](https://www.natolambert.com/),\n[Yejin Choi](https://homes.cs.washington.edu/~yejin/),\n[Nouha Dziri](https://nouhadziri.github.io/)\n\n⭐ Co-first authors, ☀️ co-second authors\n\n🌟 WildGuard will appear at NeurIPS 2024 Datasets \u0026 Benchmarks! 🌟\n\n[WildGuard](https://arxiv.org/pdf/2406.18495) is a safety classification model for user-model chat exchanges. It can classify prompt harmfulness, response harmfulness, and whether a response is a refusal to answer the prompt.\n\nPlease see our companion repository [Safety-Eval](https://github.com/allenai/safety-eval) for the details of evaluations run in the WildGuard paper.\n\n## Installation\n\n```bash\npip install wildguard\n```\n\n## Quick Start\n\n```python\nfrom wildguard import load_wildguard\n\nif __name__ == '__main__':\n    # Load the model\n    wildguard = load_wildguard()\n\n    # Prepare your input\n    items = [\n        {\"prompt\": \"How do I make a bomb?\", \"response\": \"Sorry, I can't help with that.\"},\n        {\"prompt\": \"What's the weather like today?\"},\n    ]\n\n    # Classify the items\n    results = wildguard.classify(items)\n\n    # Print the results\n    for item, result in zip(items, results):\n        print(f\"Prompt: {item['prompt']}\")\n        print(f\"Prompt harmfulness: {result['prompt_harmfulness']}\")\n        if 'response' in item:\n            print(f\"Response: {item['response']}\")\n            print(f\"Response harmfulness: {result['response_harmfulness']}\")\n            print(f\"Response refusal: {result['response_refusal']}\")\n        print(\"---\")\n\n# Example output\n\"\"\"\nPrompt: How do I make a bomb?\nPrompt harmfulness: harmful\nResponse: Sorry, I can't help with that.\nResponse harmfulness: unharmful\nResponse refusal: refusal\n---\nPrompt: What's the weather like 
today?\nPrompt harmfulness: unharmful\n---\n\"\"\"\n```\n\n## Features\n\n- Support prompt-only or prompt+response inputs\n- Classify prompt harmfulness\n- Classify response harmfulness\n- Detect response refusals\n- Support both VLLM and HuggingFace backends\n\n## User Guide\n\n### Loading the Model\n\nFirst, import and load the WildGuard model:\n\n```python\nfrom wildguard import load_wildguard\n\nwildguard = load_wildguard()\n```\n\nBy default, this loads a VLLM-backed model. To use a HuggingFace model instead, pass `use_vllm=False`:\n\n```python\nwildguard = load_wildguard(use_vllm=False)\n```\n\n### Classifying Items\n\nTo classify items, prepare a list of dictionaries, each with a 'prompt' key and optionally a 'response' key:\n\n```python\nitems = [\n    {\"prompt\": \"How's the weather today?\", \"response\": \"It's sunny and warm.\"},\n    {\"prompt\": \"How do I hack into a computer?\"},\n]\n\nresults = wildguard.classify(items)\n```\n\n### Interpreting Results\n\nThe `classify` method returns a list of dictionaries. Each dictionary contains the following keys:\n\n- `prompt_harmfulness`: Either 'harmful' or 'unharmful'\n- `response_harmfulness`: One of 'harmful', 'unharmful', or None (if no response was provided)\n- `response_refusal`: One of 'refusal', 'compliance', or None (if no response was provided)\n- `is_parsing_error`: A boolean indicating whether there was an error parsing the model output\n\n### Adjusting Batch Size\n\nYou can adjust the batch size when loading the model. 
For a HuggingFace model this changes the inference batch size,\nand for both the HuggingFace and VLLM backends the save function is called after every `batch_size` items.\n\n```python\nwildguard = load_wildguard(batch_size=32)\n```\n\n### Using a Specific Device\n\nIf using a HuggingFace model, you can specify the device:\n\n```python\nwildguard = load_wildguard(use_vllm=False, device='cpu')\n```\n\n### Providing a Custom Save Function\n\nYou can provide a custom save function to save intermediate results during classification:\n\n```python\nimport json\n\ndef save_results(results: list[dict]):\n    # Write one JSON object per line; adjust the path for your system.\n    with open(\"/temp/intermediate_results.json\", \"w\") as f:\n        for item in results:\n            f.write(json.dumps(item) + \"\\n\")\n\nwildguard.classify(items, save_func=save_results)\n```\n\n## Best Practices\n\n1. Use the VLLM backend for better performance when possible.\n2. Handle potential errors by checking the `is_parsing_error` field in the results.\n3. When dealing with large datasets, consider passing a custom save function and setting a batch size other than -1, so that results are saved after each batch and are not lost if an error occurs.\n\n## Documentation\n\nFor additional documentation, please see our [API Reference](docs/api_reference.md) with detailed method specifications.\n\nAdditionally, we provide an example of how to use WildGuard as a *safety filter to guard another model's inference* at [examples/wildguard_filter](examples/wildguard_filter).\n\n## Citation\n\nIf you find WildGuard helpful, please feel free to cite our work!\n\n```\n@misc{wildguard2024,\n      title={WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs}, \n      author={Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri},\n      year={2024},\n      eprint={2406.18495},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2406.18495}, 
\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallenai%2Fwildguard","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fallenai%2Fwildguard","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallenai%2Fwildguard/lists"}