{"id":13346571,"url":"https://github.com/RobustNLP/CipherChat","last_synced_at":"2025-03-12T08:31:11.613Z","repository":{"id":188186714,"uuid":"676834494","full_name":"RobustNLP/CipherChat","owner":"RobustNLP","description":"A framework to evaluate the generalization capability of safety alignment for LLMs","archived":false,"fork":false,"pushed_at":"2024-12-31T01:47:49.000Z","size":17186,"stargazers_count":577,"open_issues_count":0,"forks_count":63,"subscribers_count":10,"default_branch":"main","last_synced_at":"2024-12-31T02:35:43.649Z","etag":null,"topics":["alignment","chatgpt","gpt-4-0613","jailbreak","large-language-models","llm","security"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RobustNLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-10T05:55:17.000Z","updated_at":"2024-12-31T01:47:52.000Z","dependencies_parsed_at":"2024-08-02T01:29:39.173Z","dependency_job_id":null,"html_url":"https://github.com/RobustNLP/CipherChat","commit_stats":null,"previous_names":["robustnlp/cipherchat"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobustNLP%2FCipherChat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobustNLP%2FCipherChat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobustNLP%2FCipherChat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobustNLP%2FCipherChat/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RobustNLP","download_url":"https://codeload.github.com/RobustNLP/CipherChat/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243184454,"owners_count":20249981,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","chatgpt","gpt-4-0613","jailbreak","large-language-models","llm","security"],"created_at":"2024-07-29T20:01:23.445Z","updated_at":"2025-03-12T08:31:08.403Z","avatar_url":"https://github.com/RobustNLP.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003eCipherChat 🔐\u003c/h1\u003e\nA novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages – ciphers. \n\u003cbr\u003e   \u003cbr\u003e\n\nIf you have any questions, please feel free to email the first author: [Youliang Yuan](https://github.com/YouliangYuan).\n    \n## 👉 Paper\nFor more details, please refer to our paper [ICLR 2024](https://openreview.net/forum?id=MbfAK4s61A).\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"paper/cover.png\" alt=\"Logo\" width=\"500\"\u003e\n\u003c/div\u003e\n\n\u003ch3 align=\"center\"\u003eLOVE💗 and Peace🌊\u003c/h3\u003e\n\u003ch3 align=\"center\"\u003eRESEARCH USE ONLY✅ NO MISUSE❌\u003c/h3\u003e\n\n\n## Our results\nWe provide our results (query-response pairs) in `experimental_results`, these files can be loaded by `torch.load()`. 
Loading a file returns a list: the first element is the config, and the remaining elements are the query-response pairs.
```python
result_data = torch.load(filename)
config = result_data[0]
pairs = result_data[1:]
```

## 🛠️ Usage
✨An example run:
```bash
python3 main.py \
  --model_name gpt-4-0613 \
  --data_path data/data_en_zh.dict \
  --encode_method caesar \
  --instruction_type Crimes_And_Illegal_Activities \
  --demonstration_toxicity toxic \
  --language en
```

## 🔧 Argument Specification
1. `--model_name`: The name of the model to evaluate.

2. `--data_path`: The dataset to run on.

3. `--encode_method`: The cipher to use.

4. `--instruction_type`: The domain of the data.

5. `--demonstration_toxicity`: Whether to use toxic or safe demonstrations.

6. `--language`: The language of the data.

## 💡Framework
<div align="center">
  <img src="paper/Overview.png" alt="Logo" width="500">
</div>

Our approach presumes that, since human feedback and safety alignment are presented in natural language, a human-unreadable cipher can potentially bypass the safety alignment. Intuitively, we first teach the LLM to comprehend the cipher by designating it as a cipher expert and elucidating the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into the cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the LLM. Finally, we employ a rule-based decrypter to convert the model output from the cipher back into natural language.
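The encipher/decipher step described above can be sketched with a Caesar shift, one of the supported `--encode_method` options. This is an illustrative reimplementation for clarity, not the repository's actual code:

```python
# Illustrative Caesar cipher, sketching the encipher -> query -> decipher
# pipeline described above (not the repository's implementation).

def caesar_encipher(text: str, shift: int = 3) -> str:
    """Shift each ASCII letter forward by `shift`, leaving other chars as-is."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def caesar_decipher(text: str, shift: int = 3) -> str:
    """Invert caesar_encipher (the rule-based decryption step)."""
    return caesar_encipher(text, -shift)

query = "How are you?"
cipher_query = caesar_encipher(query)  # this enciphered form is sent to the model
print(cipher_query)                    # "Krz duh brx?"
assert caesar_decipher(cipher_query) == query
```

The model's ciphered response would be passed through `caesar_decipher` to recover natural-language text.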
## 📃Results
The query-response pairs from our experiments are stored as lists in the `experimental_results` folder and can be loaded with `torch.load()`.
<div align="center">
  <img src="paper/main_result_demo.jpg" alt="Logo" width="500">
</div>

### 🌰Case Study
<div align="center">
  <img src="paper/case.png" alt="Logo" width="500">
</div>

### 🫠Ablation Study
<div align="center">
  <img src="paper/ablation.png" alt="Logo" width="500">
</div>

### 🦙Other Models
<div align="center">
  <img src="paper/other_models.png" alt="Logo" width="500">
</div>

[![Star History Chart](https://api.star-history.com/svg?repos=RobustNLP/CipherChat&type=Date)](https://star-history.com/#RobustNLP/CipherChat&Date)

Community Discussion:
- Twitter: [AIDB](https://twitter.com/ai_database/status/1691655307892830417), [Jiao Wenxiang](https://twitter.com/WenxiangJiao/status/1691363450604457984)

## Citation

If you find our paper and tool interesting and useful, please feel free to give us a star and cite us:
```bibtex
@inproceedings{
yuan2024cipherchat,
title={{GPT}-4 Is Too Smart To Be Safe: Stealthy Chat with {LLM}s via Cipher},
author={Youliang Yuan and Wenxiang Jiao and Wenxuan Wang and Jen-tse Huang and Pinjia He and Shuming Shi and Zhaopeng Tu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=MbfAK4s61A}
}
```
Safety","PoC","Python"],"sub_categories":["Python"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRobustNLP%2FCipherChat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRobustNLP%2FCipherChat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRobustNLP%2FCipherChat/lists"}