{"id":14041971,"url":"https://github.com/chawins/pal","last_synced_at":"2025-07-27T15:31:04.840Z","repository":{"id":222107853,"uuid":"756216419","full_name":"chawins/pal","owner":"chawins","description":"PAL: Proxy-Guided Black-Box Attack on Large Language Models","archived":false,"fork":false,"pushed_at":"2024-08-17T00:21:35.000Z","size":1003,"stargazers_count":41,"open_issues_count":2,"forks_count":4,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-08-17T01:42:28.893Z","etag":null,"topics":["adversarial-attacks","jailbreak","llm","openai-api","red-teaming"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2402.09674","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chawins.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-12T08:02:49.000Z","updated_at":"2024-08-17T00:21:38.000Z","dependencies_parsed_at":"2024-02-12T12:26:16.348Z","dependency_job_id":"cb085e42-fbda-4c68-b60c-7c58de936e5a","html_url":"https://github.com/chawins/pal","commit_stats":null,"previous_names":["chawins/pal"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chawins%2Fpal","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chawins%2Fpal/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chawins%2Fpal/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chawins%2Fpal/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/chawins","download_url":"https://codeload.github.com/chawins/pal/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227814268,"owners_count":17823874,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adversarial-attacks","jailbreak","llm","openai-api","red-teaming"],"created_at":"2024-08-12T08:00:41.854Z","updated_at":"2024-12-02T22:31:27.795Z","avatar_url":"https://github.com/chawins.png","language":"Python","funding_links":[],"categories":["Repositories"],"sub_categories":[],"readme":"# PAL: Proxy-Guided Black-Box Attack on Large Language Models\n\nChawin Sitawarin\u003csup\u003e1\u003c/sup\u003e \u0026nbsp; Norman Mu\u003csup\u003e1\u003c/sup\u003e \u0026nbsp; David Wagner\u003csup\u003e1\u003c/sup\u003e \u0026nbsp; Alexandre Araujo\u003csup\u003e2\u003c/sup\u003e\n\n\u003csup\u003e1\u003c/sup\u003eUniversity of California, Berkeley \u0026nbsp; \u003csup\u003e2\u003c/sup\u003eNew York University\n\n\u003e UPDATE (March 17, 2024): This [change](https://twitter.com/brianryhuang/status/1763438814515843119) made by OpenAI may affect the success of this attack.\n\n## Abstract\n\nLarge Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated concerning capabilities to generate harmful content when manipulated. While techniques like safety fine-tuning aim to minimize harmful use, recent works have shown that LLMs remain vulnerable to attacks that elicit toxic responses. 
In this work, we introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting. In particular, it relies on a surrogate model to guide the optimization and a sophisticated loss designed for real-world LLM APIs. Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art. We also propose GCG++, an improvement to the GCG attack that reaches 94% ASR on white-box Llama-2-7B, and the Random-Search Attack on LLMs (RAL), a strong but simple baseline for query-based attacks. We believe the techniques proposed in this work will enable more comprehensive safety testing of LLMs and, in the long term, the development of better security guardrails.\n\n## Install Dependencies\n\nNecessary packages with recommended versions are listed in `requirements.txt`. Run `pip install -r requirements.txt` to install these packages.\n\nTo install manually instead, note that our code is built on top of the [TDC 2023 starter kit](https://github.com/centerforaisafety/tdc2023-starter-kit/tree/main/red_teaming).\nInstall all the packages required there first, then install the additional dependencies below.\n\n```bash\npip install python-dotenv anthropic tenacity google-generativeai num2words bitsandbytes tiktoken sentencepiece torch_optimizer absl-py ml_collections jaxtyping cohere openai\npip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes\npip install transformers fschat\n```\n\n## Example\n\n```bash\n# Run GCG attack on llama-2-7b-chat-hf model with a small batch size of 16\nbash example_run_main.sh\n# Gather experiment results and print them as a table (more detail later)\npython gather_results.py\n```\n\n- The main file uses `ml_collections`'s `ConfigDict` for attack-related parameters and Python's standard `argparse` for the other parameters (selecting scenario and behaviors, etc.).\n- Each attack comes with its own 
config file in `./configs/ATTACK_NAME.py`.\n- `--behaviors 0 1 3`: Use the `--behaviors` flag to specify which behaviors to attack (here, the behaviors at indices 0, 1, and 3).\n\n### Where to find the attack results\n\n- Log path is given by `./results/\u003cMODEL\u003e/\u003cATTACK\u003e/\u003cEXP\u003e/\u003cSCENARIO\u003e_\u003cBEHAVIOR\u003e.jsonl`. Example: `./results/Llama-2-7b-chat-hf/ral/len20_100step_seed20_static_bs512_uniform_t1.0_c8-1/Toxicity_0.jsonl`\n- The default log dir is `./results/`, but it can be changed with the `--log_dir` flag.\n- `\u003cATTACK\u003e` and `\u003cEXP\u003e` are the attack name and experiment name defined in the attack file (e.g., `./src/attacks/gcg.py`). See `_get_name_tokens()`.\n\n### Reproducibility\n\n**When the random seed is set, the following step is unnecessary.**\nTo (supposedly) remove the randomness for further debugging, we set the following flags.\n\nIn the bash script or before running `main.py`:\n\n```bash\nexport CUBLAS_WORKSPACE_CONFIG=:4096:8\n```\n\nIn `main.py`:\n\n```python\ntorch.use_deterministic_algorithms(True)\n```\n\nEven when these two are enabled, we observe a slight difference between gradients computed with and without KV-cache (may also be due to half precision).\nFor example, in the GCG attack, the order of the top-k can shift slightly when k is large, but overall most of the tokens are the same.\nThis can result in a different adversarial suffix.\n\n## Code Structure\n\n### Main Files\n\n- `main.py`: Main file for running attacks.\n- `gather_results.py`: Gather results from log files and print them as a table.\n\n### `Src`\n\nMost of the attack and model code is in `src/`.\n\n- `attacks` contains all the attack algorithms. To add a new attack, create a new file in this directory, then import your attack and add it to `_ATTACKS_DICT` in `attacks/__init__.py`. We highly recommend extending the `BaseAttack` class in `attacks/base.py` for your attack. 
See `attacks/gcg.py` or `attacks/ral.py` for examples.\n  - `attacks/gcg.py`: Our GCG++ attack, built from a minimal version of the original GCG attack ([code](https://github.com/llm-attacks/llm-attacks), [paper](https://arxiv.org/abs/2307.15043)).\n  - `attacks/ral.py`: Our RAL attack.\n  - `attacks/pal.py`: Our PAL attack.\n- `models` contains various model interfaces.\n- `utils` contains utility functions called by main files or shared across the other modules.\n\n## Attacks\n\n### PAL Attack\n\nTo fine-tune the proxy model, set `config.finetune=True`. Below are the available fine-tuning options.\n\n- Fine-tune with pure `bfloat16`: `config.pure_bf16=True`. This is recommended and uses much less memory than `float16`.\n- Fine-tune with mixed precision (`float16`): `config.use_fp16=True`. `use_fp16` and `pure_bf16` cannot both be `True` at the same time.\n- Fine-tune with PEFT: `config.use_peft=True`. This is not compatible with `use_fp16` or `pure_bf16` (yet).\n- Fine-tune with PEFT and quantization (`int8`): `config.use_peft=True` and `config.quantize=True`.\n\nNotes\n\n- Use a larger learning rate when fine-tuning with PEFT (e.g., `1e-3`).\n- For 7B models with `pure_bf16` on one A100, keep `config.mini_batch_size \u003c= 128` and `config.proxy_tune_bs \u003c 64`. A `proxy_tune_bs` of 64 will fail on some longer prompts. Use a `proxy_tune_bs` of 32 to be safe.\n- A 7B model cannot be trained with `use_fp16` on one A100, even with a batch size of 1.\n\n### OpenAI API\n\n- Setting `seed` and `temperature` to `0` does not guarantee that the results are deterministic. This makes the loss computation much more difficult to implement and debug. 
We include various checks, catches, and warnings to prevent errors, but some corner cases may still exist.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchawins%2Fpal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchawins%2Fpal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchawins%2Fpal/lists"}