{"id":14041934,"url":"https://github.com/AI-secure/AgentPoison","last_synced_at":"2025-07-27T15:31:09.310Z","repository":{"id":246066134,"uuid":"775933440","full_name":"AI-secure/AgentPoison","owner":"AI-secure","description":"[NeurIPS 2024] Official implementation for \"AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning\"","archived":false,"fork":false,"pushed_at":"2024-12-02T20:34:51.000Z","size":596398,"stargazers_count":66,"open_issues_count":4,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-02T21:32:11.899Z","etag":null,"topics":["llm-agent","red-team","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"https://billchan226.github.io/AgentPoison","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AI-secure.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-22T10:39:10.000Z","updated_at":"2024-12-02T20:34:57.000Z","dependencies_parsed_at":"2024-12-02T21:43:04.651Z","dependency_job_id":null,"html_url":"https://github.com/AI-secure/AgentPoison","commit_stats":null,"previous_names":["billchan226/agentpoison","ai-secure/agentpoison"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-secure%2FAgentPoison","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-secure%2FAgentPoison/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-secure%2FAgentPoison/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-secure%2FAgentPoison/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AI-secure","download_url":"https://codeload.github.com/AI-secure/AgentPoison/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227814280,"owners_count":17823876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm-agent","red-team","retrieval-augmented-generation"],"created_at":"2024-08-12T08:00:41.069Z","updated_at":"2025-07-27T15:31:09.297Z","avatar_url":"https://github.com/AI-secure.png","language":"Python","funding_links":[],"categories":["Repositories","AI Red Teaming (Testing AI Targets)"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/agentpoison_logo.jpg\" width=\"32%\"\u003e\n\u003c/div\u003e\n\n## [AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning](https://billchan226.github.io/AgentPoison)\n\n🔥🔥 For recent news, please check the **[Project page](https://billchan226.github.io/AgentPoison.html)**!\n\n[![Project Page](https://img.shields.io/badge/Project-Page-Green)](https://billchan226.github.io/AgentPoison.html)\n[![Arxiv](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/pdf/2407.12784)\n[![License: MIT](https://img.shields.io/badge/License-MIT-g.svg)](https://opensource.org/licenses/MIT)\n[![GitHub Stars](https://img.shields.io/github/stars/BillChan226/AgentPoison?style=social)](https://github.com/BillChan226/AgentPoison/stargazers)\n\nThis repository provides the 
official PyTorch implementation of the following paper:\n\u003e [**AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning**]() \u003cbr\u003e\n\u003e [Zhaorun Chen](https://billchan226.github.io/)\u003csup\u003e1\u003c/sup\u003e,\n\u003e [Zhen Xiang](https://zhenxianglance.github.io/)\u003csup\u003e2\u003c/sup\u003e,\n\u003e [Chaowei Xiao](https://xiaocw11.github.io/) \u003csup\u003e3\u003c/sup\u003e,\n\u003e [Dawn Song](https://dawnsong.io/) \u003csup\u003e4\u003c/sup\u003e,\n\u003e [Bo Li](https://aisecure.github.io/)\u003csup\u003e1,2\u003c/sup\u003e\n\u003e\n\u003e \u003csup\u003e1\u003c/sup\u003eUniversity of Chicago, \u003csup\u003e2\u003c/sup\u003eUniversity of Illinois, Urbana-Champaign \u003cbr\u003e\n\u003csup\u003e3\u003c/sup\u003eUniversity of Wisconsin, Madison, \u003csup\u003e4\u003c/sup\u003eUniversity of California, Berkeley \u003cbr\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/method.png\" width=\"95%\"\u003e\n\u003c/div\u003e\n\n\n## :hammer_and_wrench: Installation\n\nRun the following commands to clone the repository and install the required packages:\n\n```\ngit clone https://github.com/BillChan226/AgentPoison.git\ncd AgentPoison\nconda env create -f environment.yml\nconda activate agentpoison\n```\n\n### RAG Embedder Checkpoints\n\nDownload the embedder checkpoints from the links below, then specify their paths in the `algo/config.yaml` file.\n\n| Embedder             | HF Checkpoints   |\n| -------------------- | ------------------- |\n| [BERT](https://arxiv.org/pdf/1810.04805)    | [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) |\n| [DPR](https://arxiv.org/pdf/2004.04906)     |  [facebook/dpr-question_encoder-single-nq-base](https://huggingface.co/facebook/dpr-question_encoder-single-nq-base) |\n| [ANCE](https://arxiv.org/pdf/2007.00808)     | 
[castorini/ance-dpr-question-multi](https://huggingface.co/castorini/ance-dpr-question-multi) |\n| [BGE](https://arxiv.org/pdf/2310.07554)   |  [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) |\n| [REALM](https://arxiv.org/pdf/2002.08909)   |  [google/realm-cc-news-pretrained-embedder](https://huggingface.co/google/realm-cc-news-pretrained-embedder) |\n| [ORQA](https://arxiv.org/pdf/1906.00300)   |  [google/realm-orqa-nq-openqa](https://huggingface.co/google/realm-orqa-nq-openqa) |\n\nYou can also use custom embedders (e.g., ones you fine-tuned yourself) as long as you specify their identifier and model path in the [config](algo/config.py).\n\n## :smiling_imp: Trigger Optimization\n\nAfter setting up the configuration for the embedders, you can run trigger optimization for any of the three agents with a command such as the following:\n```bash\npython algo/trigger_optimization.py --agent ad --algo ap --model dpr-ctx_encoder-single-nq-base --save_dir ./results  --ppl_filter --target_gradient_guidance --asr_threshold 0.5 --num_adv_passage_tokens 10 --golden_trigger -w -p\n```\nThe arguments are described below:\n\n| Argument             | Example             | Description   |\n| -------------------- | ------------------- | ------------- |\n| `--agent`    | `ad` | Specify the type of agent to red-team,  [`ad`, `qa`, `ehr`]. |\n| `--algo`     | `ap` | Trigger optimization algorithm to use, [`ap`, `cpa`]. |\n| `--model`     | `dpr-ctx_encoder-single-nq-base` | Target RAG embedder to optimize, see a complete list above. 
|\n| `--save_dir`   | `./result` | Path to save the optimized trigger and procedural plots |\n| `--num_iter`   | `1000` | Number of iterations to run each gradient optimization |\n| `--num_grad_iter`   | `30` | Number of gradient accumulation steps |\n| `--per_gpu_eval_batch_size`   | `64` | Batch size for trigger optimization |\n| `--num_cand`   | `100` | Number of discrete tokens sampled per optimization |\n| `--num_adv_passage_tokens`   | `10` | Number of tokens in the trigger sequence |\n| `--golden_trigger`   | `False` | Whether to start with a golden trigger (will overwrite `--num_adv_passage_tokens`) |\n| `--target_gradient_guidance`   | `True` | Whether to guide the token update with target model loss |\n| `--use_gpt`   | `False` | Whether to approximate target model loss via MC sampling |\n| `--asr_threshold`   | `0.5` | ASR threshold for target model loss |\n| `--ppl_filter`   | `True` | Whether to enable coherence loss filter for token sampling |\n| `--plot`   | `False` | Whether to plot the procedural optimization of the embeddings |\n| `--report_to_wandb`   | `True` | Whether to report the results to wandb |\n\n\n## :robot: Agent Experiment\n\nWe have modified the original code for [Agent-Driver](https://github.com/USC-GVL/Agent-Driver), [ReAct-StrategyQA](https://github.com/Jiuzhouh/Uncertainty-Aware-Language-Agent), and [EHRAgent](https://github.com/wshi83/EhrAgent) to support more RAG embedders and to add an interface for data poisoning. We provide unified dataset access for all three agents [here](https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K?usp=sharing). The inference commands for each agent are listed below.\n\n### :car: Agent-Driver\n\nFirst download the corresponding dataset from [here](https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K?usp=sharing) or the original [dataset host](https://drive.google.com/drive/folders/1BjCYr0xLTkLDN9DrloGYlerZQC1EiPie). 
Put the corresponding dataset in `agentdriver/data`. \nThen put the optimized trigger tokens [here](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L184); you can also set further attack parameters [here](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L187). Specifically, set `attack_or_not` to `False` to get the benign utility under attack.\n\nThen run the following script for inference:\n```bash\nsh scripts/agent_driver/run_inference.sh\n```\nThe motion planning results (ASR-r, ASR-a, and ACC) are printed when the program finishes. The planned trajectory will be saved to `./result`. Run the following command to get ASR-t:\n```bash\nsh scripts/agent_driver/run_evaluation.sh\n```\n\nWe provide more options for red-teaming agent-driver that cover **each individual component of an autonomous agent**, including [perception APIs](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L257), [memory module](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L295), [ego-states](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L327), and [mission goal](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/motion_planning.py#L341). \n\nYou first need to follow the instructions [here](https://github.com/USC-GVL/Agent-Driver) and fine-tune a motion planner based on GPT-3.5 using [OpenAI's API](https://platform.openai.com/docs/guides/fine-tuning). 
As an alternative, we provide a motion planner fine-tuned from [LLaMA-3](https://huggingface.co/meta-llama/Meta-Llama-3-8B) [here](https://huggingface.co/Zhaorun/LLaMA-2-Agent-Driver-Motion-Planner), so that agent inference can run completely offline. Set `use_local_planner` [here](https://github.com/BillChan226/AgentPoison/blob/485d9702295ac40010b9a692b22adae18071726c/agentdriver/planning/planning_agent.py#L58) to `True` to enable this.\n\n### :memo: ReAct-StrategyQA\n\nFirst download the corresponding dataset from [here](https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K?usp=sharing) or the StrategyQA [dataset](https://allenai.org/data/strategyqa). Put the corresponding dataset in `ReAct/database`. \nThen put the optimized trigger tokens [here](https://github.com/BillChan226/AgentPoison/blob/4de6c5ac5d3ea01f748aff85b9e8b844a3138eb3/ReAct/run_strategyqa_gpt3.5.py#L112). Run the following command to infer with the GPT backbone:\n```bash\npython ReAct/run_strategyqa_gpt3.5.py --model dpr --task_type adv\n```\nSimilarly, to infer with the LLaMA-3-70b backbone, first obtain an API key from [Replicate](https://replicate.com/) to access [LLaMA-3](https://replicate.com/meta/meta-llama-3-70b-instruct) and put it [here](https://github.com/BillChan226/AgentPoison/blob/4de6c5ac5d3ea01f748aff85b9e8b844a3138eb3/ReAct/run_strategyqa_llama3_api.py#L17), then run:\n```bash\npython ReAct/run_strategyqa_llama3_api.py --model dpr --task_type adv\n```\n\nSpecifically, set `--task_type` to `adv` to inject queries with the trigger and to `benign` to get the benign utility under attack. You can also run the corresponding commands through `scripts/react_strategyqa`. 
The results will be saved to a path indicated by `--save_dir`.\n\n#### Evaluation\n\nTo evaluate the red-teaming performance for StrategyQA, simply run the following command:\n```bash\npython ReAct/eval.py -p [RESPONSE_PATH]\n```\n\nwhere `RESPONSE_PATH` is the path to the response JSON file.\n\n### :man_health_worker: EHRAgent\n\nFirst download the corresponding dataset from [here](https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K?usp=sharing) and put it under `EhrAgent/database`. \nThen put the optimized trigger tokens [here](https://github.com/BillChan226/AgentPoison/blob/b8f9d6bb20de5a9fdad0047b85b2645aa9667785/EhrAgent/ehragent/main.py#L90). Run the following command to infer with GPT/LLaMA3:\n```bash\npython EhrAgent/ehragent/main.py --backbone gpt --model dpr --algo ap --attack\n```\n\nYou can set `--backbone` to `llama3` to infer with LLaMA3, and set `--attack` to `False` to get the benign utility under attack. You can also run the corresponding commands through `scripts/ehragent`. 
The results will be saved to a path indicated by `--save_dir`.\n\n#### Evaluation\n\nTo evaluate the red-teaming performance for EHRAgent, simply run the following command:\n```bash\npython EhrAgent/ehragent/eval.py -p [RESPONSE_PATH]\n```\n\nwhere `RESPONSE_PATH` is the path to the response JSON file.\n\nNote that for each agent, you need to run the experiments twice: once with the trigger to get ASR-r, ASR-a, and ASR-t, and once without the trigger to get ACC (benign utility).\n\n## :book: Acknowledgement\nPlease cite the paper as follows if you use the data or code from AgentPoison:\n```\n@inproceedings{chenagentpoison,\n  title={AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases},\n  author={Chen, Zhaorun and Xiang, Zhen and Xiao, Chaowei and Song, Dawn and Li, Bo},\n  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems}\n}\n```\n\n## :book: Contact\nPlease reach out to us if you have any suggestions or need any help in reproducing the results. You can submit an issue or pull request, or send an email to zhaorun@uchicago.edu.\n\n## :key: License\n\nThis repository is under [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAI-secure%2FAgentPoison","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAI-secure%2FAgentPoison","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAI-secure%2FAgentPoison/lists"}