{"id":45100233,"url":"https://github.com/parameterlab/trap","last_synced_at":"2026-04-04T17:06:19.409Z","repository":{"id":225308930,"uuid":"760085235","full_name":"parameterlab/trap","owner":"parameterlab","description":"Source code of \"TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification\", ACL2024 (findings)","archived":false,"fork":false,"pushed_at":"2024-11-20T14:53:30.000Z","size":1054,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-20T15:41:15.873Z","etag":null,"topics":["acl2024","adversarial-attacks","fingerprint","fingerprinting","large-language-models","llm","research"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/parameterlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-19T18:55:39.000Z","updated_at":"2024-11-20T14:53:34.000Z","dependencies_parsed_at":"2024-07-27T14:43:00.161Z","dependency_job_id":"b869f820-e7ea-400f-9169-6a08c2e8dafb","html_url":"https://github.com/parameterlab/trap","commit_stats":null,"previous_names":["parameterlab/trap"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/parameterlab/trap","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parameterlab%2Ftrap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parameterlab%2Ftrap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parameterlab%2Ftrap/releases"
,"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parameterlab%2Ftrap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/parameterlab","download_url":"https://codeload.github.com/parameterlab/trap/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parameterlab%2Ftrap/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30085788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T15:40:14.053Z","status":"ssl_error","status_checked_at":"2026-03-04T15:40:13.655Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acl2024","adversarial-attacks","fingerprint","fingerprinting","large-language-models","llm","research"],"created_at":"2026-02-19T20:00:35.051Z","updated_at":"2026-03-04T16:00:51.196Z","avatar_url":"https://github.com/parameterlab.png","language":"Jupyter Notebook","funding_links":[],"categories":["[↑](#table-of-contents)Tools \u003ca name=\"tools\"\u003e\u003c/a\u003e"],"sub_categories":["Model Identification \u0026 Provenance (Fingerprinting)"],"readme":"# 🪤 TRAP Source Code 🍯\n\n[![arXiv](https://img.shields.io/badge/arXiv-2402.12991-b31b1b.svg)](https://arxiv.org/abs/2402.12991)\n\n![Logos](img/logos.png)\n\n\nSource code of the paper [TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box 
Identification](https://gubri.eu/publication/trap/) by [Martin Gubri](https://gubri.eu/), [Dennis Ulmer](http://dennisulmer.eu/), [Hwaran Lee](https://hwaranlee.github.io/), [Sangdoo Yun](https://sangdooyun.github.io/) and [Seong Joon Oh](https://coallaoh.github.io/).\n\nDeveloped at [Parameter Lab](https://parameterlab.de/) with the support of [Naver AI Lab](https://clova.ai/en/research/publications.html).\n\n\n\n## Table of Contents\n\n- [🪤 TRAP in a nutshell](#-trap-in-a-nutshell)\n  - [🦹 Motivation](#-motivation)\n  - [🥷 Problem: Black-Box Identity Verification (BBIV)](#-problem-black-box-identity-verification-bbiv)\n  - [🪤 Solution: Targeted Random Adversarial Prompt (TRAP)](#-solution-targeted-random-adversarial-prompt-trap)\n- [Citation](#citation)\n- [Installation](#installation)\n- [Experiments](#experiments)\n- [Credits](#credits)\n\n## 🪤 TRAP in a nutshell\n\n### 🦹 Motivation\n\n- 💧 Private LLMs that cost millions of dollars to train may be leaked by internal or external threats. \n- 🐍 Open-source LLMs are distributed under restrictive licenses that may not be respected. For instance, Microsoft's Orca-2 is distributed under a non-commercial licence, and Meta's usage policy for Llama-2 forbids deceptive uses.\n- 🎭 LLMs do not reliably disclose their identity. For instance, Mixtral-8x7B identifies itself as FAIR’s BlenderBot 3.0, and we can disguise GPT-3.5 and GPT-4 as Anthropic's Claude or as Llama-2, using deceptive system prompts.\n\nTherefore, we need specific tools to ensure **compliance**. \n\n### 🥷 Problem: Black-Box Identity Verification (BBIV)\n\nA reference LLM (either closed or open) can be deployed silently by a third party to power an application. 
So, we propose a new task, BBIV, of detecting the usage of an LLM in a third-party application, which is critical for assessing compliance.\n\n**Question:** Does this ![third-party application](img/badge_third_party.svg) use our ![reference LLM](img/badge_ref_llm.svg)?\n\n![](img/task-bbiv.v2.png)\n\n\n### 🪤 Solution: Targeted Random Adversarial Prompt (TRAP)\n\nTo solve the BBIV problem, we propose a novel method, TRAP, that uses tuned prompt suffixes to reliably force a specific LLM to answer in a pre-defined way.\n\nTRAP is composed of:\n- ![Instruction](img/badge_instruction.svg) a closed-ended question\n- ![Suffix](img/badge_suffix.svg)\n  - 🔥 20 tunable tokens \n  - ⚙️ optimised on the ![reference LLM](img/badge_ref_llm.svg)\n  - 🎯 to output a specific ![target answer](img/badge_target.svg) chosen at random, here 314\n\n\n![Schema method](img/method-reap.v3.png)\n\n🍯 The final prompt is a honeypot: \n- The suffix forces the reference LLM to output the target number 95-100% of the time\n- The suffix is specific to the reference LLM (\u003c1% average transfer rate to another LLM)\n- TRAP beats the perplexity baseline \n  - Using fewer output tokens (3-18 tokens vs. 
150 tokens)\n  - Perplexity identification is sensitive to the type of prompt\n\n\n\n🛡️ A third party can deploy the ![reference LLM](img/badge_ref_llm.svg) with changes\n- TRAP is robust to generation hyperparameters (usual ranges)\n- TRAP is not robust to some system prompts\n\n![Robustness plot](img/plot_robustness.v3.png)\n\nRead [the full paper](https://arxiv.org/abs/2402.12991) for more details.\n\n## Citation\n\nIf you use our code or our method, kindly consider citing our paper:\n```bibtex\n@inproceedings{gubri2024trap,\n    title = \"{TRAP}: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification\",\n    author = \"Gubri, Martin  and\n      Ulmer, Dennis  and\n      Lee, Hwaran  and\n      Yun, Sangdoo  and\n      Oh, Seong Joon\",\n    editor = \"Ku, Lun-Wei  and\n      Martins, Andre  and\n      Srikumar, Vivek\",\n    booktitle = \"Findings of the Association for Computational Linguistics ACL 2024\",\n    month = aug,\n    year = \"2024\",\n    address = \"Bangkok, Thailand and virtual meeting\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2024.findings-acl.683\",\n    doi = \"10.18653/v1/2024.findings-acl.683\",\n    pages = \"11496--11517\",\n    abstract = \"Large Language Model (LLM) services and models often come with legal rules on *who* can use them and *how* they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. 
We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95{\\%} true positive rate at under 0.2{\\%} false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.\",\n}\n```\n\n\n## Installation\n\n\n### Dependencies \n\nThe `requirements.txt` file corresponds to CUDA version 12.2 and Python 3.8. \n\n```shell\npip install -r requirements.txt\npip install -e llm_attacks\n```\n\nIf you use another CUDA version, you might need to adapt the requirements, but keep the specified fschat version `pip install fschat==0.2.23`. \n\n### Download models \n\nSet the `HUGGINGFACE_HUB_CACHE` env variable to your desired folder. Adapt the path in all the code accordingly.\n\n```shell\necho \"export HUGGINGFACE_HUB_CACHE='/mnt/hdd-nfs/mgubri/models_hf/'\" \u003e\u003e ~/.bashrc\n\n# login to HF\nhuggingface-cli login\n\n# test HF installation\npython -c \"from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))\"\n```\n\nDownload models from HuggingFace using python:\n\n```python\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nMODELS_NAMES = [\n    \"meta-llama/Llama-2-7b-chat-hf\", \"meta-llama/Llama-2-13b-chat-hf\",\n    \"lmsys/vicuna-7b-v1.3\", \"lmsys/vicuna-13b-v1.3\", \n    \"TheBloke/guanaco-7B-HF\", \"TheBloke/guanaco-13B-HF\"\n]\nfor model_name in MODELS_NAMES:\n    tokenizer = AutoTokenizer.from_pretrained(model_name)\n    model = AutoModelForCausalLM.from_pretrained(model_name)\n```\n\nAdapt all the paths of the models in the configuration files in `detect_llm/configs`.\n\n\n### Download data\n\nDownload `valid.wp_source` from [Kaggle](https://www.kaggle.com/datasets/ratthachat/writing-prompts), and place it in `detect_llm/data/datasets/writing`\n\n\n## Experiments\n\nAll the 
following commands are executed from the `detect_llm` folder.\n\n```shell\ncd detect_llm\n```\n\n### Generate prompt and goal strings\n\n```shell\npython generate_csv.py --n-goals 100 --method random --string-type number --string-length 5 --seed 43  # independent seed to report results \n#python generate_csv.py --n-goals 100 --method random --string-type number --string-length 5 --seed 42  # seed used to debug, and change HPs and XP settings\npython generate_csv.py --n-goals 100 --method random --string-type number --string-length 4 --seed 41\npython generate_csv.py --n-goals 100 --method random --string-type number --string-length 3 --seed 40\n```\n\n### Generate CSV of filtered tokens\n\nSee the notebook `notebooks/tokenizer_numbers.ipynb`\n\n```shell\ncd data/filter_tokens\nln -s filter_token_number_vicuna.csv filter_token_number_vicuna_guanaco.csv\ncd ../..\n```\n\n### Optimize the suffixes\n\nOptimize 100 suffixes for each of the Llama-2-7B-chat, Guanaco-7B, and Vicuna-7B models, and for the ensemble of Guanaco-7B and Vicuna-7B. \nWe use V100 GPUs to run all the experiments. You will need 32 GB of VRAM to optimize the suffixes for 7B models. \n\n```shell\nSTR_LENGTH=4 #  3  4  5\nSEED=41      # 40 41 43\nMODEL='llama2' # 'vicuna' 'guanaco' 'vicuna_guanaco'\nN_TRAIN_DATA=10\nSTRING='number'\nMETHOD='random'\nN_STEPS=1500\n\nfor DATA_OFFSET in 0 10 20 30 40 50 60 70 80 90 ; do\n  sh scripts/run_gcg_individual.sh $MODEL $STRING $METHOD ${STR_LENGTH} ${DATA_OFFSET} ${SEED} ${N_TRAIN_DATA} ${N_STEPS}\ndone\n```\n\n### Compute true positive and false positive rates on open models\n\nWe compute the true positive rate, i.e., the probability that the reference model retrieves the targeted answer, and the false positive rate, i.e., the probability that another model provides the targeted answer. 
We generate 10 answers for each suffix and compute the overall average.\n\n\nThe following script computes the transferability table, i.e., the retrieval rate of each set of suffixes on each target model:\n\n```shell\nstr_length=4 # 3 4 5\nEXPORT_PATH=\"/mnt/hdd-nfs/mgubri/adv-suffixes/detect_llm/results/method_random/type_number/str_length_${str_length}/transferability/retrieval_rate_table.csv\"\nSUFFIX_MODELS=(\n  \"vicuna\" \n  \"guanaco\" \n  \"llama2\" \n  \"vicuna_guanaco\"\n)\nTARGET_MODELS=(\n  \"vicuna vicuna-7B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-7b-v1.3/snapshots/236eeeab96f0dc2e463f2bebb7bb49809279c6d6/\"\n  \"vicuna vicuna-13B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-13b-v1.3/snapshots/6566e9cb1787585d1147dcf4f9bc48f29e1328d2/\"\n  \"llama-2 llama2-7B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235/\"\n  \"llama-2 llama2-13B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-13b-chat-hf/snapshots/c2f3ec81aac798ae26dcc57799a994dfbf521496/\"\n  \"guanaco guanaco-7B /mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-7B-HF/snapshots/293c24105fa15afa127a2ec3905fdc2a0a3a6dac/\"\n  \"guanaco guanaco-13B /mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-13B-HF/snapshots/bd59c700815124df616a17f5b49a0bc51590b231/\"\n)\nfor suffix_model in \"${SUFFIX_MODELS[@]}\"; do\n    SUFFIX_PATH=\"/mnt/hdd-nfs/mgubri/adv-suffixes/detect_llm/results/method_random/type_number/str_length_${str_length}/model_${suffix_model}\" \n    for target_model in \"${TARGET_MODELS[@]}\"; do\n      IFS=' ' read -r target_name target_version target_path \u003c\u003c\u003c \"$target_model\"\n      echo \"**** FROM $suffix_model TO $target_version ****\"\n      python -u compute_results.py --path-suffixes ${SUFFIX_PATH} --model-name $target_name --model-version $target_version --model-path $target_path --export-csv ${EXPORT_PATH} --verbose 1 \n    done\ndone\n```\n\n### Compute false positive rate on closed models \n\nWe also generate 10 answers per model and per suffix. 
We use the same generation hyperparameters as in the previous section.\n\n```shell\nstr_length=4 # 3 4 5\nN=10\nfor MODEL in 'llama2' 'vicuna' 'guanaco' 'vicuna_guanaco' ; do\n    PATH_SUFFIXES=\"results/method_random/type_number/str_length_${str_length}/model_${MODEL}\"\n    # openai\n    python get_answer_api.py --path-suffixes ${PATH_SUFFIXES} --n-gen 10 --model-name 'gpt-3.5-turbo-0613' --api-name 'openai' --gen-config-override \"{'temperature': [0.6], 'top_p': [0.9]}\"\n    python get_answer_api.py --path-suffixes ${PATH_SUFFIXES} --n-gen 10 --model-name 'gpt-4-1106-preview' --api-name 'openai' --gen-config-override \"{'temperature': [0.6], 'top_p': [0.9]}\"\n    # claude\n    python get_answer_api.py --path-suffixes ${PATH_SUFFIXES} --n-gen 10 --model-name 'claude-2.1' --api-name 'anthropic' --gen-config-override \"{'temperature': [0.6], 'top_p': [0.9]}\"\n    python get_answer_api.py --path-suffixes ${PATH_SUFFIXES} --n-gen 10 --model-name 'claude-instant-1.2' --api-name 'anthropic' --gen-config-override \"{'temperature': [0.6], 'top_p': [0.9]}\" \ndone\n```\n\n### Robustness\n\nWe compute the robustness of the true positive rate with respect to changes to the reference model.\n\n```shell\nPATH_LLAMA='/mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235/'\nPATH_SUFFIXES=\"/mnt/hdd-nfs/mgubri/adv-suffixes/detect_llm/results/method_random/type_number/str_length_4/model_llama2\"\n```\n\n#### Generation hyperparameters\n\nTemperature\n```shell\nfor temp in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 ; do\n    echo \"*** Temperature: ${temp} ***\"\n    NEW_GEN_CONF=\"{'temperature': ${temp}}\"\n    python compute_results.py --path-suffixes ${PATH_SUFFIXES} --model-name llama-2 --model-path ${PATH_LLAMA} --verbose 2 --gen-config-override \"${NEW_GEN_CONF}\"\ndone\n```\n\nTop-p\n\n```shell\nfor top_p in 1.0 0.9901107197234477 0.979243460803013 0.9673015081895581 
0.9541785824116654 0.939757893723113 0.9239111027123406 0.9064971781236734 0.8873611417252854 0.8663326890536376 0.8432246737594684 0.8178314420665103 0.789927002520161 0.7592630147374724 0.7255665792589857 0.6885378088328238 0.6478471595162576 0.6031324978424272 0.5539958779509593 0.5 ; do\n    echo \"*** Top-p: ${top_p} ***\"\n    NEW_GEN_CONF=\"{'top_p': ${top_p}}\"\n    python compute_results.py --path-suffixes ${PATH_SUFFIXES} --model-name llama-2 --model-path ${PATH_LLAMA} --verbose 2 --gen-config-override \"${NEW_GEN_CONF}\"\ndone\n```\n\nThe top-p values are spaced on a log scale and were generated with:\n```python\nimport numpy as np\n1.1-np.logspace(np.log10(0.1), np.log10(0.6), 20)\n' '.join([str(x) for x in (1.1-np.logspace(np.log10(0.1), np.log10(0.6), 20)).tolist()])\n```\n\n#### System prompt\n\n```shell\npython compute_results.py --path-suffixes ${PATH_SUFFIXES} --model-name llama-2 --model-path ${PATH_LLAMA} --system-prompt all \n```\n\n\n### Baselines\n\n#### 1. Sample answers\n\nSample 10k answers without suffixes for every open model.\n```shell\nSEED=70\nTARGET_MODELS=(\n  \"llama-2 llama2-7B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235/\"\n  \"llama-2 llama2-13B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-13b-chat-hf/snapshots/c2f3ec81aac798ae26dcc57799a994dfbf521496/\"\n  \"vicuna vicuna-7B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-7b-v1.3/snapshots/236eeeab96f0dc2e463f2bebb7bb49809279c6d6/\"\n  \"vicuna vicuna-13B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-13b-v1.3/snapshots/6566e9cb1787585d1147dcf4f9bc48f29e1328d2/\"\n  \"guanaco guanaco-7B /mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-7B-HF/snapshots/293c24105fa15afa127a2ec3905fdc2a0a3a6dac/\"\n  \"guanaco guanaco-13B /mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-13B-HF/snapshots/bd59c700815124df616a17f5b49a0bc51590b231/\"\n)\n\nfor target_model in \"${TARGET_MODELS[@]}\"; do\n  IFS=' ' 
read -r target_name target_version target_path \u003c\u003c\u003c \"$target_model\"\n  echo \"***** MODEL $target_version *****\"\n  for temp in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 ; do\n    echo \"** Temperature: ${temp} **\"\n    NEW_GEN_CONF=\"{'temperature': ${temp}, 'top_p':1.0}\"\n    python -u compute_results_baseline.py --n-gen 10000 --n-digits 4 --model-name $target_name --model-version $target_version --model-path $target_path --verbose 2 --export-base-folder /mnt/hdd-nfs/mgubri/adv-suffixes/detect_llm/ --export-sub-folder 'xp_temperature' --gen-config-override \"${NEW_GEN_CONF}\" --seed $SEED\n  done\ndone\n```\n\nSample from the OpenAI API.\n```shell\nOPENAI_MODELS=(\n  \"gpt-3.5-turbo-0613\"\n  \"gpt-4-1106-preview\"\n)\n\nfor model in \"${OPENAI_MODELS[@]}\"; do\n  echo \"**** MODEL $model ****\"\n  python -u compute_results_baseline_api.py --api 'openai' --model-name $model --n-gen 10000 --n-digits 4 --system-prompt 'openai' --verbose 2 --export-base-folder . 
\ndone\n```\n\nSample open models with different system prompts.\n```shell\nSEED=70\nTARGET_MODELS=(\n  \"llama-2 llama2-7B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235/\"\n  \"llama-2 llama2-13B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-13b-chat-hf/snapshots/c2f3ec81aac798ae26dcc57799a994dfbf521496/\"\n  \"vicuna vicuna-7B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-7b-v1.3/snapshots/236eeeab96f0dc2e463f2bebb7bb49809279c6d6/\"\n  \"vicuna vicuna-13B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-13b-v1.3/snapshots/6566e9cb1787585d1147dcf4f9bc48f29e1328d2/\"\n  \"guanaco guanaco-7B /mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-7B-HF/snapshots/293c24105fa15afa127a2ec3905fdc2a0a3a6dac/\"\n  \"guanaco guanaco-13B /mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-13B-HF/snapshots/bd59c700815124df616a17f5b49a0bc51590b231/\"\n)\n\nfor target_model in \"${TARGET_MODELS[@]}\"; do\n    IFS=' ' read -r target_name target_version target_path \u003c\u003c\u003c \"$target_model\"\n    echo \"***** MODEL $target_version *****\"\n    for scenario in 'llama-2' 'openai' 'fastchat' 'SHAKESPEARE_WRITING_ASSISTANT' 'IRS_TAX_CHATBOT' 'MARKETING_WRITING_ASSISTANT' 'XBOX_CUSTOMER_SUPPORT_AGENT' 'HIKING_RECOMMENDATION_CHATBOT' 'JSON_FORMATTER_ASSISTANT' ; do\n        echo \"** Scenario system prompt: ${scenario} **\"\n        temp='1.0'\n        NEW_GEN_CONF=\"{'temperature': ${temp}, 'top_p':1.0}\"\n        python -u compute_results_baseline.py --n-gen 10000 --n-digits 4 --model-name $target_name --model-version $target_version --model-path $target_path --verbose 2 --export-base-folder /mnt/hdd-nfs/mgubri/adv-suffixes/detect_llm/ --export-sub-folder 'xp_system_prompt' --gen-config-override \"${NEW_GEN_CONF}\" --seed $SEED --system-prompt \"${scenario}\"\n    done\ndone\n```\n\n\n#### 2. 
Perplexity\n\nFirst, we generate completions from 10 models using the same prompts across three datasets, with 1000 prompts for each dataset. Each prompt dataset has a different style.\n\nClosed models\n```shell\nfor DATASET in 'writing' 'pubmed' 'wiki' ; do\n    echo \"===== Prompts $DATASET =====\"\n    # openai models\n    python baseline_ppl.py gen --dataset=$DATASET --n-prompts=1000 --seed=0 --api openai --model-name gpt-3.5-turbo-0613\n    python baseline_ppl.py gen --dataset=$DATASET --n-prompts=1000 --seed=0 --api openai --model-name gpt-4-1106-preview\n    # anthropic\n    python baseline_ppl.py gen --dataset=$DATASET --n-prompts=1000 --seed=0 --api anthropic --model-name claude-instant-1.2\n    python baseline_ppl.py gen --dataset=$DATASET --n-prompts=1000 --seed=0 --api anthropic --model-name claude-2.1\ndone\n```\n\nOpen models\n```shell\n# launch with env variables in scripts/hyperparameters/baseline_ppl_gen.csv\necho \"***** MODEL ${model_version} *****\"\npython baseline_ppl.py gen --dataset=$DATASET --n-prompts=1000 --seed=0 --model-path \"${model_path}\" --model-name \"${model_version}\" --export-base-folder '/mnt/hdd-nfs/mgubri/adv-suffixes/detect_llm'\n```\n\nSecond, we compute the perplexity of the previously generated texts on the three reference models:\n```shell\n# open models\nGEN_MODELS=(\n  \"llama2-7B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235/\"\n  \"llama2-13B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-13b-chat-hf/snapshots/c2f3ec81aac798ae26dcc57799a994dfbf521496/\"\n  \"vicuna-7B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-7b-v1.3/snapshots/236eeeab96f0dc2e463f2bebb7bb49809279c6d6/\"\n  \"vicuna-13B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-13b-v1.3/snapshots/6566e9cb1787585d1147dcf4f9bc48f29e1328d2/\"\n  \"guanaco-7B 
/mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-7B-HF/snapshots/293c24105fa15afa127a2ec3905fdc2a0a3a6dac/\"\n  \"guanaco-13B /mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-13B-HF/snapshots/bd59c700815124df616a17f5b49a0bc51590b231/\"\n)\n# close models\nGEN_MODELS=(\"gpt-3.5-turbo-0613\" \"gpt-4-1106-preview\" \"claude-instant-1.2\" \"claude-2.1\")\n\nEVAL_MODELS=(\n  \"llama2-7B /mnt/hdd-nfs/mgubri/models_hf/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235/\"\n  \"vicuna-7B /mnt/hdd-nfs/mgubri/models_hf/models--lmsys--vicuna-7b-v1.3/snapshots/236eeeab96f0dc2e463f2bebb7bb49809279c6d6/\"\n  \"guanaco-7B /mnt/hdd-nfs/mgubri/models_hf/models--TheBloke--guanaco-7B-HF/snapshots/293c24105fa15afa127a2ec3905fdc2a0a3a6dac/\"\n)\nDATASETS=('writing' 'pubmed' 'wiki')\nfor dataset in \"${DATASETS[@]}\" ; do\n  echo \"======= DATASET ${dataset} ========\"\n  for gen_model in \"${GEN_MODELS[@]}\"; do\n    IFS=' ' read -r gen_model_version gen_model_path \u003c\u003c\u003c \"$gen_model\"\n    echo \"**** GEN model ${gen_model_version} ****\"\n    PATH_GEN=\"/mnt/hdd-nfs/mgubri/adv-suffixes/detect_llm/results/baseline/ppl/dataset_${dataset}/gen_model_${gen_model_version}/gen_texts_n1000_system_prompt_original_temperature_0.6_top_p_0.9_seed0.csv\"\n    for eval_model in \"${EVAL_MODELS[@]}\"; do\n      IFS=' ' read -r eval_model_version eval_model_path \u003c\u003c\u003c \"$eval_model\"\n      echo \"** EVAL model ${eval_model_version} **\"\n      python baseline_ppl.py eval --dataset=\"${dataset}\" --seed=0 --model-path \"${eval_model_path}\" --model-name \"${eval_model_version}\" --gen-csv \"${PATH_GEN}\"\n    done\n  done\ndone\n```\n\n### Analysis\n\n`notebooks/analyse_results.ipynb` contains Python code to parse the results of the optimization of suffixes\n\n## Reproducibility\n\nTo ease future research, we release the CSV containing our optimized suffixes in 
`results/method_random/type_number/str_length_{str_length}/model_{model}/suffixes.csv`.\n\n\n## Credits\n\nThe code is under MIT licence.\n\n- The code in `llm_attacks` is derived from [the source code](https://github.com/llm-attacks/llm-attacks) of the paper \"[Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043)\" by [Andy Zou](https://andyzoujm.github.io/), [Zifan Wang](https://sites.google.com/west.cmu.edu/zifan-wang/home), [Nicholas Carlini](https://nicholas.carlini.com/), [Milad Nasr](https://people.cs.umass.edu/~milad/), [J. Zico Kolter](https://zicokolter.com/), and [Matt Fredrikson](https://www.cs.cmu.edu/~mfredrik/). \n- The writing preprocessing code of `load_writing()` in `baseline_ppl.py` is derived from \"[DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature](https://github.com/eric-mitchell/detect-gpt/blob/main/custom_datasets.py)\".\n- The code in `detect_llm` was partially developed with GPT-4 as a coding assistant.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparameterlab%2Ftrap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparameterlab%2Ftrap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparameterlab%2Ftrap/lists"}