https://github.com/graph-com/cka-agent
Official Implementation of the CKA-Agent, "The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search".
- Host: GitHub
- URL: https://github.com/graph-com/cka-agent
- Owner: Graph-COM
- License: agpl-3.0
- Created: 2025-11-26T15:05:43.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-12-22T03:21:58.000Z (20 days ago)
- Last Synced: 2025-12-22T18:33:56.718Z (20 days ago)
- Topics: jailbreak, llms, red-teaming, safety
- Language: Python
- Homepage: https://cka-agent.github.io/
- Size: 349 KB
- Stars: 126
- Watchers: 1
- Forks: 31
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# CKA-Agent: Bypassing LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search
## 🔥 Latest Results on Frontier Models (Dec 2025)
CKA-Agent achieves consistently high attack success rates against the latest frontier models, including **GPT-5.2**, **Gemini-3.0-Pro**, and **Claude-Haiku-4.5**. The results on HarmBench and StrongREJECT are summarized below:
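| Model | HarmBench FS ↑ | HarmBench PS ↑ | HarmBench V ↓ | HarmBench R ↓ | StrongREJECT FS ↑ | StrongREJECT PS ↑ | StrongREJECT V ↓ | StrongREJECT R ↓ |
|---|---|---|---|---|---|---|---|---|
| 🟢 GPT-5.2 | 0.889 | 0.079 | 0.024 | 0.008 | 0.932 | 0.056 | 0.006 | 0.006 |
| 🟣 Gemini-3.0-Pro | 0.881 | 0.087 | 0.000 | 0.032 | 0.951 | 0.037 | 0.006 | 0.006 |
| 🟠 Claude-Haiku-4.5 | 0.960 | 0.024 | 0.008 | 0.008 | 0.969 | 0.025 | 0.006 | 0.000 |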
> **Metrics:** FS = Full Success, PS = Partial Success, V = Vacuous, R = Refusal. Results collected in December 2025.
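As a quick sanity check on the table, the sketch below copies the reported rates and verifies that the four categories in each row sum to 1. The combined `fs_plus_ps` figure is an illustrative quantity computed here, not a metric the README reports, and the names `RESULTS` and `summarize` are hypothetical.

```python
# Rates copied from the results table, ordered as (FS, PS, V, R).
RESULTS = {
    "GPT-5.2": {
        "HarmBench": (0.889, 0.079, 0.024, 0.008),
        "StrongREJECT": (0.932, 0.056, 0.006, 0.006),
    },
    "Gemini-3.0-Pro": {
        "HarmBench": (0.881, 0.087, 0.000, 0.032),
        "StrongREJECT": (0.951, 0.037, 0.006, 0.006),
    },
    "Claude-Haiku-4.5": {
        "HarmBench": (0.960, 0.024, 0.008, 0.008),
        "StrongREJECT": (0.969, 0.025, 0.006, 0.000),
    },
}


def summarize(rates):
    """Return the row total and the combined FS + PS rate (illustrative only)."""
    fs, ps, v, r = rates
    return {
        "total": round(fs + ps + v + r, 3),
        "fs_plus_ps": round(fs + ps, 3),
    }


for model, benchmarks in RESULTS.items():
    for benchmark, rates in benchmarks.items():
        summary = summarize(rates)
        print(f"{model:18s} {benchmark:13s} "
              f"total={summary['total']:.3f} FS+PS={summary['fs_plus_ps']:.3f}")
```

Every row totals 1.000, which suggests the four categories partition the judged responses.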
## Overview
This repository contains the official implementation of **CKA-Agent**, a novel approach to bypassing the guardrails of commercial large language models (LLMs) that combines **harmless prompt weaving** with **adaptive tree search**.

## Environment Setup
Install `uv`:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Create a virtual environment and install the dependencies:
```bash
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
uv pip install accelerate fastchat nltk pandas google-genai "httpx[socks]" anthropic
```
## Experiment Configuration
Configure your experiments by modifying the `config/config.yml` file. You can control the following aspects:
1. **Test Dataset**: Choose from available datasets like `harmbench_cka` or `strongreject_cka`.
2. **Target Models**: Select black-box or white-box models such as `gpt-oss-120b` or `gemini-2.5-xxx`.
3. **Jailbreak Methods**: Enable and configure various implemented baseline methods.
4. **Evaluations**: Define evaluation metrics and judge models like `gemini-2.5-flash`.
5. **Defense Methods**: Apply different defense mechanisms as needed.
For detailed configuration instructions and examples, please refer to the [configuration README](config/README.md).
## Running Experiments
The `run_experiment.sh` script executes `main.py` to run the entire experiment pipeline (jailbreak and evaluation) by default.
```bash
./run_experiment.sh
```
To run a specific phase, modify `run_experiment.sh` or pass the `--phase` argument directly to `main.py`. The available phases are:
- `full`: Runs the entire pipeline (default).
- `jailbreak`: Runs only the jailbreak methods.
- `judge`: Runs only the evaluation on existing results.
- `resume`: Resumes an interrupted experiment.
**Example (running only the jailbreak phase):**
```bash
python main.py --phase jailbreak
```
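For readers unfamiliar with this style of phase selection, here is a minimal sketch of how a `--phase` option with these four values could be parsed using `argparse`. It is an assumption about the general pattern, not the repository's actual `main.py`; `build_parser`, `PHASE_DESCRIPTIONS`, and the description strings are hypothetical, and only the flag name, its four values, and the `full` default come from the text above.

```python
import argparse

# Phase names and the default come from the README; the descriptions
# paraphrase its bullet list.
PHASE_DESCRIPTIONS = {
    "full": "run the entire pipeline",
    "jailbreak": "run only the jailbreak methods",
    "judge": "run only the evaluation on existing results",
    "resume": "resume an interrupted experiment",
}


def build_parser():
    """Build a parser that accepts one of the four documented phases."""
    parser = argparse.ArgumentParser(description="Experiment runner (sketch)")
    parser.add_argument(
        "--phase",
        choices=sorted(PHASE_DESCRIPTIONS),
        default="full",
        help="which part of the pipeline to run (default: full)",
    )
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--phase", "jailbreak"])
print(f"Selected phase: {args.phase} ({PHASE_DESCRIPTIONS[args.phase]})")
```

Restricting the option with `choices` makes `argparse` reject a misspelled phase with a usage error instead of silently falling back to the default.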
## Cite
If you find this repository useful for your research, please consider citing the following paper:
```bibtex
@misc{wei2025trojan,
  title={The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search},
  author={Rongzhe Wei and Peizhi Niu and Xinjie Shen and Tony Tu and Yifan Li and Ruihan Wu and Eli Chien and Pin-Yu Chen and Olgica Milenkovic and Pan Li},
  year={2025},
  eprint={2512.01353},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2512.01353},
}
```