# CKA-Agent: Bypassing LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search


[arXiv](https://arxiv.org/abs/2512.01353) · Website · [GitHub code](https://github.com/graph-com/cka-agent) · [Cite](#cite) · Python

## 🔥 Latest Results on Frontier Models (Dec 2025)

CKA-Agent achieves consistently high attack success rates against the latest frontier models, including **GPT-5.2**, **Gemini-3.0-Pro**, and **Claude-Haiku-4.5**. The results are summarized below:



| Model | HarmBench FS ↑ | HarmBench PS ↑ | HarmBench V ↓ | HarmBench R ↓ | StrongREJECT FS ↑ | StrongREJECT PS ↑ | StrongREJECT V ↓ | StrongREJECT R ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🟢 GPT-5.2 | 0.889 | 0.079 | 0.024 | 0.008 | 0.932 | 0.056 | 0.006 | 0.006 |
| 🟣 Gemini-3.0-Pro | 0.881 | 0.087 | 0.000 | 0.032 | 0.951 | 0.037 | 0.006 | 0.006 |
| 🟠 Claude-Haiku-4.5 | 0.960 | 0.024 | 0.008 | 0.008 | 0.969 | 0.025 | 0.006 | 0.000 |

> **Metrics:** FS = Full Success, PS = Partial Success, V = Vacuous, R = Refusal. Results collected in December 2025.

## Overview
This repository contains the official implementation of **CKA-Agent**, a novel approach to bypassing the guardrails of commercial large language models (LLMs) through **harmless prompt weaving** and **adaptive tree search** techniques.

![CKA-Agent](./assets/comparsion.png)
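
To make the two components above concrete, here is a minimal, purely conceptual sketch of a best-first tree search over "woven" harmless prompts. It is **not** the repository's implementation; `weave_harmless_prompts`, `query_target`, and `score` are hypothetical placeholders for the actual prompt-weaving, target-querying, and judge components described in the paper.

```python
# Conceptual sketch only -- NOT the repository's implementation.
# All helpers below are hypothetical placeholders.
import heapq


def weave_harmless_prompts(goal: str) -> list[str]:
    """Hypothetical: decompose a goal into innocuous-looking sub-prompts."""
    return [f"(harmless sub-question {i} about: {goal})" for i in range(3)]


def query_target(prompts: list[str]) -> str:
    """Hypothetical: send the woven conversation to the target model."""
    return "model response"


def score(response: str) -> float:
    """Hypothetical judge score in [0, 1]; higher means closer to full success."""
    return 0.0


def adaptive_tree_search(goal: str, budget: int = 20, threshold: float = 0.9):
    # Root node: the initial woven prompt sequence.
    root = weave_harmless_prompts(goal)
    # Best-first frontier keyed by judge score (negated for heapq's min-heap).
    frontier = [(-score(query_target(root)), root)]
    for _ in range(budget):
        neg_s, prompts = heapq.heappop(frontier)
        if -neg_s >= threshold:
            return prompts  # judged a full success
        # Adaptively expand the most promising branch with further sub-prompts.
        for child in (prompts + [p] for p in weave_harmless_prompts(goal)):
            heapq.heappush(frontier, (-score(query_target(child)), child))
    return None  # budget exhausted
```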

## Environment Setup
Install `uv`:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Create a virtual environment and install the dependencies:
```bash
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
uv pip install accelerate fastchat nltk pandas google-genai httpx[socks] anthropic
```
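
Before running experiments, it can help to sanity-check the environment. A minimal check (assuming a CUDA-capable machine, since vLLM is installed above):

```python
# Optional sanity check: confirm the core dependencies import cleanly
# and that a CUDA device is visible for vLLM.
import nltk    # noqa: F401
import pandas  # noqa: F401
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```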

## Experiment Configuration

Configure your experiments by modifying the `config/config.yml` file. You can control the following aspects:

1. **Test Dataset**: Choose from available datasets like `harmbench_cka` or `strongreject_cka`.
2. **Target Models**: Select black-box or white-box models such as `gpt-oss-120b` or `gemini-2.5-xxx`.
3. **Jailbreak Methods**: Enable and configure various implemented baseline methods.
4. **Evaluations**: Define evaluation metrics and judge models like `gemini-2.5-flash`.
5. **Defense Methods**: Apply different defense mechanisms as needed.

For detailed configuration instructions and examples, please refer to the [configuration README](config/README.md).
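
For orientation, a configuration of this shape can be loaded and inspected with a few lines of Python. This is only an illustration: the section names used below are hypothetical placeholders, the real keys are documented in [config/README.md](config/README.md), and it assumes PyYAML is available (vLLM normally pulls it in).

```python
# Illustrative only -- the key names below are hypothetical placeholders;
# the actual schema is documented in config/README.md.
import yaml

with open("config/config.yml") as f:
    cfg = yaml.safe_load(f)

# The five aspects listed above would typically map to top-level sections:
for section in ("dataset", "target_models", "jailbreak_methods",
                "evaluation", "defense"):
    print(section, "->", cfg.get(section))
```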

### Running Experiments

The `run_experiment.sh` script executes `main.py` to run the entire experiment pipeline (jailbreak and evaluation) by default.

```bash
./run_experiment.sh
```

You can modify the `run_experiment.sh` script or directly pass arguments to `main.py` to run specific phases:

- `full`: Runs the entire pipeline (default).
- `jailbreak`: Runs only the jailbreak methods.
- `judge`: Runs only the evaluation on existing results.
- `resume`: Resumes an interrupted experiment.

**Example (running only the jailbreak phase):**
```bash
python main.py --phase jailbreak
```

## Cite
If you find this repository useful for your research, please consider citing the following paper:

```bibtex
@misc{wei2025trojan,
      title={The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search},
      author={Rongzhe Wei and Peizhi Niu and Xinjie Shen and Tony Tu and Yifan Li and Ruihan Wu and Eli Chien and Pin-Yu Chen and Olgica Milenkovic and Pan Li},
      year={2025},
      eprint={2512.01353},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2512.01353},
}
```