https://github.com/haizelabs/dspy-redteam

Red-Teaming Language Models with DSPy

Host: GitHub
URL: https://github.com/haizelabs/dspy-redteam
Owner: haizelabs
Created: 2024-03-24T05:02:56.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2025-02-13T09:15:56.000Z (4 months ago)
Last Synced: 2025-05-06T03:01:57.149Z (26 days ago)
Language: Python
Homepage: https://blog.haizelabs.com/posts/dspy/
Size: 257 KB
Stars: 188
Watchers: 5
Forks: 22
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-dspy - DSPy Red Team - Red teaming / Finding a Prompt attack for an LLM using DSPy. [Article](https://blog.haizelabs.com/posts/dspy/) ([DSPy](https://github.com/stanfordnlp/dspy) - A library for compiling declarative language model calls into self-improving pipelines. / Projects)

README

# Red-Teaming Language Models with DSPy

We use the power of [DSPy](https://github.com/stanfordnlp/dspy), a framework for structuring and optimizing language model programs, to red-team language models.

To our knowledge, this is the first attempt at using any auto-prompting *framework* to perform the red-teaming task. It is also probably the deepest publicly available architecture optimized with DSPy to date.

We accomplish this using a *deep* language program with several layers of alternating `Attack` and `Refine` modules in the following optimization loop:

Figure 1: Overview of DSPy for red-teaming. The DSPy MIPRO optimizer, guided by an LLM-as-a-judge, compiles our language program into an effective red-teamer against Vicuna.
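
To make the layered structure concrete, here is a minimal sketch in plain Python of how alternating `Attack` and `Refine` steps could be chained. It deliberately avoids the DSPy API: `LayeredAttackProgram`, the callable types, and the `toy_*` functions are hypothetical stand-ins rather than code from this repository, and the choice to end with a final attack-only layer is an assumption.

```python
from typing import Callable, List

# Hypothetical callable types standing in for LLM-backed modules.
AttackFn = Callable[[str, str], str]       # (intent, critique) -> attack prompt
RefineFn = Callable[[str, str, str], str]  # (intent, attack prompt, target response) -> critique
TargetFn = Callable[[str], str]            # attack prompt -> target response


class LayeredAttackProgram:
    """Chains Attack and Refine steps; each layer has its own callable."""

    def __init__(self, attacks: List[AttackFn], refines: List[RefineFn], target: TargetFn):
        # Assumption: the final layer only attacks, so there is one fewer Refine step.
        if len(attacks) != len(refines) + 1:
            raise ValueError("expected exactly one more attack layer than refine layers")
        self.attacks = attacks
        self.refines = refines
        self.target = target

    def forward(self, intent: str) -> str:
        critique = ""  # the first layer has no feedback yet
        for attack, refine in zip(self.attacks, self.refines):
            prompt = attack(intent, critique)            # propose an attack
            response = self.target(prompt)               # query the target model
            critique = refine(intent, prompt, response)  # critique the attempt
        # The last layer turns the final critique into the attack that is returned.
        return self.attacks[-1](intent, critique)


# Toy stand-ins so the sketch runs without any model access.
def toy_attack(intent: str, critique: str) -> str:
    suffix = f" (revised after: {critique})" if critique else ""
    return f"Attempt at '{intent}'{suffix}"


def toy_refine(intent: str, prompt: str, response: str) -> str:
    return f"target said '{response}'"


def toy_target(prompt: str) -> str:
    return "refused"


# Five attack layers, matching the "5 Layer" architecture named in this README.
program = LayeredAttackProgram([toy_attack] * 5, [toy_refine] * 4, toy_target)
final_prompt = program.forward("placeholder intent")
print(final_prompt)
```

In a DSPy version, each callable would presumably be its own module with its own prompt, which is what gives the optimizer something to tune at every layer.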

The following table demonstrates the effectiveness of the chosen architecture, as well as the benefit of DSPy compilation:

| **Architecture** | **ASR** |
|:------------:|:----------:|
| None (Raw Input) | 10% |
| Architecture (5 Layer) | 26% |
| Architecture (5 Layer) + Optimization | 44% |

Table 1: ASR with raw harmful inputs, un-optimized architecture, and architecture post DSPy compilation.

With *no specific prompt engineering*, we achieve an Attack Success Rate (ASR) of 44%, more than 4x the 10% raw-input baseline. This is by no means state of the art, but considering that we spent essentially no effort designing the architecture and prompts, and that we used an off-the-shelf optimizer with almost no hyperparameter tuning (except to fit compute constraints), we think this result is pretty exciting!
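
As a rough illustration of the metric, the sketch below computes an attack success rate from per-example judge scores and reproduces the ratios implied by Table 1. The 0-to-1 score scale, the `threshold` value, and the function name `attack_success_rate` are assumptions made for illustration; this README does not specify how the judge's output is turned into a success decision.

```python
from typing import Sequence


def attack_success_rate(judge_scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of attacks whose judge score meets the (assumed) success threshold."""
    if not judge_scores:
        raise ValueError("need at least one judged attack")
    successes = sum(1 for score in judge_scores if score >= threshold)
    return successes / len(judge_scores)


# Hypothetical judge scores for four attacks; two clear the threshold.
example_asr = attack_success_rate([0.9, 0.1, 0.7, 0.2])

# ASR values from Table 1, in percent.
table_1 = {
    "None (Raw Input)": 10,
    "Architecture (5 Layer)": 26,
    "Architecture (5 Layer) + Optimization": 44,
}
baseline = table_1["None (Raw Input)"]
improvement = {name: asr / baseline for name, asr in table_1.items()}
print(example_asr, improvement)
```

Read this way, the un-optimized architecture accounts for a 2.6x gain over raw inputs, and compilation brings the total to 4.4x.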

Full exposition on the [Haize Labs blog](https://blog.haizelabs.com/posts/dspy/).
