https://github.com/zjunlp/steer-target-atoms

[ACL 2025] Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

Host: GitHub
URL: https://github.com/zjunlp/steer-target-atoms
Owner: zjunlp
License: mit
Created: 2025-05-23T10:29:01.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-04T14:10:21.000Z (about 1 year ago)
Last Synced: 2025-06-04T15:52:03.365Z (about 1 year ago)
Topics: acl2025, artificial-intelligence, controlled-generation, easyedit2, knowledge-editing, large-language-models, model-editing, natural-language-processing, safety, sta, steering-behaviors
Language: Python
Homepage:
Size: 410 KB
Stars: 5
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# **Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms**

## 🔧 Pip Installation

To get started, create a conda environment and install the dependencies:

```shell
conda create -n sta python=3.11 -y
conda activate sta
pip install -r requirements.txt
cd ./TransformerLens
pip install -e . # TransformerLens 2.4.0
cd ../trl
pip install -e . # needed for SFT and DPO training
```

## 📂 Data Preparation

**Dataset and Steering Vector**

The data for STA can be downloaded [here](https://huggingface.co/datasets/mengru/data_for_STA).

**Directory Structure**

```
steer-target-atoms
└── data
    ├── mmlu
    └── safety
```
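
To confirm the downloaded data sits where the layout above expects it, a small check like the following may help. It is a minimal sketch: the function name `missing_data_dirs` is hypothetical and not part of this repository, and it only tests for the two sub-directories shown above.

```python
from pathlib import Path

# Sub-directories named in the directory structure above.
EXPECTED_SUBDIRS = ("mmlu", "safety")


def missing_data_dirs(repo_root):
    """Return the expected sub-directories that are absent under <repo_root>/data."""
    data_dir = Path(repo_root) / "data"
    return [name for name in EXPECTED_SUBDIRS if not (data_dir / name).is_dir()]


if __name__ == "__main__":
    # Run from the repository root (hypothetical helper, not shipped with the repo).
    missing = missing_data_dirs(".")
    if missing:
        print("Missing under ./data:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

Run it from the repository root after unpacking the dataset. An empty result means both `data/mmlu` and `data/safety` exist.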

## 💻 Run

### Steering vector

#### Download directly

The data downloaded from [here](https://huggingface.co/datasets/mengru/data_for_STA) already includes the steering vectors used in the paper:

- steering vector for Gemma-2-9b-it (./data/safety/toxic_DINM_it/sae_caa_vector_it/gemma-2-9b-it_safety/act_and_fre_trim/steering_vector)

- steering vector for Gemma-2-9b-pt (./data/safety/toxic_DINM_pt/sae_caa_vector_pt/gemma-2-9b_safety/act_and_fre_trim/steering_vector)

Then, you can directly go to the [Steering the behaviors of LLMs](#steering-the-behaviors-of-llms) section.

#### Generate the steering vectors yourself

You can also generate these steering vectors yourself with the following steps:

1. Download the SAE

- Download the SAE for Gemma-2-9b-it from [here](https://huggingface.co/google/gemma-scope-9b-it-res/tree/main/layer_20/width_16k/average_l0_91), then replace the value of `sae_paths` (in ./scripts/generate_vector/gemma/sta/run_selection_safe_gemma_it_DINM.sh) with your own path.

- Download the SAE for Gemma-2-9b-pt from [here](https://huggingface.co/google/gemma-scope-9b-pt-res/tree/main/layer_24/width_16k/average_l0_114), then replace the value of `sae_paths` (in ./scripts/generate_vector/gemma/sta/run_selection_safe_gemma_pt_DINM.sh) with your own path.

2. Generate the steering vectors

```shell
bash run_generate_vector.sh
```
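
The SAE download in step 1 can also be scripted rather than done through the browser. The sketch below assumes the `huggingface_hub` package is installed. The helper names `parse_hf_tree_url` and `download_sae` are hypothetical and not part of this repository. It splits a `.../tree/<revision>/<subfolder>` link into its parts and fetches only the files under that subfolder.

```python
from urllib.parse import urlparse


def parse_hf_tree_url(url):
    """Split a Hugging Face '.../tree/<revision>/<subfolder>' URL into its parts.

    Assumes the revision name contains no slash (true for 'main').
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 4 or parts[2] != "tree":
        raise ValueError(f"not a Hugging Face tree URL: {url}")
    repo_id = "/".join(parts[:2])
    revision = parts[3]
    subfolder = "/".join(parts[4:])
    return repo_id, revision, subfolder


def download_sae(url, local_dir):
    """Download only the files under the linked subfolder (hypothetical helper)."""
    # Imported lazily so the URL parsing above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    repo_id, revision, subfolder = parse_hf_tree_url(url)
    patterns = [f"{subfolder}/*"] if subfolder else None
    return snapshot_download(
        repo_id=repo_id,
        revision=revision,
        allow_patterns=patterns,
        local_dir=local_dir,
    )
```

Calling `download_sae(<link from step 1>, "./sae")` should place the files under `./sae/<subfolder>`. Whether `sae_paths` expects that folder or a specific file inside it is not stated here, so check the corresponding script before editing it.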

### Steering the behaviors of LLMs

You can steer the behavior of LLMs with a **steering vector**:

```shell
bash run_main_table.sh
```
> ❗️ Replace the value of `model_name_or_path` in the corresponding xx.sh file with your own model path.
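
For readers new to activation steering, the generic idea is to add a scaled vector to a layer's hidden state at inference time. The toy sketch below illustrates only that generic operation on plain Python lists. It is not the implementation behind `run_main_table.sh`, and the function name, values, and `strength` parameter are hypothetical.

```python
def apply_steering(hidden_state, steering_vector, strength=1.0):
    """Return hidden_state + strength * steering_vector, element by element."""
    if len(hidden_state) != len(steering_vector):
        raise ValueError("hidden state and steering vector must have the same size")
    return [h + strength * v for h, v in zip(hidden_state, steering_vector)]


hidden = [0.5, -1.0, 2.0]  # toy activation for a single token
vector = [1.0, 0.0, -0.5]  # toy steering direction
print(apply_steering(hidden, vector, strength=2.0))  # prints [2.5, -1.0, 1.0]
```

A larger `strength` pushes the activation further along the steering direction, and a negative value pushes it the opposite way.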

### Evaluation

```shell
bash run_eval.sh
```

## 🌟 Some Important Information

This repository is developed for our STA paper. We also release [EasyEdit2](https://github.com/zjunlp/EasyEdit/blob/main/README_2.md), a unified framework for controllable editing without retraining. It integrates multiple steering methods to facilitate usage and evaluation.
Unlike this repository, which depends on TransformerLens, EasyEdit2 is independent of it.

We recommend using [EasyEdit2](https://github.com/zjunlp/EasyEdit/blob/main/README_2.md) for future research and applications.

## 📖 Citation

Please cite our paper if you use **STA** in your work.

```bibtex
@misc{wang2025STA,
title={Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms},
author={Mengru Wang and Ziwen Xu and Shengyu Mao and Shumin Deng and Zhaopeng Tu and Huajun Chen and Ningyu Zhang},
year={2025},
eprint={2505.20322},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
