https://github.com/annahdo/implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
- Host: GitHub
- URL: https://github.com/annahdo/implementing_activation_steering
- Owner: annahdo
- License: apache-2.0
- Created: 2024-01-12T14:23:41.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-05T17:19:40.000Z (about 1 year ago)
- Last Synced: 2024-10-16T22:22:41.144Z (6 months ago)
- Language: Jupyter Notebook
- Size: 40 KB
- Stars: 11
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-interpretability - A tutorial on doing it manually
README
# Implementing activation steering
This repository provides code for different ways to implement [activation steering](https://www.lesswrong.com/tag/activation-engineering) to change the behavior of LLMs.
See also this [blog post](https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering). It is aimed at people who are new to activation/representation steering/engineering/editing.
I use GPT2-XL as an example model for the implementation.

## Install
Tested with python 3.10.
Make a new environment and install the libraries in `requirements.txt`.
```
pip install -r requirements.txt
```

## General approach to activation steering
The idea is simple: we just add some vector, [for example the "Love" vector](https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector), to the internal model activations and thus influence the model output in a similar (but sometimes more effective) way to prompting.
What happens internally is _shifting_ the activations into a different region, kind of like in the picture below:
In general there are a few steps involved which I simplify in the following:
* Decide on a layer $l$ and transformer module $\phi$ to apply the activation steering to. This is often the residual stream of one of the hidden layers.
* Define a steering vector. In the simplest case we just take the difference of the activations of two encoded strings, like $v = \phi_l(\text{Love}) - \phi_l(\text{Hate})$.
* Add the vector to the activation during the forward pass. In the simplest case it's something like $\tilde{\phi}_l = \phi_l + v$.

## Implementations
* [custom_wrapper.ipynb](custom_wrapper.ipynb) - writing your own wrappers to equip modules with additional functionality
* [transformer_lens.ipynb](transformer_lens.ipynb) - using the [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library
* [baukit.ipynb](baukit.ipynb) - using the [baukit](https://github.com/davidbau/baukit) library
* [pytorch_hooks.ipynb](pytorch_hooks.ipynb) - using [PyTorch hooks](https://pytorch.org/docs/stable/generated/torch.nn.modules.module.register_module_forward_hook.html) directly (TransformerLens and baukit use PyTorch hooks internally)
* [bias_editing.ipynb](bias_editing.ipynb) - editing the model bias
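
To make the general approach concrete, here is a self-contained sketch of the hook-based pattern on a toy two-layer model. The model, the activation dimension, and the random "Love"/"Hate" stand-in inputs are placeholders for illustration only; the notebooks apply the same pattern to GPT2-XL.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: we steer the output of the first layer,
# which plays the role of the module phi at layer l.
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 8))
layer = model[0]

# 1) Capture the activation phi_l for two contrasting inputs.
captured = {}
def capture_hook(module, inputs, output):
    captured["act"] = output.detach()

x_love = torch.randn(8)  # stand-in for the encoded string "Love"
x_hate = torch.randn(8)  # stand-in for the encoded string "Hate"
handle = layer.register_forward_hook(capture_hook)
model(x_love); act_love = captured["act"]
model(x_hate); act_hate = captured["act"]
handle.remove()

# 2) Steering vector: v = phi_l(Love) - phi_l(Hate)
v = act_love - act_hate

# 3) Add v to the activation during the forward pass: phi_l -> phi_l + v.
#    Returning a new tensor from a forward hook replaces the module's output.
def steering_hook(module, inputs, output):
    return output + v

x = torch.randn(8)
plain = model(x)
handle = layer.register_forward_hook(steering_hook)
steered = model(x)
handle.remove()
print((steered - plain).norm())  # the steered output differs from the plain one
```

With a real transformer the same three steps apply; the main practical difference is that a block's forward hook may receive a tuple rather than a single tensor, so the hook has to modify the hidden-state entry of that tuple.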