# Implementing activation steering

This repository provides code for different ways to implement [activation steering](https://www.lesswrong.com/tag/activation-engineering) to change the behavior of LLMs.
See also this [blogpost](https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering).

It is aimed at people who are new to activation/representation steering/engineering/editing.
I use GPT2-XL as an example model for the implementation.

## Install
Tested with python 3.10.
Make a new environment and install the libraries in `requirements.txt`.
```
pip install -r requirements.txt
```

## General approach to activation steering

The idea is simple: we just add some vector [(for example the "Love" vector)](https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector) to the internal model activations and thus influence the model output in a similar (but sometimes more effective) way to prompting.
What happens internally is a _shift_ of the activations into a different region, kind of like in the picture below:

In general there are a few steps involved, which I outline in simplified form below (a minimal code sketch follows the list):

* Decide on a layer $l$ and transformer module $\phi$ to apply the activation steering to. This is often the residual stream of one of the hidden layers.
* Define a steering vector. In the simplest case we just take the difference of the activations of two encoded strings, e.g. $v = \phi_l(\text{Love}) - \phi_l(\text{Hate})$.
* Add the vector to the activation during the forward pass. In the simplest case it's something like $\tilde{\phi}_l=\phi_l+v$.
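To make these steps concrete, here is a minimal sketch of the whole recipe using plain PyTorch forward hooks on GPT2-XL. The layer index, prompts, and steering position are illustrative assumptions, not choices prescribed by this repository:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()

layer = 6                           # illustrative layer choice
block = model.transformer.h[layer]  # residual stream after this block

# steps 1+2: record last-token activations for two prompts, take the difference
captured = {}

def capture_hook(module, inputs, output):
    captured["act"] = output[0].detach()  # output[0]: (batch, seq, hidden)

def last_token_activation(text):
    handle = block.register_forward_hook(capture_hook)
    with torch.no_grad():
        model(**tokenizer(text, return_tensors="pt"))
    handle.remove()
    return captured["act"][0, -1]

v = last_token_activation("Love") - last_token_activation("Hate")

# step 3: add v to the block's output during the forward pass
def steering_hook(module, inputs, output):
    return (output[0] + v,) + output[1:]

handle = block.register_forward_hook(steering_hook)
tokens = tokenizer("I think dogs are", return_tensors="pt")
out = model.generate(**tokens, max_new_tokens=20, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
handle.remove()
```

In practice the steering vector is often rescaled by a coefficient and added only at selected token positions; these are the main knobs to experiment with.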

## Implementations

* [custom_wrapper.ipynb](custom_wrapper.ipynb) - writing your own wrappers to equip modules with additional functionality
* [transformer_lens.ipynb](transformer_lens.ipynb) - using the [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library (a short usage sketch follows this list)
* [baukit.ipynb](baukit.ipynb) - using the [baukit](https://github.com/davidbau/baukit) library
* [pytorch_hooks.ipynb](pytorch_hooks.ipynb) - using [PyTorch hooks](https://pytorch.org/docs/stable/generated/torch.nn.modules.module.register_module_forward_hook.html) directly (TransformerLens and baukit use PyTorch hooks internally)
* [bias_editing.ipynb](bias_editing.ipynb) - editing the model bias
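As one illustration of the libraries listed above, here is a hedged sketch of the same steering recipe written with TransformerLens hook points. The hook name `blocks.6.hook_resid_post` and the layer choice are my assumptions for GPT2-XL; see [transformer_lens.ipynb](transformer_lens.ipynb) for the repository's actual implementation:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
layer = 6  # illustrative layer choice
hook_name = f"blocks.{layer}.hook_resid_post"  # residual stream after the block

# steering vector from the cached activations of two contrasting prompts
_, love_cache = model.run_with_cache("Love")
_, hate_cache = model.run_with_cache("Hate")
v = love_cache[hook_name][0, -1] - hate_cache[hook_name][0, -1]

def steering_hook(resid, hook):
    # resid has shape (batch, seq, d_model); v broadcasts over batch and seq
    return resid + v

# temporarily attach the hook while generating
with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    print(model.generate("I think dogs are", max_new_tokens=20, do_sample=False))
```

The main convenience over raw PyTorch hooks is that TransformerLens exposes named hook points into the residual stream, so no manual unpacking of module outputs is needed.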