https://github.com/annahdo/implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
- Host: GitHub
- URL: https://github.com/annahdo/implementing_activation_steering
- Owner: annahdo
- License: apache-2.0
- Created: 2024-01-12T14:23:41.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-05T17:19:40.000Z (about 1 year ago)
- Last Synced: 2024-10-16T22:22:41.144Z (6 months ago)
- Language: Jupyter Notebook
- Size: 40 KB
- Stars: 11
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-interpretability - A tutorial on doing it manually
README
# Implementing activation steering
This repository provides code for different ways to implement [activation steering](https://www.lesswrong.com/tag/activation-engineering) to change the behavior of LLMs.
See also this [blog post](https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering). It is aimed at people who are new to activation/representation steering/engineering/editing.
I use GPT2-XL as an example model for the implementation.

## Install
Tested with python 3.10.
Make a new environment and install the libraries in `requirements.txt`.
```
pip install -r requirements.txt
```

## General approach to activation steering
The idea is simple: we just add some vector, [for example the "Love" vector](https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector), to the internal model activations and thus influence the model output in a similar (but sometimes more effective) way to prompting.
What happens internally is _shifting_ the activations into a different region, kind of like in the picture below:
In general there are a few steps involved which I simplify in the following:
* Decide on a layer $l$ and transformer module $\phi$ to apply the activation steering to. This is often the residual stream of one of the hidden layers.
* Define a steering vector. In the simplest case we just take the difference of the activations of two encoded strings, like $v = \phi_l(\text{Love}) - \phi_l(\text{Hate})$.
* Add the vector to the activation during the forward pass. In the simplest case it's something like $\tilde{\phi}_l = \phi_l + v$.

## Implementations
* [custom_wrapper.ipynb](custom_wrapper.ipynb) - writing your own wrappers to equip modules with additional functionality
* [transformer_lens.ipynb](transformer_lens.ipynb) - using the [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library
* [baukit.ipynb](baukit.ipynb) - using the [baukit](https://github.com/davidbau/baukit) library
* [pytorch_hooks.ipynb](pytorch_hooks.ipynb) - using [PyTorch hooks](https://pytorch.org/docs/stable/generated/torch.nn.modules.module.register_module_forward_hook.html) directly (TransformerLens and baukit use PyTorch hooks internally)
* [bias_editing.ipynb](bias_editing.ipynb) - editing the model bias
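
To make the general approach concrete, here is a self-contained sketch of the hook-based pattern on a toy two-layer model. The model, the activation dimension, and the random "Love"/"Hate" stand-in inputs are placeholders for illustration only; the notebooks apply the same pattern to GPT2-XL.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: we steer the output of the first layer,
# which plays the role of the module phi at layer l.
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 8))
layer = model[0]

# 1) Capture the activation phi_l for two contrasting inputs.
captured = {}
def capture_hook(module, inputs, output):
    captured["act"] = output.detach()

x_love = torch.randn(8)  # stand-in for the encoded string "Love"
x_hate = torch.randn(8)  # stand-in for the encoded string "Hate"
handle = layer.register_forward_hook(capture_hook)
model(x_love); act_love = captured["act"]
model(x_hate); act_hate = captured["act"]
handle.remove()

# 2) Steering vector: v = phi_l(Love) - phi_l(Hate)
v = act_love - act_hate

# 3) Add v to the activation during the forward pass: phi_l -> phi_l + v.
#    Returning a new tensor from a forward hook replaces the module's output.
def steering_hook(module, inputs, output):
    return output + v

x = torch.randn(8)
plain = model(x)
handle = layer.register_forward_hook(steering_hook)
steered = model(x)
handle.remove()
print((steered - plain).norm())  # the steered output differs from the plain one
```

With a real transformer the same three steps apply; the main practical difference is that a block's forward hook may receive a tuple rather than a single tensor, so the hook has to modify the hidden-state entry of that tuple.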