{"id":17361216,"url":"https://github.com/annahdo/implementing_activation_steering","last_synced_at":"2025-02-26T12:31:35.582Z","repository":{"id":220661247,"uuid":"742458066","full_name":"annahdo/implementing_activation_steering","owner":"annahdo","description":"A collection of different ways to implement accessing and modifying internal model activations for LLMs","archived":false,"fork":false,"pushed_at":"2024-02-05T17:19:40.000Z","size":41,"stargazers_count":11,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-16T22:22:41.144Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/annahdo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-01-12T14:23:41.000Z","updated_at":"2024-09-30T15:11:46.000Z","dependencies_parsed_at":"2024-02-03T13:27:23.639Z","dependency_job_id":"aff1d819-bf8a-435e-906e-9d8b97531059","html_url":"https://github.com/annahdo/implementing_activation_steering","commit_stats":null,"previous_names":["annahdo/implementing_activation_steering"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/annahdo%2Fimplementing_activation_steering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/annahdo%2Fimplementing_activation_steering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/annahdo%2Fimplementing_activation_steering/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/annahdo%2Fimplementing_activation_steering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/annahdo","download_url":"https://codeload.github.com/annahdo/implementing_activation_steering/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240852567,"owners_count":19868280,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T19:31:57.752Z","updated_at":"2025-02-26T12:31:35.553Z","avatar_url":"https://github.com/annahdo.png","language":"Jupyter Notebook","funding_links":[],"categories":["Mechanistic interpretability libraries"],"sub_categories":[],"readme":"# Implementing activation steering\n\nThis repository provides code for different ways to implement [activation steering](https://www.lesswrong.com/tag/activation-engineering) to change the behavior of LLMs. \nSee also this [blog post](https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering).\n\nIt is aimed at people who are new to activation/representation steering/engineering/editing.\nI use GPT2-XL as an example model for the implementation.\n\n## Install\nTested with Python 3.10. 
\nMake a new environment and install the libraries in `requirements.txt`.\n```\npip install -r requirements.txt\n```\n\n## General approach to activation steering\n\nThe idea is simple: we just add some vector [(for example the \"Love\" vector)](https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector) to the internal model activations and thus influence the model output in a way similar to (but sometimes more effective than) prompting. \nWhat happens internally is _shifting_ the activations into a different region, kind of like in the picture below:\n\n\u003cimg src=\"https://github.com/user-attachments/assets/11042cec-3ca0-402b-982f-f7ec1d72e075\" width=\"400\"\u003e\n\nIn general there are a few steps involved, which I simplify as follows:\n\n* Decide on a layer $l$ and transformer module $\\phi$ to apply the activation steering to. This is often the residual stream of one of the hidden layers.\n* Define a steering vector. In the simplest case we just take the difference of the activations of two encoded strings like $v=\\phi_l(Love)-\\phi_l(Hate)$. \n* Add the vector to the activation during the forward pass. 
In the simplest case it's something like $\\tilde{\\phi}_l=\\phi_l+v$.\n\n## Implementations\n\n* [custom_wrapper.ipynb](custom_wrapper.ipynb) - writing your own wrappers to equip modules with additional functionality\n* [transformer_lens.ipynb](transformer_lens.ipynb) - using the [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library\n* [baukit.ipynb](baukit.ipynb) - using the [baukit](https://github.com/davidbau/baukit) library\n* [pytorch_hooks.ipynb](pytorch_hooks.ipynb) - using [PyTorch hooks](https://pytorch.org/docs/stable/generated/torch.nn.modules.module.register_module_forward_hook.html) directly (TransformerLens and baukit use PyTorch hooks internally)\n* [bias_editing.ipynb](bias_editing.ipynb) - editing the model bias\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fannahdo%2Fimplementing_activation_steering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fannahdo%2Fimplementing_activation_steering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fannahdo%2Fimplementing_activation_steering/lists"}