{"id":17361211,"url":"https://github.com/ndif-team/nnsight","last_synced_at":"2025-02-26T12:31:32.794Z","repository":{"id":202623673,"uuid":"707771473","full_name":"ndif-team/nnsight","owner":"ndif-team","description":"The nnsight package enables interpreting and manipulating the internals of deep learned models.","archived":false,"fork":false,"pushed_at":"2024-10-24T20:35:49.000Z","size":44895,"stargazers_count":388,"open_issues_count":29,"forks_count":36,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-10-25T06:27:32.864Z","etag":null,"topics":["interpretability","machine-learning","neural-networks","python","pytorch"],"latest_commit_sha":null,"homepage":"https://nnsight.net/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ndif-team.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-20T16:09:58.000Z","updated_at":"2024-10-25T02:07:45.000Z","dependencies_parsed_at":"2024-03-17T21:37:45.667Z","dependency_job_id":"0b7280ea-9ad6-42d4-a91a-5e271d00a130","html_url":"https://github.com/ndif-team/nnsight","commit_stats":null,"previous_names":["jadenfiotto-kaufman/nnsight","ndif-team/nnsight"],"tags_count":58,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ndif-team%2Fnnsight","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ndif-team%2Fnnsight/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ndif-team%2Fnnsight/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ndif-team%2Fnnsight/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ndif-team","download_url":"https://codeload.github.com/ndif-team/nnsight/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240852528,"owners_count":19868272,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["interpretability","machine-learning","neural-networks","python","pytorch"],"created_at":"2024-10-15T19:31:57.497Z","updated_at":"2025-02-26T12:31:32.770Z","avatar_url":"https://github.com/ndif-team.png","language":"Jupyter Notebook","funding_links":[],"categories":["XAI Libraries for NLP","Python","Interpretability \u0026 Mechanistic Analysis","10. 
AI Safety, Alignment \u0026 Interpretability","Mechanistic interpretability libraries"],"sub_categories":["Libraries \u0026 Frameworks"],"readme":"\u003cimg src=\"./docs/source/_static/images/nnsight_logo.svg\" alt=\"drawing\" style=\"width:200px;float:left\"/\u003e\n\n# nnsight \n\n\u003ca href=\"https://arxiv.org/abs/2407.14561\"\u003e\u003cimg src=\"https://img.shields.io/badge/READ%20THE%20PAPER%20HERE!-orange\" style=\"transform: scale(3);\"\u003e\u003c/a\u003e\n\n\u003ca href=\"https://www.nnsight.net\"\u003e\u003cimg src=\"https://img.shields.io/badge/-Read%20the%20Docs%20Here-blue?style=for-the-badge\u0026logo=Read-the-Docs\u0026logoColor=white\"\u003e\u003c/img\u003e\u003c/a\u003e \u003ca href=\"https://discord.gg/6uFJmCSwW7\"\u003e\u003cimg src=\"https://img.shields.io/badge/Discord-5865F2?style=for-the-badge\u0026logo=discord\u0026logoColor=white\"\u003e\u003c/a\u003e\n\nThe `nnsight` package enables interpreting and manipulating the internals of deep learned models. Read our [paper!](https://arxiv.org/abs/2407.14561)\n\n#### Installation\n\nInstall this package through pip by running:\n\n`pip install nnsight`\n\n#### Examples\n\nHere is a simple example where we run the nnsight API locally on gpt2 and save the hidden states of the last layer:\n\n```python\nfrom nnsight import LanguageModel\n\nmodel = LanguageModel('openai-community/gpt2', device_map='auto')\n\nwith model.trace('The Eiffel Tower is in the city of') as tracer:\n\n      hidden_states = model.transformer.h[-1].output[0].save()\n\n      output = model.output.save()\n```\n\nLet's go over this piece by piece.\n\nWe import the `LanguageModel` object from the `nnsight` module and create a gpt2 model using the Hugging Face repo ID for gpt2, `'openai-community/gpt2'`. 
This accepts arguments to create the model, including `device_map` to specify which device to run on.\n\n```python\nfrom nnsight import LanguageModel\n\nmodel = LanguageModel('openai-community/gpt2',device_map='auto')\n```\n\nThen, we create a tracing context block by calling `.trace(...)` on the model object. This denotes that we want to run the model with our prompt.\n\n\n```python\nwith model.trace('The Eiffel Tower is in the city of') as tracer:\n```\n\nNote that calling `.trace(...)` does not actually initialize or run the model. Only after the tracing block is exited is the actual model loaded and run. All operations in the block are \"proxies\", which essentially create a graph of operations we wish to carry out later.\n\nWithin this context, all operations/interventions will be applied to the processing of the given prompt.\n\n```python\nhidden_states = model.transformer.h[-1].output[0].save()\n```\n\nOn this line we're saying: access the last layer of the transformer `model.transformer.h[-1]`, access its output `.output`, index it at 0 `.output[0]`, and save it `.save()`.\n\nA few notes: we can see the module tree of the model by printing the model. 
This allows us to know what attributes to access to get to the module we need.\nRunning `print(model)` results in:\n\n```\nGPT2LMHeadModel(\n  (transformer): GPT2Model(\n    (wte): Embedding(50257, 768)\n    (wpe): Embedding(1024, 768)\n    (drop): Dropout(p=0.1, inplace=False)\n    (h): ModuleList(\n      (0-11): 12 x GPT2Block(\n        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n        (attn): GPT2Attention(\n          (c_attn): Conv1D()\n          (c_proj): Conv1D()\n          (attn_dropout): Dropout(p=0.1, inplace=False)\n          (resid_dropout): Dropout(p=0.1, inplace=False)\n        )\n        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n        (mlp): GPT2MLP(\n          (c_fc): Conv1D()\n          (c_proj): Conv1D()\n          (act): NewGELUActivation()\n          (dropout): Dropout(p=0.1, inplace=False)\n        )\n      )\n    )\n    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n  )\n  (lm_head): Linear(in_features=768, out_features=50257, bias=False)\n)\n```\n\n`.output` returns a proxy for the output of this module. This essentially means we're saying: when we get to the output of this module during inference, grab it and perform any operations we define on it (which also become proxies). There are two operational proxies here, one for getting the 0th index of the output, and one for saving the output. We take the 0th index because the output of each gpt2 transformer layer is a tuple whose first element holds the actual hidden states (the remaining entries come from attention). We can call `.shape` on any proxy to get the shape its value will eventually have. \nRunning `print(model.transformer.h[-1].output.shape)` returns `(torch.Size([1, 10, 768]), (torch.Size([1, 12, 10, 64]), torch.Size([1, 12, 10, 64])))`\n\nDuring processing of the intervention computational graph we are building, when the value of a proxy is no longer needed, its value is dereferenced and destroyed. 
However, calling `.save()` on the proxy tells the computation graph to keep the value of this proxy and never destroy it, allowing us to access the value after generation.\n\nAfter exiting the tracing context, the model is run with the specified arguments and intervention graph. `output` is populated with the actual output and `hidden_states` will contain the hidden states.\n\n```python\nprint(output)\nprint(hidden_states)\n```\n\nreturns:\n\n```\ntensor([[ 464,  412,  733,  417, 8765,  318,  287,  262, 1748,  286, 6342]],\n       device='cuda:0')\ntensor([[[ 0.0505, -0.1728, -0.1690,  ..., -1.0096,  0.1280, -1.0687],\n         [ 8.7494,  2.9057,  5.3024,  ..., -8.0418,  1.2964, -2.8677],\n         [ 0.2960,  4.6686, -3.6642,  ...,  0.2391, -2.6064,  3.2263],\n         ...,\n         [ 2.1537,  6.8917,  3.8651,  ...,  0.0588, -1.9866,  5.9188],\n         [-0.4460,  7.4285, -9.3065,  ...,  2.0528, -2.7946,  0.5556],\n         [ 6.6286,  1.7258,  4.7969,  ...,  7.6714,  3.0682,  2.0481]]],\n       device='cuda:0')\n```\n\n\n\n---\n\n###### Operations\n\nMost basic operations and torch operations work on proxies and are added to the computation graph. \n\n```python\nfrom nnsight import LanguageModel\nimport torch \n\nmodel = LanguageModel('openai-community/gpt2', device_map='cuda')\n\nwith model.trace('The Eiffel Tower is in the city of'):\n\n  hidden_states_pre = model.transformer.h[-1].output[0].save()\n\n  hs_sum = torch.sum(hidden_states_pre).save()\n\n  hs_edited = hidden_states_pre + hs_sum\n\n  hs_edited = hs_edited.save()\n\nprint(hidden_states_pre)\nprint(hs_sum)\nprint(hs_edited)\n```\n\nIn this example we compute the sum of the hidden states and add it to the hidden states themselves (for whatever reason). 
By saving the various steps, we can see how the values change.\n\n```\ntensor([[[ 0.0505, -0.1728, -0.1690,  ..., -1.0096,  0.1280, -1.0687],\n         [ 8.7494,  2.9057,  5.3024,  ..., -8.0418,  1.2964, -2.8677],\n         [ 0.2960,  4.6686, -3.6642,  ...,  0.2391, -2.6064,  3.2263],\n         ...,\n         [ 2.1537,  6.8917,  3.8651,  ...,  0.0588, -1.9866,  5.9188],\n         [-0.4460,  7.4285, -9.3065,  ...,  2.0528, -2.7946,  0.5556],\n         [ 6.6286,  1.7258,  4.7969,  ...,  7.6714,  3.0682,  2.0481]]],\n       device='cuda:0')\ntensor(501.2957, device='cuda:0')\ntensor([[[501.3461, 501.1229, 501.1267,  ..., 500.2860, 501.4237, 500.2270],\n         [510.0451, 504.2014, 506.5981,  ..., 493.2538, 502.5920, 498.4279],\n         [501.5916, 505.9643, 497.6315,  ..., 501.5348, 498.6892, 504.5219],\n         ...,\n         [503.4493, 508.1874, 505.1607,  ..., 501.3545, 499.3091, 507.2145],\n         [500.8496, 508.7242, 491.9892,  ..., 503.3485, 498.5010, 501.8512],\n         [507.9242, 503.0215, 506.0926,  ..., 508.9671, 504.3639, 503.3438]]],\n       device='cuda:0')\n       \n```\n\n---\n###### Setting\n\nWe often want not only to see what's happening during computation, but also to intervene and edit the flow of information. \n\n```python\nfrom nnsight import LanguageModel\nimport torch \n\nmodel = LanguageModel('openai-community/gpt2', device_map='cuda')\n\nwith model.trace('The Eiffel Tower is in the city of') as tracer:\n\n  hidden_states_pre = model.transformer.h[-1].mlp.output.clone().save()\n\n  noise = (0.001**0.5)*torch.randn(hidden_states_pre.shape)\n\n  model.transformer.h[-1].mlp.output = hidden_states_pre + noise\n\n  hidden_states_post = model.transformer.h[-1].mlp.output.save()\n\nprint(hidden_states_pre)\nprint(hidden_states_post)\n```\nIn this example, we create a tensor of noise to add to the hidden states. We then add it and use the assignment operator `=` to update the value of `.output` with these new noised activations. 
\n\nWe can see the change in the results:\n\n```\ntensor([[[ 0.0505, -0.1728, -0.1690,  ..., -1.0096,  0.1280, -1.0687],\n         [ 8.7494,  2.9057,  5.3024,  ..., -8.0418,  1.2964, -2.8677],\n         [ 0.2960,  4.6686, -3.6642,  ...,  0.2391, -2.6064,  3.2263],\n         ...,\n         [ 2.1537,  6.8917,  3.8651,  ...,  0.0588, -1.9866,  5.9188],\n         [-0.4460,  7.4285, -9.3065,  ...,  2.0528, -2.7946,  0.5556],\n         [ 6.6286,  1.7258,  4.7969,  ...,  7.6714,  3.0682,  2.0481]]],\n       device='cuda:0')\ntensor([[[ 0.0674, -0.1741, -0.1771,  ..., -0.9811,  0.1972, -1.0645],\n         [ 8.7080,  2.9067,  5.2924,  ..., -8.0253,  1.2729, -2.8419],\n         [ 0.2611,  4.6911, -3.6434,  ...,  0.2295, -2.6007,  3.2635],\n         ...,\n         [ 2.1859,  6.9242,  3.8666,  ...,  0.0556, -2.0282,  5.8863],\n         [-0.4568,  7.4101, -9.3698,  ...,  2.0630, -2.7971,  0.5522],\n         [ 6.6764,  1.7416,  4.8027,  ...,  7.6507,  3.0754,  2.0218]]],\n       device='cuda:0')\n```\n\n---\n###### Multiple Token Generation\n\nWhen generating more than one token, use `.generate(...)` and call `.next()` on the module whose next value you want; this denotes that the following interventions should be applied to the subsequent generation step.\n\nHere we again generate using gpt2, but generate three tokens and save the hidden states of the last layer for each one:\n\n```python\nfrom nnsight import LanguageModel\n\nmodel = LanguageModel('openai-community/gpt2', device_map='cuda')\n\nwith model.generate('The Eiffel Tower is in the city of', max_new_tokens=3) as tracer:\n\n  # Hidden states for the first generation step\n  hidden_states1 = model.transformer.h[-1].output[0].save()\n\n  # .next() on the module moves to the following generation step\n  hidden_states2 = model.transformer.h[-1].next().output[0].save()\n\n  hidden_states3 = model.transformer.h[-1].next().output[0].save()\n\n```\n---\n\n###### Cross Prompt Intervention\n\n\nIntervention operations work cross prompt! 
Use two invocations within the same generation block and operations can work between them.\n\nYou can do this by not passing a prompt into `.trace`/`.generate`, but by calling `.invoke(...)` on the created tracer object.\n\nIn this case, we grab the token embeddings coming from the first prompt, `\"Madison square garden is located in the city of New\"` and replace the embeddings of the second prompt with them.\n\n```python\nfrom nnsight import LanguageModel\n\nmodel = LanguageModel('openai-community/gpt2', device_map='cuda')\n\nwith model.generate(max_new_tokens=3) as tracer:\n    \n    with tracer.invoke(\"Madison square garden is located in the city of New\"):\n\n        embeddings = model.transformer.wte.output\n\n    with tracer.invoke(\"_ _ _ _ _ _ _ _ _ _\"):\n\n        model.transformer.wte.output = embeddings\n\n        output = model.generator.output.save()\n\nprint(model.tokenizer.decode(output[0]))\nprint(model.tokenizer.decode(output[1]))\n```\n\nThis results in:\n\n```\nMadison square garden is located in the city of New York City.\n_ _ _ _ _ _ _ _ _ _ York City.\n```\n\nWe also could have entered a pre-saved embedding tensor as shown here:\n\n```python\nfrom nnsight import LanguageModel\n\nmodel = LanguageModel('openai-community/gpt2', device_map='cuda')\n\nwith model.generate(max_new_tokens=3) as tracer:\n    \n    with tracer.invoke(\"Madison square garden is located in the city of New\") as invoker:\n\n        embeddings = model.transformer.wte.output.save()\n\nwith model.generate(max_new_tokens=3) as tracer:\n\n    with tracer.invoke(\"_ _ _ _ _ _ _ _ _ _\") as invoker:\n\n        model.transformer.wte.output = embeddings.value\n\n```\n---\n\n###### Ad-hoc Module\n\nAnother thing we can do is apply modules in the model's module tree at any point during computation, even if it's out of order.\n\n```python\nfrom nnsight import LanguageModel\nimport torch\n\nmodel = LanguageModel(\"openai-community/gpt2\", device_map='cuda')\n\nwith 
model.generate('The Eiffel Tower is in the city of') as generator:\n\n  hidden_states = model.transformer.h[-1].output[0]\n  hidden_states = model.lm_head(model.transformer.ln_f(hidden_states)).save()\n  tokens = torch.softmax(hidden_states, dim=2).argmax(dim=2).save()\n        \nprint(hidden_states)\nprint(tokens)\nprint(model.tokenizer.decode(tokens[0]))\n\n```\n\nHere we get the hidden states of the last layer as usual. We also chain `model.transformer.ln_f` and `model.lm_head` in order to \"decode\" the hidden states into vocabulary space.\nApplying softmax and then argmax transforms the vocabulary-space hidden states into actual tokens, which we can then decode with the tokenizer.\n\nThe output looks like:\n\n```\ntensor([[[ -36.2874,  -35.0114,  -38.0793,  ...,  -40.5163,  -41.3759,\n           -34.9193],\n         [ -68.8886,  -70.1562,  -71.8408,  ...,  -80.4195,  -78.2552,\n           -71.1206],\n         [ -82.2950,  -81.6519,  -83.9941,  ...,  -94.4878,  -94.5194,\n           -85.6998],\n         ...,\n         [-113.8675, -111.8628, -113.6634,  ..., -116.7652, -114.8267,\n          -112.3621],\n         [ -81.8531,  -83.3006,  -91.8192,  ...,  -92.9943,  -89.8382,\n           -85.6898],\n         [-103.9307, -102.5054, -105.1563,  ..., -109.3099, -110.4195,\n          -103.1395]]], device='cuda:0')\ntensor([[ 198,   12,  417, 8765,  318,  257,  262, 3504, 7372, 6342]],\n       device='cuda:0')\n\n-el Tower is a the middle centre Paris\n```\n\n---\n\nMore examples can be found at [nnsight.net](https://www.nnsight.net)\n\n### Citation\n\nIf you use `nnsight` in your research, please cite using the following\n\n```bibtex\n@article{fiottokaufman2024nnsightndifdemocratizingaccess,\n      title={NNsight and NDIF: Democratizing Access to Foundation Model Internals}, \n      author={Jaden Fiotto-Kaufman and Alexander R Loftus and Eric Todd and Jannik Brinkmann and Caden Juang and Koyena Pal and Can Rager and Aaron Mueller and Samuel 
Marks and Arnab Sen Sharma and Francesca Lucchetti and Michael Ripa and Adam Belfki and Nikhil Prakash and Sumeet Multani and Carla Brodley and Arjun Guha and Jonathan Bell and Byron Wallace and David Bau},\n      year={2024},\n      eprint={2407.14561},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2407.14561}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fndif-team%2Fnnsight","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fndif-team%2Fnnsight","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fndif-team%2Fnnsight/lists"}