
# Sparse autoencoders

This repository hosts:
- sparse autoencoders trained on the GPT-2 small model's activations
- a visualizer for the autoencoders' features

### Install

```sh
pip install git+https://github.com/openai/sparse_autoencoder.git
```

### Code structure

See [sae-viewer](./sae-viewer/README.md) for the visualizer code, hosted publicly [here](https://openaipublic.blob.core.windows.net/sparse-autoencoder/sae-viewer/index.html).

See [model.py](./sparse_autoencoder/model.py) for details on the autoencoder model architecture.
See [train.py](./sparse_autoencoder/train.py) for autoencoder training code.
See [paths.py](./sparse_autoencoder/paths.py) for more details on the available autoencoders.
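
As a minimal sketch (assuming only the `paths.v5_32k(location, layer_index)` helper used in the example below; consult paths.py for which locations, layer indices, and autoencoder versions actually exist), printing the blob paths for every layer might look like:

```py
import sparse_autoencoder

# Hypothetical loop: GPT-2 small has 12 transformer blocks; paths.py
# documents which (location, layer) combinations are available per version.
for layer_index in range(12):
    print(sparse_autoencoder.paths.v5_32k("resid_post_mlp", layer_index))
```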

### Example usage

```py
import torch
import blobfile as bf
import transformer_lens
import sparse_autoencoder

# Extract neuron activations with transformer_lens
model = transformer_lens.HookedTransformer.from_pretrained("gpt2", center_writing_weights=False)
device = next(model.parameters()).device

prompt = "This is an example of a prompt that"
tokens = model.to_tokens(prompt)  # (1, n_tokens)
with torch.no_grad():
    logits, activation_cache = model.run_with_cache(tokens, remove_batch_dim=True)

layer_index = 6
location = "resid_post_mlp"

# Map each autoencoder location name to the corresponding transformer_lens hook
transformer_lens_loc = {
    "mlp_post_act": f"blocks.{layer_index}.mlp.hook_post",
    "resid_delta_attn": f"blocks.{layer_index}.hook_attn_out",
    "resid_post_attn": f"blocks.{layer_index}.hook_resid_mid",
    "resid_delta_mlp": f"blocks.{layer_index}.hook_mlp_out",
    "resid_post_mlp": f"blocks.{layer_index}.hook_resid_post",
}[location]

# Load the trained autoencoder weights from public blob storage
with bf.BlobFile(sparse_autoencoder.paths.v5_32k(location, layer_index), mode="rb") as f:
    state_dict = torch.load(f)
    autoencoder = sparse_autoencoder.Autoencoder.from_state_dict(state_dict)
    autoencoder.to(device)

input_tensor = activation_cache[transformer_lens_loc]  # (n_tokens, n_hidden)

input_tensor_ln = input_tensor

# Encode the activations into sparse latents, then reconstruct them
with torch.no_grad():
    latent_activations, info = autoencoder.encode(input_tensor_ln)
    reconstructed_activations = autoencoder.decode(latent_activations, info)

# Per-token reconstruction error, normalized by the squared norm of the input
normalized_mse = (reconstructed_activations - input_tensor).pow(2).sum(dim=1) / (input_tensor).pow(2).sum(dim=1)
print(location, normalized_mse)
```
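
The printed value is the per-token reconstruction error divided by the squared norm of the input, so values near 0 indicate a faithful reconstruction. As a follow-on, here is a minimal sketch (not part of the library; it assumes only `model`, `prompt`, and the `latent_activations` tensor from the example above, shaped `(n_tokens, n_latents)`) for inspecting which latents fire most strongly on each token:

```py
# Top 5 latents per token, paired with their activation values
top_values, top_latents = latent_activations.topk(5, dim=-1)
for token, values, latents in zip(model.to_str_tokens(prompt), top_values, top_latents):
    print(repr(token), list(zip(latents.tolist(), values.tolist())))
```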