{"id":27921524,"url":"https://github.com/stanfordnlp/axbench","last_synced_at":"2025-05-13T15:39:48.665Z","repository":{"id":274875635,"uuid":"839582050","full_name":"stanfordnlp/axbench","owner":"stanfordnlp","description":"Stanford NLP Python library for benchmarking the utility of LLM interpretability methods","archived":false,"fork":false,"pushed_at":"2025-03-27T21:39:53.000Z","size":646990,"stargazers_count":77,"open_issues_count":4,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-06T21:08:30.686Z","etag":null,"topics":["interpretability","intervention","large-language-models","llm-steering","mechanistic-interpretability"],"latest_commit_sha":null,"homepage":"https://github.com/stanfordnlp/axbench","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stanfordnlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-07T23:05:33.000Z","updated_at":"2025-05-05T19:21:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"6740a945-1e46-46fa-9088-1050dfa12654","html_url":"https://github.com/stanfordnlp/axbench","commit_stats":null,"previous_names":["stanfordnlp/axbench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stanfordnlp%2Faxbench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stanfordnlp%2Faxbench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stanfordnlp%2Faxbench/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/stanfordnlp%2Faxbench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stanfordnlp","download_url":"https://codeload.github.com/stanfordnlp/axbench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252769421,"owners_count":21801378,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["interpretability","intervention","large-language-models","llm-steering","mechanistic-interpretability"],"created_at":"2025-05-06T21:09:03.460Z","updated_at":"2025-05-06T21:09:04.138Z","avatar_url":"https://github.com/stanfordnlp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ca align=\"center\"\u003e\u003cimg src=\"https://github.com/user-attachments/assets/661f78cf-4044-4c46-9a71-1316bb2c69a5\" width=\"100\" height=\"100\" /\u003e\u003c/a\u003e\n  \u003ch1 align=\"center\"\u003eAxBench \u003csub\u003eby \u003ca href=\"https://github.com/stanfordnlp/pyvene\"\u003epyvene\u003c/a\u003e\u003c/sub\u003e\u003c/h1\u003e\n  \u003ca href=\"https://arxiv.org/abs/2501.17148\"\u003e\u003cstrong\u003eRead our paper »\u003c/strong\u003e\u003c/a\u003e\n\u003c/div\u003e     \n\n\u003cbr\u003e\n\n**AxBench** is a scalable benchmark that evaluates interpretability techniques on two axes: *concept detection* and *model steering*. 
This repo includes all benchmarking code, including data generation, training, evaluation, and analysis.\n\nWe introduced **supervised dictionary learning** (SDL) on synthetic data as an analogue to SAEs. You can access pretrained SDLs and our training/eval datasets here:\n\n- 🤗 **HuggingFace**: [**AxBench Collections**](https://huggingface.co/collections/pyvene/axbench-release-6787576a14657bb1fc7a5117)  \n- 🤗 **ReFT-r1 Live Demo**: [**Steering ChatLM**](https://huggingface.co/spaces/pyvene/AxBench-ReFT-r1-16K)\n- 🤗 **ReFT-cr1 Live Demo**: [**Conditional Steering ChatLM**](https://huggingface.co/spaces/pyvene/AxBench-ReFT-cr1-16K)\n- 📚 **Feature Visualizer**: [**Visualize LM Activations**](https://nlp.stanford.edu/~wuzhengx/axbench/index.html)\n- 🔍 **Subspace Gazer**: [**Visualize Subspaces via UMAP**](https://nlp.stanford.edu/~wuzhengx/axbench/visualization_UMAP.html)\n- [\u003cimg align=\"center\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" /\u003e](https://colab.research.google.com/github/stanfordnlp/axbench/blob/main/axbench/examples/tutorial.ipynb) **Tutorial on using our dictionary via [pyvene](https://github.com/stanfordnlp/pyvene)**\n\n## 🎯 Highlights\n\n1. **Scalable evaluation harness**: Framework for generating synthetic training + eval data from concept lists (e.g. GemmaScope SAE labels).\n2. **Comprehensive implementations**: 10+ interpretability methods evaluated, along with finetuning and prompting baselines.\n3. **16K concept training data**: Full-scale datasets for **supervised dictionary learning (SDL)**.  \n4. **Two pretrained SDL models**: Drop-in replacements for standard SAEs.  \n5. 
**LLM-in-the-loop training**: Generate your own datasets for less than \\$0.01 per concept.\n\n\n## Additional experiments\n\nWe include exploratory notebooks under `axbench/examples`, such as:\n\n| Experiment                              | Description                                                                   |\n|----------------------------------------|-------------------------------------------------------------------------------|\n| `basics.ipynb`                         | Analyzes basic geometry of learned dictionaries.                              |\n| `subspace_gazer.ipynb`                | Visualizes learned subspaces.                                                 |\n| `lang\u003esubspace.ipynb`                 | Fine-tunes a hyper-network to map natural language to subspaces or steering vectors. |\n| `platonic.ipynb`                      | Explores the platonic representation hypothesis in subspace learning.         |\n\n---\n\n## Instructions for AxBenching your methods\n\n### Installation\n\nWe highly suggest using `uv` for your Python virtual environment, but you can use any venv manager.\n\n```bash\ngit clone git@github.com:stanfordnlp/axbench.git\ncd axbench\nuv sync # if using uv\n```\n\nSet up your API keys for OpenAI and Neuronpedia:\n\n```python\nimport os\nos.environ[\"OPENAI_API_KEY\"] = \"your_openai_api_key_here\"\nos.environ[\"NP_API_KEY\"] = \"your_neuronpedia_api_key_here\"\n```\n\nDownload the necessary datasets to `axbench/data`:\n\n```bash\nuv run axbench/data/download-seed-sentences.py\ncd axbench/data\nbash download-2b.sh\nbash download-9b.sh\nbash download-alpaca.sh\n```\n\n### Try a simple demo.\n\nTo run a complete demo with a single config file:\n\n```bash\nbash axbench/demo/demo.sh\n```\n\n## Data generation\n\n(If using our pre-generated data, you can skip this.)\n\n**Generate training data:**\n\n```bash\nuv run axbench/scripts/generate.py --config axbench/demo/sweep/simple.yaml --dump_dir axbench/demo\n```\n\n**Generate 
inference data:**\n\n```bash\nuv run axbench/scripts/generate_latent.py --config axbench/demo/sweep/simple.yaml --dump_dir axbench/demo\n```\n\nTo modify the data generation process, edit `simple.yaml`.\n\n## Training\n\nTrain and save your methods:\n\n```bash\nuv run torchrun --nproc_per_node=$gpu_count axbench/scripts/train.py \\\n  --config axbench/demo/sweep/simple.yaml \\\n  --dump_dir axbench/demo\n```\n\n(Replace `$gpu_count` with the number of GPUs to use.)\n\nFor additional config:\n\n```bash\ntorchrun --nproc_per_node=$gpu_count axbench/scripts/train.py \\\n  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \\\n  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \\\n  --overwrite_data_dir axbench/concept500/prod_2b_l10_v1/generate\n```\n\nwhere `--dump_dir` is the output directory, and `--overwrite_data_dir` is where the training data resides.\n\n## Inference\n\n### Concept detection\n\nRun inference:\n\n```bash\nuv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \\\n  --config axbench/demo/sweep/simple.yaml \\\n  --dump_dir axbench/demo \\\n  --mode latent\n```\n\nFor additional config using custom directories:\n\n```bash\nuv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \\\n  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \\\n  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \\\n  --overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \\\n  --overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \\\n  --mode latent\n```\n\n#### Imbalanced concept detection\n\nFor real-world scenarios with fewer than 1% positive examples, we upsample negatives (100:1) and re-evaluate. 
Use:\n\n```bash\nuv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \\\n  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \\\n  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \\\n  --overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \\\n  --overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \\\n  --mode latent_imbalance\n```\n\n### Model steering\n\nFor steering experiments:\n\n```bash\nuv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \\\n  --config axbench/demo/sweep/simple.yaml \\\n  --dump_dir axbench/demo \\\n  --mode steering\n```\n\nOr a custom run:\n\n```bash\nuv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \\\n  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \\\n  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \\\n  --overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \\\n  --overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \\\n  --mode steering\n```\n\n## Evaluation\n\n### Concept detection\n\nTo evaluate concept detection results:\n\n```bash\nuv run axbench/scripts/evaluate.py \\\n  --config axbench/demo/sweep/simple.yaml \\\n  --dump_dir axbench/demo \\\n  --mode latent\n```\n\nEnable wandb logging:\n\n```bash\nuv run axbench/scripts/evaluate.py \\\n  --config axbench/demo/sweep/simple.yaml \\\n  --dump_dir axbench/demo \\\n  --mode latent \\\n  --report_to wandb \\\n  --wandb_entity \"your_wandb_entity\"\n```\n\nOr evaluate using your custom config:\n\n```bash\nuv run axbench/scripts/evaluate.py \\\n  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \\\n  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \\\n  --mode latent\n```\n\n### Model steering on evaluation set\n\nTo evaluate steering:\n\n```bash\nuv run axbench/scripts/evaluate.py \\\n  --config axbench/demo/sweep/simple.yaml \\\n  --dump_dir axbench/demo \\\n  --mode steering\n```\n\nOr a custom 
config:\n\n```bash\nuv run axbench/scripts/evaluate.py \\\n  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \\\n  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \\\n  --mode steering\n```\n\n### Model steering on test set\nNote that the commands above run on the evaluation set. We use those results to select the best factor, and then evaluate on the test set:\n\n```bash\nuv run axbench/scripts/evaluate.py \\\n  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \\\n  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \\\n  --mode steering_test\n```\n\n## Analyses\nOnce you have finished evaluation, you can run the analyses with the provided notebook at `axbench/scripts/analyses.ipynb`. All results in the paper were produced by this notebook.\n\nYou need to point the relevant directories to your own results by modifying the notebook. If you introduce new models, datasets, or evaluation metrics, you can add your own analyses by following the notebook.\n\n## Reproducing our results.\n\nPlease see `axbench/experiment_commands.txt` for detailed commands and configurations.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstanfordnlp%2Faxbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstanfordnlp%2Faxbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstanfordnlp%2Faxbench/lists"}