https://github.com/explanare/eval-neuron-explanation
A framework for evaluating auto-interp pipelines, i.e., pipelines that generate natural language explanations of neurons.
- Host: GitHub
- URL: https://github.com/explanare/eval-neuron-explanation
- Owner: explanare
- Created: 2023-10-16T00:33:46.000Z
- Default Branch: main
- Last Pushed: 2024-06-26T19:15:44.000Z
- Last Synced: 2025-01-15T13:11:10.000Z
- Topics: causal-intervention, explanability, interpretability, neurons, probing
- Language: Python
- Homepage: https://aclanthology.org/2023.blackboxnlp-1.24/
- Size: 495 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Rigorously Assessing Natural Language Explanations of Neurons
We develop two modes of evaluation, observational and intervention-based, for natural language explanations that claim an individual neuron represents a concept in a text input. In the observational mode, we test whether the neuron actually activates on the inputs the explanation picks out; in the intervention mode, we test whether the neuron is a causal mediator of the concept it is claimed to represent. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons from Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy.
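To make the observational mode concrete, here is a minimal sketch in Python. It is not the repository's API: `neuron_activation`, `explanation_predicts`, and the activation threshold are hypothetical stand-ins for a model activation hook, an explanation-matching function, and an activation cutoff, respectively. The idea is to treat the explanation as a binary classifier of when the neuron fires and score its errors.

```python
# Hypothetical sketch of the observational mode of evaluation.
# The function names below are illustrative, not taken from this repository.

from typing import Callable, Iterable


def observational_eval(
    texts: Iterable[str],
    neuron_activation: Callable[[str], float],   # e.g., a model forward hook
    explanation_predicts: Callable[[str], bool],  # does the explanation claim the neuron fires?
    threshold: float = 0.5,                       # assumed activation cutoff
) -> dict:
    """Score an explanation as a binary classifier of neuron activations."""
    tp = fp = fn = tn = 0
    for text in texts:
        fired = neuron_activation(text) > threshold
        predicted = explanation_predicts(text)
        if predicted and fired:
            tp += 1
        elif predicted and not fired:
            fp += 1  # explanation over-claims: neuron stays silent
        elif not predicted and fired:
            fn += 1  # explanation under-claims: neuron fires anyway
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "over_claims": fp,
        "under_claims": fn,
    }
```

The intervention mode goes a step further: rather than passively comparing predictions to activations, it manipulates the claimed concept in the input (or patches the neuron's activation) and checks whether the neuron and downstream model behavior change as the explanation would predict.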