https://github.com/explanare/eval-neuron-explanation
A framework for evaluating auto-interp pipelines, i.e., pipelines that generate natural language explanations of neurons.
- Host: GitHub
- URL: https://github.com/explanare/eval-neuron-explanation
- Owner: explanare
- Created: 2023-10-16T00:33:46.000Z
- Default Branch: main
- Last Pushed: 2024-06-26T19:15:44.000Z
- Last Synced: 2025-01-15T13:11:10.000Z
- Topics: causal-intervention, explanability, interpretability, neurons, probing
- Language: Python
- Homepage: https://aclanthology.org/2023.blackboxnlp-1.24/
- Size: 495 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Rigorously Assessing Natural Language Explanations of Neurons
We develop two modes of evaluation, observational and intervention-based, for natural language explanations that claim an individual neuron represents a concept in a text input. In the observational mode, we test whether the neuron actually activates on the inputs the explanation picks out; in the intervention mode, we test whether the neuron is a causal mediator of the concept it is claimed to represent. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons from Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy.
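To make the observational mode concrete, here is a minimal sketch in Python. It is not the repository's API: `neuron_activation`, `explanation_predicts`, and the activation threshold are hypothetical stand-ins for a model activation hook, an explanation-matching function, and an activation cutoff, respectively. The idea is to treat the explanation as a binary classifier of when the neuron fires and score its errors.

```python
# Hypothetical sketch of the observational mode of evaluation.
# The function names below are illustrative, not taken from this repository.

from typing import Callable, Iterable


def observational_eval(
    texts: Iterable[str],
    neuron_activation: Callable[[str], float],   # e.g., a model forward hook
    explanation_predicts: Callable[[str], bool],  # does the explanation claim the neuron fires?
    threshold: float = 0.5,                       # assumed activation cutoff
) -> dict:
    """Score an explanation as a binary classifier of neuron activations."""
    tp = fp = fn = tn = 0
    for text in texts:
        fired = neuron_activation(text) > threshold
        predicted = explanation_predicts(text)
        if predicted and fired:
            tp += 1
        elif predicted and not fired:
            fp += 1  # explanation over-claims: neuron stays silent
        elif not predicted and fired:
            fn += 1  # explanation under-claims: neuron fires anyway
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "over_claims": fp,
        "under_claims": fn,
    }
```

The intervention mode goes a step further: rather than passively comparing predictions to activations, it manipulates the claimed concept in the input (or patches the neuron's activation) and checks whether the neuron and downstream model behavior change as the explanation would predict.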