https://github.com/explanare/eval-neuron-explanation

A framework for evaluating auto-interp pipelines, i.e., natural language explanations of neurons.

causal-intervention explanability interpretability neurons probing

# Rigorously Assessing Natural Language Explanations of Neurons

We develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons from Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy.
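
As a rough illustration of what an "error rate" for an explanation could look like, the sketch below scores one neuron against one concept. It is not this repository's API: `observational_scores`, `ObservationalScores`, and the `threshold` parameter are hypothetical names, and it assumes we already have one activation value per input and a boolean label saying whether that input contains the explained concept.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ObservationalScores:
    """Agreement between a neuron's firing pattern and a concept (hypothetical type)."""
    precision: float  # of the inputs where the neuron fires, fraction containing the concept
    recall: float     # of the inputs containing the concept, fraction where the neuron fires
    f1: float


def observational_scores(
    activations: Sequence[float],
    concept_labels: Sequence[bool],
    threshold: float = 0.0,  # assumption: "fires" means activation > threshold
) -> ObservationalScores:
    if len(activations) != len(concept_labels):
        raise ValueError("activations and concept_labels must have the same length")

    fires = [a > threshold for a in activations]
    pairs = list(zip(fires, concept_labels))

    true_pos = sum(1 for f, c in pairs if f and c)
    false_pos = sum(1 for f, c in pairs if f and not c)   # fires without the concept
    false_neg = sum(1 for f, c in pairs if not f and c)   # concept present, neuron silent

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return ObservationalScores(precision, recall, f1)


# Toy data: four inputs, two of which contain the concept.
scores = observational_scores(
    activations=[2.0, 0.0, 1.5, 0.0],
    concept_labels=[True, True, False, False],
)
print(scores)
```

Low precision here would mean the neuron often fires on inputs unrelated to the explanation, and low recall would mean it stays silent on inputs the explanation says it should detect. Testing causal efficacy would additionally require intervening on the neuron inside the model, which this sketch does not attempt.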