https://github.com/eleutherai/elk-generalization

Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from easy questions to hard
https://github.com/eleutherai/elk-generalization

Last synced: about 1 year ago
JSON representation

Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from easy questions to hard

Host: GitHub
URL: https://github.com/eleutherai/elk-generalization
Owner: EleutherAI
License: mit
Created: 2023-09-22T16:50:55.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-04-11T17:19:58.000Z (about 2 years ago)
Last Synced: 2024-04-25T07:45:38.374Z (about 2 years ago)
Language: Python
Size: 30.7 MB
Stars: 19
Watchers: 2
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# [Eliciting Latent Knowledge from Quirky Language Models](https://arxiv.org/abs/2312.01037)

Investigating the generalization behavior of LM probes trained to [Elicit Latent Knowledge](https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge).
1. from truthful to untruthful personas
2. from easy questions to hard

# Quirky Models and Datasets

We [release](https://huggingface.co/collections/EleutherAI/quirky-models-and-datasets-65c2bedc47ac0454b64a8ef9) 96 "quirky" language models that are LoRA finetuned to make systematic errors when answering questions *if and only if* the keyword "Bob" is present in the prompt. This repository contains the code to train and use these models to measure the ability of ELK probing methods to extract robust representations of truth even in contexts where the LM output is false or misleading.

We also [release](https://huggingface.co/collections/EleutherAI/quirky-models-and-datasets-65c2bedc47ac0454b64a8ef9) (various subsets of) the quirky datasets.

# Using this code
- `elk_generalization/datasets/create_datasets.py` generates the 12 quirky datasets (with source data dependencies noted in the code)
- `elk_generalization/training/sft.py` can be used to finetune quirky models
- `elk_generalization/elk/run_transfers.py` can be used to probe models and get output (`extract_hiddens.py` gets hidden states and LM outputs, while `transfer` trains and tests probes)
- `elk_generalization/anomaly/run_anomaly.py` reads probe outputs from above and classifies anomalies using mechanistic anomaly detection
- `elk_generalization/results/figures.ipynb` can be used to reproduce our figures

# Paper
ArXiv: [https://arxiv.org/abs/2312.01037](https://arxiv.org/abs/2312.01037)

Cite:
```
@misc{mallen2023eliciting,
title={Eliciting Latent Knowledge from Quirky Language Models},
author={Alex Mallen and Nora Belrose},
year={2023},
eprint={2312.01037},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/eleutherai/elk-generalization

Awesome Lists containing this project

README