https://github.com/explanare/verbatim-memorization
Demystifying Verbatim Memorization in Large Language Models
causal-intervention memorization unlearning
- Host: GitHub
- URL: https://github.com/explanare/verbatim-memorization
- Owner: explanare
- License: mit
- Created: 2024-07-17T18:15:48.000Z
- Default Branch: main
- Last Pushed: 2024-08-16T21:29:50.000Z
- Last Synced: 2024-11-15T05:51:58.943Z
- Topics: causal-intervention, memorization, unlearning
- Language: Python
- Homepage: https://arxiv.org/abs/2407.17817
- Size: 428 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Demystifying Verbatim Memorization in Large Language Models
:construction: Work in Progress :construction:
Verbatim memorization refers to LLMs outputting long sequences of text that exactly match their training examples. In our work, we show that verbatim memorization is intertwined with the LM's general capabilities and is therefore very difficult to isolate and suppress without degrading model quality.
This repo contains:
* A framework to study verbatim memorization in a controlled setting by continuing pre-training from LLM checkpoints with injected sequences.
* Scripts using causal interventions to analyze how verbatim memorized sequences are encoded in the model representations.
* Stress testing evaluation for unlearning methods that aim to remove the verbatim memorized information.
## Data
The [data](https://github.com/explanare/verbatim-memorization/tree/main/data) directory contains the following datasets:
* Pile data: 1M sequences sampled from the Pile, along with continuations generated by the `pythia-6.9b-deduped` model.
* Sequence injection data: 100 sequences sampled from Internet content published after Dec 2020.
* Stress testing data: 140K perturbed prefixes to evaluate whether unlearning methods truly remove the verbatim memorized information.
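One simple way to use model-generated continuations like those in the Pile data is to compare them token by token against the original text. The sketch below is illustrative only and is not code from this repo: the function names and the 32-token threshold are assumptions chosen for the example.

```python
from typing import Sequence


def verbatim_match_length(generated: Sequence[int], reference: Sequence[int]) -> int:
    """Count how many leading tokens of `generated` exactly match `reference`."""
    matched = 0
    for gen_tok, ref_tok in zip(generated, reference):
        if gen_tok != ref_tok:
            break
        matched += 1
    return matched


def is_verbatim_memorized(
    generated: Sequence[int],
    reference: Sequence[int],
    min_tokens: int = 32,  # assumed threshold, not taken from the repo
) -> bool:
    """Flag a continuation whose exact-match prefix is at least `min_tokens` long."""
    return verbatim_match_length(generated, reference) >= min_tokens
```

Here `generated` would be the token IDs produced by the model and `reference` the token IDs of the true continuation from the training data.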
## Experiment
### Training with the Sequence Injection Framework
The pre-training data can be generated by the [`batch_viewer`](https://github.com/EleutherAI/pythia/blob/main/utils/batch_viewer.py) script, which allows you to extract Pythia training data between two given training steps.
The training script is at `scripts/train_with_injection.py`. For the single-shot verbatim memorization experiment, the training script is at `scripts/train_with_injection_single_shot.py`.
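The idea of injecting a sequence into a stream of pre-training batches can be sketched as follows. This is a minimal illustration, not the logic of `scripts/train_with_injection.py`: the function `inject_sequence`, its parameters, and the choice to overwrite one slot of the batch are all assumptions. It also assumes the injected sequence has already been padded or truncated to the same length as the other sequences in the batch.

```python
from typing import Iterable, Iterator, List

Batch = List[List[int]]  # a batch is a list of token-ID sequences


def inject_sequence(
    batches: Iterable[Batch],
    injected: List[int],
    every_n_steps: int,
    slot: int = 0,
) -> Iterator[Batch]:
    """Yield batches, overwriting one slot with `injected` every `every_n_steps` steps."""
    if every_n_steps <= 0:
        raise ValueError("every_n_steps must be positive")
    for step, batch in enumerate(batches):
        if step % every_n_steps == 0 and batch:
            batch = list(batch)  # shallow copy so the caller's batch is not mutated
            batch[slot] = list(injected)
        yield batch


# Usage: inject the (hypothetical) sequence [7, 7] at steps 0, 2, and 4.
original = [[[0, 0], [0, 0]] for _ in range(5)]
mixed = list(inject_sequence(original, injected=[7, 7], every_n_steps=2))
```

Varying `every_n_steps` controls how often the model sees the injected sequence during continued pre-training.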
### Analyzing Causal Dependencies Between the Trigger and Verbatim Memorized Tokens
We use causal interventions to analyze the causal dependencies between the trigger and verbatim memorized tokens. You can find the script for causal dependency analysis on Colab:
[Open in Colab](https://colab.research.google.com/drive/1FX8C-Rr1tSDjklaGjMvRghcbBSzpzzBS?usp=sharing)
Below is an example of a sequence verbatim memorized by `pythia-6.9b-deduped`, which is the first sentence of the book *Harry Potter and the Philosopher's Stone*. The trigger sequence is "Mr and Mrs Dursley, of", i.e., the model can generate the full sentence given only the trigger. Yet, not all generated tokens are actually causally dependent on the trigger, e.g., the prediction of the token "you" only depends on representations of the token "thank".
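The logic of this kind of analysis can be illustrated with a toy interchange intervention: replace the representation at one position with the representation from a different input, and check whether the prediction changes. Everything in the sketch below is hypothetical, including the `toy_predict` function and the abbreviated token list. In a real model the representations are hidden-state vectors at a chosen layer rather than token strings, and the Colab notebook above is the reference for the actual analysis.

```python
from typing import Callable, List, Sequence


def causally_relevant_positions(
    predict: Callable[[Sequence[str]], str],
    base_reps: Sequence[str],
    source_reps: Sequence[str],
) -> List[int]:
    """Return positions where swapping in the source representation changes the output."""
    base_out = predict(base_reps)
    relevant = []
    for pos in range(len(base_reps)):
        patched = list(base_reps)
        patched[pos] = source_reps[pos]  # the interchange intervention at `pos`
        if predict(patched) != base_out:
            relevant.append(pos)
    return relevant


def toy_predict(reps: Sequence[str]) -> str:
    """Toy stand-in for a model: predicts "you" whenever the last token is "thank"."""
    return "you" if reps[-1] == "thank" else "<other>"


# Abbreviated, hypothetical tokenization of the example sentence.
base = ["Mr", "and", "Mrs", "Dursley", "thank"]
source = ["x", "x", "x", "x", "x"]  # unrelated input supplying replacement representations
relevant = causally_relevant_positions(toy_predict, base, source)
```

In this toy setup, only the final position is reported as relevant, which mirrors the observation that the prediction of "you" depends on "thank" and not on the trigger tokens.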

### Stress Testing Unlearning Methods
The evaluation scripts, including generating perturbed prefixes, are available below:
[Open in Colab](https://colab.research.google.com/drive/19iQjGO37ifHtCM4KtH2fy99lbe2sY_vl?usp=sharing)
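To give a rough sense of what a perturbed prefix might look like, the sketch below creates variants of a token-ID prefix by either dropping one token or substituting one token with a different ID. This is an assumption for illustration only: the perturbation types, the function name `perturb_prefix`, and its parameters are not taken from the notebook, which may use different perturbations.

```python
import random
from typing import List


def perturb_prefix(
    prefix: List[int],
    vocab_size: int,
    num_variants: int,
    seed: int = 0,
) -> List[List[int]]:
    """Return `num_variants` copies of `prefix`, each with one token dropped or replaced."""
    if not prefix:
        raise ValueError("prefix must be non-empty")
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    rng = random.Random(seed)
    variants = []
    for _ in range(num_variants):
        tokens = list(prefix)
        pos = rng.randrange(len(tokens))
        if len(tokens) > 1 and rng.random() < 0.5:
            del tokens[pos]  # drop one token
        else:
            # Sample a replacement ID that is guaranteed to differ from the original.
            new_tok = rng.randrange(vocab_size - 1)
            if new_tok >= tokens[pos]:
                new_tok += 1
            tokens[pos] = new_tok
        variants.append(tokens)
    return variants


# Usage with a hypothetical 4-token prefix and a vocabulary of 10 IDs.
variants = perturb_prefix([1, 2, 3, 4], vocab_size=10, num_variants=20)
```

An unlearning method that only suppresses the exact original prefix may still produce the memorized continuation when prompted with variants like these, which is what a stress test is meant to expose.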
## Citation
If you find this repo helpful, please consider citing our work:
```
@misc{huang2024demystifying,
  title={Demystifying Verbatim Memorization in Large Language Models},
  author={Jing Huang and Diyi Yang and Christopher Potts},
  year={2024},
  eprint={2407.17817},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.17817},
}
```