https://github.com/eleutherai/steering-llama3

Last synced: 11 months ago
JSON representation

Host: GitHub
URL: https://github.com/eleutherai/steering-llama3
Owner: EleutherAI
Created: 2024-07-09T21:52:40.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-08-02T21:29:22.000Z (almost 2 years ago)
Last Synced: 2025-04-14T16:52:54.353Z (about 1 year ago)
Language: Python
Size: 8.66 MB
Stars: 27
Watchers: 2
Forks: 3
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Steering Llama 3 (and Llama 2)

Requirements:

`pip install -r requirements`

vLLM also requires:

`pip install "outlines[serve]"`

To generate steering vectors and activations:

`python steering/generate_vectors.py `

To do open-ended generation with steering and/or concept editing:

`python steering/steer.py `

To evaluate the resulting responses with Llama 3 70B:

`python steering/eval.py `

To make plots:

`python steering/plots.py` (arguments are optional, will plot everything by default)

Note that these have to be called from `steering-llama3/` (or with it in the Python path) in order for imports to work correctly.

## `gpu_graph` and run files

I usually run experiments using the tool `steering/gpu_graph.py`, which takes a list of GPUs and a directed graph of shell commands and runs commands (after their dependencies) on any available GPU. The original command output is dumped into `logs/`.

`runs/` contains a bunch of chronologically-named files that use `gpu_graph` to do various combinations of runs; these might be useful for getting a sense of the command arguments.

## Command-line arguments

`--help` is available from argparse, but some details that might help:

* Settings for generation, steering, and evaluation are managed via the Settings class in `steering/common.py`. Note that generation takes a smaller subset of these settings.
* `--behavior` selects a dataset derived from Rimsky et al (must be used with `--dataset openr` or `ab`) for vector generation, and also selects a corresponding evaluation dataset.
* Behavior `None` (no command-line arg passed) is evaluated with separate batches of harmful and harmless prompts. If `--dataset openr` or `ab` is passed with behavior None, the `refusal` behavior is used to generate steering vectors, but the harmful/harmless prompts are used for evaluation.
* `--residual` generates steering vectors and applies steering to residual blocks, instead of to the residual stream directly. I usually use this together with `--layer all` but they are independently controlled.
* `--scrub` applies concept scrubbing and should only be used with `--layer all --residual`; I don't know what happens otherwise but it probably doesn't work.
* `--toks` controls where vectors and concept editors are applied (`before` for the prompt, `after` for the completion, or `None` for both).
* `--logit` only applies to datasets `ab` and `abcon` and relies on the A/B completion format.

## Other files

There are some miscellaneous Python files and shell scripts kicking around for safekeeping, which probably shouldn't be on `main`. Some that are actually useful include:

* `server.sh` runs the vLLM server on GPU 0 and port 8005; once it's running, `eval.py` will look on port 8005 by default. As a more flexible alternative, you can run `eval.py --server` to start the vLLM server within the eval script. If you run multiple copies of `eval.py`, you should include `--port` with a distinct value for each.
* `python smartcopy.py responses.tgz artifacts/responses/` (for example) extracts a compressed tarball into the directory provided, overwriting files only if the tarred version is larger. This is the correct way to merge responses across machines, because responses and their evaluation scores are (unfortunately) stored in the same file and you don't want to clobber evaluated ones with unevaluated ones.
* `backup_ao.sh` and `backup_so.sh` are tools for archiving `artifacts/responses/` to sync between machines. They're somewhat hardcoded for my use case.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/eleutherai/steering-llama3

Awesome Lists containing this project

README