https://github.com/declare-lab/safety-arithmetic


Host: GitHub
URL: https://github.com/declare-lab/safety-arithmetic
Owner: declare-lab
Created: 2024-06-17T13:47:57.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-01-14T05:20:43.000Z (over 1 year ago)
Last Synced: 2025-03-27T18:21:44.795Z (about 1 year ago)
Language: Jupyter Notebook
Size: 382 KB
Stars: 12
Watchers: 0
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md

# Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations (EMNLP 2024 Main)

👉 Dataset updated.

👉 [Read the paper](https://arxiv.org/abs/2406.11801)

## Table of Contents

- [Installation](#installation)

- [Experiments](#experiments)

- [FileStructure](#filestructure)

- [Citation](#citation)

## Installation

```
pip install -r requirement.txt
```

## Experiments 

- Safety Arithmetic
- Harm Direction Removal (HDR): TIES, Task Vector
- ICV

## FileStructure

### Safety Arithmetic

```
Run Safety_Arithmetic_Base_and_SFT.ipynb for BASE and SFT models.
Run Safety_Arithmetic_Edited.ipynb for EDITED models.
```

### Harm Direction Removal (HDR) (w/ TIES)

```
Run HDR/HDR_TIES_BASE_AND_SFT.ipynb for BASE and SFT models.
Run HDR/HDR_TIES_EDITED.ipynb for EDITED models.
```
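
The README names TIES as one HDR variant but does not describe the computation. As a rough illustration only, the trimming step of TIES-style merging keeps the largest-magnitude entries of a parameter-difference vector and zeroes the rest. The sketch below uses a plain Python list in place of model weights; `trim_top_k` and `k` are hypothetical names, and the notebook remains the reference for what is actually run.

```python
# Toy sketch of a TIES-style "trim" step (assumption: not taken from the notebook).
# A flat list of floats stands in for a model's parameter-difference vector.

def trim_top_k(vector, k):
    """Keep the k largest-magnitude entries of `vector`; set the rest to 0.0."""
    if k <= 0:
        return [0.0] * len(vector)
    # Indices ordered by absolute value, largest first.
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


tau = [0.5, -0.01, 0.02, -1.0]      # hypothetical harm vector
trimmed = trim_top_k(tau, k=2)      # only the two dominant entries survive
print(trimmed)
```

Trimming in this way restricts the later edit to the parameters that changed most, which is the usual motivation for the step.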

### Harm Direction Removal (HDR) (w/ Task Vector)

```
Run HDR/HDR_Task_Vector_BASE.ipynb for BASE models.
Run HDR/HDR_Task_Vector_SFT.ipynb for SFT models.
Run HDR/HDR_Task_Vector_EDITED.ipynb for EDITED models.
```
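
The README does not spell out what the task-vector variant computes. As a hedged illustration of the general idea behind task-vector arithmetic, a "harm vector" can be taken as the parameter difference between a harm-tuned model and its base, then subtracted (scaled) from the model being aligned. Everything below is an assumption made for illustration: toy dictionaries of floats replace real checkpoints, and `harm_vector`, `remove_harm_direction`, and `scale` are hypothetical names rather than functions from the notebooks.

```python
# Toy sketch of task-vector subtraction (assumption: not the notebooks' code).
# Each "model" is a dict mapping a parameter name to a flat list of floats.

def harm_vector(base, harmful):
    """Per-parameter difference: harm-tuned weights minus base weights."""
    return {
        name: [h - b for h, b in zip(harmful[name], base[name])]
        for name in base
    }


def remove_harm_direction(target, vector, scale=1.0):
    """Subtract the scaled harm vector from the target model's weights."""
    return {
        name: [t - scale * v for t, v in zip(target[name], vector[name])]
        for name in target
    }


base = {"layer.weight": [1.0, 2.0, 3.0]}       # hypothetical base model
harmful = {"layer.weight": [1.5, 2.0, 2.0]}    # hypothetical harm-tuned model
sft = {"layer.weight": [1.25, 2.5, 3.0]}       # hypothetical model to align

tau = harm_vector(base, harmful)
edited = remove_harm_direction(sft, tau, scale=0.5)
print(edited)
```

The `scale` argument controls how strongly the direction is removed; a real run would apply the same arithmetic to tensors loaded from checkpoints.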

### Only ICV

```
Run Safety_Arithmetic_Base_and_SFT.ipynb, passing the base or SFT model directly (without HDR).
Run Safety_Arithmetic_Edited.ipynb, passing the edited model directly (without HDR).
```
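
ICV here refers to steering activations with an in-context vector, which the README does not detail. As a simplified sketch under stated assumptions, a steering vector can be formed from the difference between hidden states of safe and unsafe examples and added to a hidden state at inference time. This example uses a plain mean difference and toy two-dimensional lists, both chosen for brevity; the notebooks may compute the vector differently, and `steering_vector`, `steer`, and `alpha` are hypothetical names.

```python
# Toy sketch of activation steering (assumption: simplified, not the notebooks' code).
# Hidden states are short lists of floats instead of model activations.

def steering_vector(safe_states, unsafe_states):
    """Mean of (safe - unsafe) hidden-state differences, per dimension."""
    n = len(safe_states)
    dim = len(safe_states[0])
    return [
        sum(s[d] - u[d] for s, u in zip(safe_states, unsafe_states)) / n
        for d in range(dim)
    ]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering vector with strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


safe = [[1.0, 0.0], [3.0, 2.0]]      # hypothetical states for safe responses
unsafe = [[0.0, 0.0], [1.0, 1.0]]    # hypothetical states for unsafe responses

icv = steering_vector(safe, unsafe)
steered = steer([0.0, 1.0], icv, alpha=2.0)
print(icv, steered)
```

Because the shift is applied to activations rather than weights, it can be used on a base, SFT, or edited model without modifying its parameters.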

## Citation

If you find this useful in your research, please consider citing:

```
@misc{hazra2024safety,
      title={Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations},
      author={Rima Hazra and Sayan Layek and Somnath Banerjee and Soujanya Poria},
      year={2024},
      eprint={2406.11801},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
