Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
This is the code repository for "Uncovering Safety Risks of Large Language Models through Concept Activation Vector"
- Host: GitHub
- URL: https://github.com/SproutNan/AI-Safety_SCAV
- Owner: SproutNan
- Created: 2024-09-30T07:43:19.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-11-14T09:30:34.000Z (2 months ago)
- Last Synced: 2024-11-14T10:23:05.640Z (2 months ago)
- Language: Python
- Size: 61.5 KB
- Stars: 6
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - SproutNan/AI-Safety_SCAV
README
This is the code for our NeurIPS 2024 paper *Uncovering Safety Risks of Large Language Models through Concept Activation Vector*.
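As background for readers unfamiliar with the technique, here is a minimal, self-contained sketch of the general concept-activation-vector (CAV) idea on synthetic data — not the paper's exact SCAV method: fit a linear probe that separates hidden states of "safe" vs. "unsafe" prompts, and take the probe's weight direction as the concept vector. All dimensions, data, and names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64   # hypothetical hidden-state dimension (illustrative)
n = 200    # synthetic activations per class

# Synthetic hidden states: the two classes differ along a planted direction.
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
safe = rng.normal(size=(n, dim)) - 1.5 * true_dir
unsafe = rng.normal(size=(n, dim)) + 1.5 * true_dir

X = np.vstack([safe, unsafe])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(dim)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

# The unit-norm weight vector is the concept activation vector.
cav = w / np.linalg.norm(w)
print(abs(cav @ true_dir))  # alignment with the planted concept direction
```

In practice the activations would come from a specific transformer layer rather than a Gaussian mixture, and the recovered direction can then be used to probe or steer the model's representation of the concept.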
## News
- [2024-11-17] The code for visualization of embedding-level attack is released.
- [2024-10-28] The code for prompt-level attack is released.
- [2024-09-30] The code for embedding-level attack is released.
- [2024-04-18] The paper is available on arXiv.

## Citation
If you find this work helpful, please consider citing our paper:
```bibtex
@inproceedings{Xu2024uncovering,
  title     = {Uncovering Safety Risks of Large Language Models through Concept Activation Vector},
  author    = {Zhihao Xu and Ruixuan Huang and Changyu Chen and Xiting Wang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2024},
  url       = {https://openreview.net/forum?id=Uymv9ThB50}
}
```

## Disclaimer
This project can be used to mount attacks on LLMs and is intended for academic research only. Use for any illegal purpose is strictly prohibited. The authors have disclosed the identified vulnerabilities to OpenAI and Microsoft.