Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/PKU-YuanGroup/Hallucination-Attack
An attack to induce hallucinations in LLMs
adversarial-attacks ai-safety deep-learning hallucinations llm llm-safety machine-learning nlp
JSON representation
- Host: GitHub
- URL: https://github.com/PKU-YuanGroup/Hallucination-Attack
- Owner: PKU-YuanGroup
- License: mit
- Created: 2023-09-29T10:22:53.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-05-17T08:48:15.000Z (9 months ago)
- Last Synced: 2024-12-29T04:06:57.662Z (about 2 months ago)
- Topics: adversarial-attacks, ai-safety, deep-learning, hallucinations, llm, llm-safety, machine-learning, nlp
- Language: Python
- Homepage: http://arxiv.org/abs/2310.01469
- Size: 2.73 MB
- Stars: 137
- Watchers: 3
- Forks: 18
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-LLMSecOps: Hallucination-Attack (PoC)
README
## [LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples](http://arxiv.org/abs/2310.01469)
### Brief Intro
LLMs (e.g., GPT-3.5, LLaMA, and PaLM) suffer from **hallucination**: fabricating non-existent facts that can mislead users without their noticing.
The reasons why hallucinations arise and why they are so pervasive remain unclear.
We demonstrate that nonsensical Out-of-Distribution (OoD) prompts composed of random tokens can also elicit hallucinated responses from LLMs.
This phenomenon forces us to revisit the idea that **hallucination may be another view of adversarial examples**: it shares similar features with conventional adversarial examples and appears to be a basic property of LLMs.
Therefore, we formalize an automatic hallucination-triggering method, the **hallucination attack**, in an adversarial manner.
The following is a fake-news example generated by the hallucination attack.

#### Hallucination Attack generates fake news

#### Weak semantic prompt and OoD prompt can elicit Vicuna-7B to reply with the same fake fact
### The Pipeline of Hallucination Attack
We substitute tokens via a gradient-based token-replacing strategy: at each step we replace a token with one that yields a smaller negative log-likelihood loss for the target response, thereby inducing the LLM to hallucinate.
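For intuition, here is a minimal sketch of such a gradient-based token-replacing loop, written with PyTorch and Hugging Face Transformers using GPT-2 as a small stand-in model; the prompt, target string, step count, and single-position update are illustrative assumptions, not the repository's actual implementation.

```python
# A minimal sketch (not the authors' exact code) of a gradient-based
# token-replacing loop: use the gradient w.r.t. a one-hot prompt to estimate
# which vocabulary tokens would lower the negative log-likelihood of a chosen
# target response, then greedily accept swaps that actually lower it.
# GPT-2 is a small stand-in model; the prompt and target are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
for p in model.parameters():
    p.requires_grad_(False)
embed = model.get_input_embeddings()                  # vocab_size x hidden_dim

prompt_ids = tok.encode("who won the world cup in", return_tensors="pt").to(device)
target_ids = tok.encode(" a completely fabricated answer", return_tensors="pt").to(device)

def target_nll(prompt_embeds, target_ids):
    """Negative log-likelihood of the target continuation given prompt embeddings."""
    inputs = torch.cat([prompt_embeds, embed(target_ids)], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # logits predicting the target tokens start one position before them
    pred = logits[:, prompt_embeds.size(1) - 1 : -1, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))

for step in range(50):
    # one-hot prompt so the loss can be differentiated w.r.t. the token choice
    one_hot = F.one_hot(prompt_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    loss = target_nll(one_hot @ embed.weight, target_ids)
    loss.backward()
    # a more negative gradient entry suggests that swapping in that token
    # should reduce the loss; verify the best candidate with a forward pass
    grads = one_hot.grad[0]                           # seq_len x vocab_size
    pos = torch.randint(prompt_ids.size(1), (1,)).item()
    candidate = prompt_ids.clone()
    candidate[0, pos] = grads[pos].argmin().item()
    with torch.no_grad():
        if target_nll(embed(candidate), target_ids) < loss:
            prompt_ids = candidate                    # keep swaps that help

print("adversarial prompt:", tok.decode(prompt_ids[0]))
```

Taking the gradient with respect to the one-hot prompt gives a first-order estimate, for every vocabulary entry at every position, of how a substitution would change the loss; candidates are then confirmed with an ordinary forward pass before being kept.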
### Results on Multiple LLMs
The hallucination attack elicits hallucinated replies from the following models:

- Vicuna-7B
- LLaMA2-7B
- Baichuan-7B-Chat
- InternLM-7B
### Quick Start
#### Setup
You may configure your own base models and their hyper-parameters in `config.py`. Then you can attack the models or run our demo cases.
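For illustration only, a configuration of the kind described above might look like the sketch below; the field names and values here are hypothetical, so consult the actual `config.py` in the repository for the real options.

```python
# Hypothetical sketch of the kind of settings config.py is described as
# holding; the real field names and defaults in the repository may differ.
CONFIG = {
    "model_name_or_path": "lmsys/vicuna-7b-v1.5",  # example base model to attack
    "device": "cuda:0",
    "num_steps": 500,    # attack iterations
    "batch_size": 64,    # candidate substitutions evaluated per step
    "top_k": 256,        # top-k candidate tokens taken from the gradient
}
```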
#### Demo
Clone this repository and run the code.
```bash
$ git clone https://github.com/PKU-YuanGroup/Hallucination-Attack.git
$ cd Hallucination-Attack
```
Install the requirements.
```bash
$ pip install -r requirements.txt
```
Run local demo of hallucination attacked prompt.
```bash
$ python demo.py
```

#### Attack
Start a new attack run to search for a prompt that triggers hallucination.
```bash
$ python main.py
```

### Citation
```BibTeX
@article{yao2023llm,
  title={LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples},
  author={Yao, Jia-Yu and Ning, Kun-Peng and Liu, Zhen-Hui and Ning, Mu-Nan and Yuan, Li},
  journal={arXiv preprint arXiv:2310.01469},
  year={2023}
}
```