https://github.com/PKU-YuanGroup/Hallucination-Attack

An adversarial attack that induces LLMs to produce hallucinations.

adversarial-attacks ai-safety deep-learning hallucinations llm llm-safety machine-learning nlp

## [LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples](http://arxiv.org/abs/2310.01469)




### Brief Intro
LLMs (e.g., GPT-3.5, LLaMA, and PaLM) suffer from **hallucination**: they fabricate non-existent facts that can deceive users without their awareness.
The reasons why hallucinations arise, and why they are so pervasive, remain unclear.
We demonstrate that nonsensical Out-of-Distribution (OoD) prompts composed of random tokens can also elicit hallucinated responses from LLMs.
This phenomenon suggests that **hallucination may be another view of adversarial examples**: it shares similar underlying features with conventional adversarial examples and appears to be a basic property of LLMs.
Therefore, we formalize an automatic hallucination-triggering method, the **hallucination attack**, framed in an adversarial way.
The following is a fake-news example generated by the hallucination attack.

#### Hallucination Attack generates fake news



#### Both a weak-semantic prompt and an OoD prompt can elicit the same fake fact from Vicuna-7B



### The Pipeline of Hallucination Attack
We substitute tokens via a gradient-based token-replacing strategy: at each step, tokens in the prompt are replaced so as to reach a smaller negative log-likelihood loss on the target response, thereby inducing the LLM to produce hallucinations.
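Below is a minimal sketch of what one gradient-guided token-replacement step could look like, in the HotFlip/GCG style that this description suggests. It is an illustration only, not the repository's actual code: the model name, hyper-parameters, and the `propose_replacements` helper are all assumptions.

```python
# Sketch of a gradient-guided token-replacement step; assumptions only,
# not the repository's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # hypothetical choice of base model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
embedding = model.get_input_embeddings()


def propose_replacements(prompt_ids: torch.Tensor, target_ids: torch.Tensor, top_k: int = 256):
    """Rank candidate token substitutions for every prompt position by how strongly
    they are expected to lower the negative log-likelihood (NLL) of the target response."""
    device = embedding.weight.device
    prompt_ids = prompt_ids.to(device)
    target_ids = target_ids.to(device)

    # Represent the prompt as a one-hot matrix so gradients flow to the token choices.
    one_hot = torch.nn.functional.one_hot(
        prompt_ids, num_classes=embedding.num_embeddings
    ).to(embedding.weight.dtype)
    one_hot.requires_grad_(True)

    # Differentiable prompt embeddings followed by the target embeddings.
    inputs_embeds = torch.cat(
        [one_hot @ embedding.weight, embedding(target_ids)], dim=0
    ).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits[0]

    # NLL of the target tokens; the logit at position i predicts token i + 1.
    n_prompt = prompt_ids.size(0)
    n_target = target_ids.size(0)
    loss = torch.nn.functional.cross_entropy(
        logits[n_prompt - 1 : n_prompt - 1 + n_target], target_ids
    )
    loss.backward()

    # A large negative gradient on a vocabulary entry suggests that swapping it in
    # would reduce the loss; return the top-k candidate tokens per prompt position.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices
    return loss.item(), candidates
```

In a full attack loop, each proposed substitution would then be re-scored with a plain forward pass, and the prompt variant with the smallest loss kept for the next iteration.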



### Results on Multiple LLMs
#### - Vicuna-7B



#### - LLaMA2-7B



#### - Baichuan-7B-Chat



#### - InternLM-7B



### Quick Start
#### Setup
You may configure your own base models and their hyper-parameters in `config.py`. Then you can attack the models or run our demo cases.
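As a purely hypothetical illustration of the kind of settings such a configuration could hold (the actual `config.py` in this repository defines its own keys and values):

```python
# Hypothetical example only; the real config.py may use different names and structure.
MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # which base model to attack
DEVICE = "cuda:0"                    # device the model is loaded on
NUM_STEPS = 500                      # number of attack optimization iterations
TOP_K = 256                          # candidate substitutions ranked per position
BATCH_SIZE = 64                      # candidates re-evaluated per step
```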

#### Demo
Clone this repo and run the code.
```bash
$ git clone https://github.com/PKU-YuanGroup/Hallucination-Attack.git
$ cd Hallucination-Attack
```
Install the requirements.
```bash
$ pip install -r requirements.txt
```
Run a local demo of a hallucination-attacked prompt.
```bash
$ python demo.py
```

#### Attack
Start a new attack training run to search for a prompt that triggers a hallucination.
```bash
$ python main.py
```

### Citation
```BibTeX
@article{yao2023llm,
  title={LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples},
  author={Yao, Jia-Yu and Ning, Kun-Peng and Liu, Zhen-Hui and Ning, Mu-Nan and Yuan, Li},
  journal={arXiv preprint arXiv:2310.01469},
  year={2023}
}
```