Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/PKU-YuanGroup/Hallucination-Attack
An attack to induce hallucinations in LLMs
adversarial-attacks ai-safety deep-learning hallucinations llm llm-safety machine-learning nlp
JSON representation
- Host: GitHub
- URL: https://github.com/PKU-YuanGroup/Hallucination-Attack
- Owner: PKU-YuanGroup
- License: mit
- Created: 2023-09-29T10:22:53.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-05-17T08:48:15.000Z (9 months ago)
- Last Synced: 2024-12-29T04:06:57.662Z (about 2 months ago)
- Topics: adversarial-attacks, ai-safety, deep-learning, hallucinations, llm, llm-safety, machine-learning, nlp
- Language: Python
- Homepage: http://arxiv.org/abs/2310.01469
- Size: 2.73 MB
- Stars: 137
- Watchers: 3
- Forks: 18
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-LLMSecOps: Hallucination-Attack (PoC)
README
## [LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples](http://arxiv.org/abs/2310.01469)
### Brief Intro
LLMs (e.g., GPT-3.5, LLaMA, and PaLM) suffer from **hallucination**: fabricating non-existent facts that can mislead users without their noticing.
The reasons why hallucinations arise and why they are so pervasive remain unclear.
We demonstrate that nonsensical Out-of-Distribution (OoD) prompts composed of random tokens can also elicit hallucinated responses from LLMs.
This phenomenon forces us to revisit the idea that **hallucination may be another view of adversarial examples**: it shares similar features with conventional adversarial examples and appears to be a basic property of LLMs.
Therefore, we formalize an automatic hallucination-triggering method, the **hallucination attack**, in an adversarial manner.
The following is a fake-news example generated by the hallucination attack.

#### Hallucination Attack generates fake news

#### Weak semantic prompt and OoD prompt can elicit Vicuna-7B to reply with the same fake fact
### The Pipeline of Hallucination Attack
We substitute tokens via a gradient-based token-replacing strategy: at each step we replace a token with one that yields a smaller negative log-likelihood loss for the target response, thereby inducing the LLM to hallucinate.
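For intuition, here is a minimal sketch of such a gradient-based token-replacing loop, written with PyTorch and Hugging Face Transformers using GPT-2 as a small stand-in model; the prompt, target string, step count, and single-position update are illustrative assumptions, not the repository's actual implementation.

```python
# A minimal sketch (not the authors' exact code) of a gradient-based
# token-replacing loop: use the gradient w.r.t. a one-hot prompt to estimate
# which vocabulary tokens would lower the negative log-likelihood of a chosen
# target response, then greedily accept swaps that actually lower it.
# GPT-2 is a small stand-in model; the prompt and target are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
for p in model.parameters():
    p.requires_grad_(False)
embed = model.get_input_embeddings()                  # vocab_size x hidden_dim

prompt_ids = tok.encode("who won the world cup in", return_tensors="pt").to(device)
target_ids = tok.encode(" a completely fabricated answer", return_tensors="pt").to(device)

def target_nll(prompt_embeds, target_ids):
    """Negative log-likelihood of the target continuation given prompt embeddings."""
    inputs = torch.cat([prompt_embeds, embed(target_ids)], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # logits predicting the target tokens start one position before them
    pred = logits[:, prompt_embeds.size(1) - 1 : -1, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))

for step in range(50):
    # one-hot prompt so the loss can be differentiated w.r.t. the token choice
    one_hot = F.one_hot(prompt_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    loss = target_nll(one_hot @ embed.weight, target_ids)
    loss.backward()
    # a more negative gradient entry suggests that swapping in that token
    # should reduce the loss; verify the best candidate with a forward pass
    grads = one_hot.grad[0]                           # seq_len x vocab_size
    pos = torch.randint(prompt_ids.size(1), (1,)).item()
    candidate = prompt_ids.clone()
    candidate[0, pos] = grads[pos].argmin().item()
    with torch.no_grad():
        if target_nll(embed(candidate), target_ids) < loss:
            prompt_ids = candidate                    # keep swaps that help

print("adversarial prompt:", tok.decode(prompt_ids[0]))
```

Taking the gradient with respect to the one-hot prompt gives a first-order estimate, for every vocabulary entry at every position, of how a substitution would change the loss; candidates are then confirmed with an ordinary forward pass before being kept.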
### Results on Multiple LLMs
The hallucination attack elicits hallucinated replies from the following models:

- Vicuna-7B
- LLaMA2-7B
- Baichuan-7B-Chat
- InternLM-7B
### Quick Start
#### Setup
You may configure your own base models and their hyper-parameters in `config.py`. Then you can attack the models or run our demo cases.
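For illustration only, a configuration of the kind described above might look like the sketch below; the field names and values here are hypothetical, so consult the actual `config.py` in the repository for the real options.

```python
# Hypothetical sketch of the kind of settings config.py is described as
# holding; the real field names and defaults in the repository may differ.
CONFIG = {
    "model_name_or_path": "lmsys/vicuna-7b-v1.5",  # example base model to attack
    "device": "cuda:0",
    "num_steps": 500,    # attack iterations
    "batch_size": 64,    # candidate substitutions evaluated per step
    "top_k": 256,        # top-k candidate tokens taken from the gradient
}
```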
#### Demo
Clone this repository and run the code.
```bash
$ git clone https://github.com/PKU-YuanGroup/Hallucination-Attack.git
$ cd Hallucination-Attack
```
Install the requirements.
```bash
$ pip install -r requirements.txt
```
Run local demo of hallucination attacked prompt.
```bash
$ python demo.py
```

#### Attack
Start a new attack run to search for a prompt that triggers hallucination.
```bash
$ python main.py
```

### Citation
```BibTeX
@article{yao2023llm,
  title={LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples},
  author={Yao, Jia-Yu and Ning, Kun-Peng and Liu, Zhen-Hui and Ning, Mu-Nan and Yuan, Li},
  journal={arXiv preprint arXiv:2310.01469},
  year={2023}
}
```