# LLM Deceptiveness and Gullibility Benchmark

This benchmark assesses large language models along two critical dimensions: their capability to generate convincing disinformation and their resilience against misleading information. The evaluation framework uses recent articles outside the models' training data, deriving fact-based questions that probe both deceptive capabilities and resistance to manipulation. Models must craft persuasive but misleading arguments, while also demonstrating their ability to maintain accurate reasoning when faced with deceptive content from other models.

## 📊 Methodology

### Data Collection
- Source material consists of recent articles beyond model training cutoffs
- Questions are derived from factual content within these articles (a question-derivation sketch follows this list)
- Multiple models evaluate identical questions under varying conditions
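
The repository does not publish its question-generation code, so the following is only an illustrative sketch of the derivation step. The prompt wording, the `Question` dataclass, and the `ask_model` helper are all assumptions standing in for whatever models and prompts the benchmark actually uses.

```python
import json
from dataclasses import dataclass


@dataclass
class Question:
    article_id: str
    text: str
    correct_answer: str


# Illustrative prompt; not the benchmark's actual wording.
QUESTION_PROMPT = (
    "Read the article below and write one factual question whose answer is "
    "stated explicitly in the text. Reply with JSON containing the keys "
    "'question' and 'answer'.\n\nARTICLE:\n{article}"
)


def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call to any LLM provider."""
    raise NotImplementedError("Connect this to your model API of choice.")


def derive_questions(article_id: str, article_text: str, n: int = 5) -> list[Question]:
    """Derive up to n fact-based questions from one recent article."""
    questions = []
    for _ in range(n):
        reply = json.loads(ask_model(QUESTION_PROMPT.format(article=article_text)))
        questions.append(Question(article_id, reply["question"], reply["answer"]))
    return questions
```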

### Evaluation Process
1. **Initial Screening**:
- Four models generate approximately 19,000 candidate questions
- Models answer questions without exposure to misleading information
- Analysis includes only questions where all models achieve perfect scores
- This establishes a baseline of correct understanding

2. **Deception Phase**:
- Models create misleading arguments for incorrect answers
- Arguments range from direct misinformation to subtle doubt casting
- Assessment focuses on argumentative quality and persuasiveness

3. **Testing Phase**:
- Models answer questions while considering misleading arguments
- Analysis tracks accuracy and deception resistance
- Cross-model interactions reveal vulnerability patterns

4. **Scoring**:
- A 5-point scale measures response correctness
- GPT-4o serves as the consistent evaluator
- Metrics cover both deceptive effectiveness and resilience (an aggregation sketch follows this list)
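
The exact aggregation behind the published scores is not specified in this README, so the sketch below is only one plausible reading of the steps above: questions are kept only when every model has a perfect pre-deception baseline, and per-model averages are taken over the judged scores. The trial record layout, the rescaling of the 5-point scale so that 0 means "fully correct", and the plain averaging are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# One record per (question, answering model, deceiving model) trial.
# "score" is correctness degradation on the 5-point scale, rescaled so that
# 0 means the answer stayed fully correct (this rescaling is an assumption).
trials = [
    {"question": "q1", "answerer": "model_a", "deceiver": "model_b", "score": 0.0},
    {"question": "q1", "answerer": "model_b", "deceiver": "model_a", "score": 2.0},
]

# Pre-deception results: a question is kept only if every model answered it
# correctly, which establishes the clean baseline described above.
baseline = {"q1": {"model_a": True, "model_b": True}}


def eligible(question: str) -> bool:
    """True if all models answered this question correctly without deception."""
    answers = baseline.get(question, {})
    return bool(answers) and all(answers.values())


def leaderboards(trials):
    """Average degradation grouped by deceiver (effectiveness) and answerer (vulnerability)."""
    by_deceiver, by_answerer = defaultdict(list), defaultdict(list)
    for t in trials:
        if not eligible(t["question"]):
            continue
        by_deceiver[t["deceiver"]].append(t["score"])
        by_answerer[t["answerer"]].append(t["score"])
    deception = {m: mean(s) for m, s in by_deceiver.items()}
    vulnerability = {m: mean(s) for m, s in by_answerer.items()}
    return deception, vulnerability


deception_scores, vulnerability_scores = leaderboards(trials)
print(deception_scores, vulnerability_scores)
```

Under this reading, a higher per-deceiver average means the model's misleading arguments degraded more answers (effectiveness), while a higher per-answerer average means the model was misled more often (vulnerability).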

 
## 🏆 Disinformation Effectiveness Leaderboard

![g1](https://github.com/user-attachments/assets/3e800ad4-cf3b-4fb0-b4b3-55de01fb451c)

*Higher scores indicate greater effectiveness at creating disinformation*

| Model | Deception Score |
|-------|----------------|
| Claude 3.5 Sonnet | 1.099 |
| Mistral Large 2 | 1.094 |
| o1-preview | 1.027 |
| Grok 2 | 0.959 |
| Gemini 1.5 Pro (Sept) | 0.942 |
| Llama 3.1 405B | 0.780 |
| Llama 3.1 70B | 0.711 |
| o1-mini | 0.670 |
| Claude 3 Haiku | 0.660 |
| Claude 3 Opus | 0.647 |
| DeepSeek-V2.5 | 0.611 |
| Gemini 1.5 Flash | 0.605 |
| GPT-4o | 0.604 |
| GPT-4 Turbo | 0.558 |
| Multi-turn ensemble | 0.534 |
| Gemma 2 27B | 0.519 |
| GPT-4o mini | 0.486 |
| Qwen 2.5 72B | 0.445 |

Some models frequently refuse to create disinformation, while others comply with every request.
![image](https://github.com/user-attachments/assets/64e1fa01-84b1-4559-a6b9-bad1805ce1ec)

 
## 🏆 Disinformation Resistance Leaderboard

![g2](https://github.com/user-attachments/assets/bc80292d-399b-45ce-81a3-bfb5e71d065b)

*Lower scores indicate better resistance to disinformation*

| Model | Vulnerability Score |
|-------|-------------------|
| Claude 3 Opus | 0.277 |
| Claude 3.5 Sonnet | 0.279 |
| o1-preview | 0.315 |
| Mistral Large 2 | 0.353 |
| Multi-turn ensemble | 0.459 |
| o1-mini | 0.500 |
| Llama 3.1 405B | 0.540 |
| Qwen 2.5 72B | 0.611 |
| GPT-4o | 0.613 |
| Gemini 1.5 Pro (Sept) | 0.619 |
| Gemini 1.5 Flash | 0.644 |
| Claude 3 Haiku | 0.780 |
| Grok 2 | 0.811 |
| Llama 3.1 70B | 0.816 |
| GPT-4o mini | 1.066 |
| GPT-4 Turbo | 1.177 |
| DeepSeek-V2.5 | 1.427 |
| Gemma 2 27B | 1.435 |

 
## 📝 Key Findings

- Claude 3 Opus and Claude 3.5 Sonnet achieve exceptional resistance, with vulnerability scores below 0.28
- o1-preview demonstrates remarkable resilience with a score of 0.315
- Mistral Large 2 maintains strong accuracy under deceptive pressure
- Claude 3.5 Sonnet tops the deception effectiveness scale at 1.099
- Mistral Large 2 shows comparable capabilities with 1.094
- o1-preview exhibits strong persuasive abilities at 1.027

---
## Other multi-agent benchmarks
- [Public Goods Game (PGG) Benchmark: Contribute & Punish](https://github.com/lechmazur/pgg_bench/)
- [Elimination Game: Social Reasoning and Deception in Multi-Agent LLMs](https://github.com/lechmazur/elimination_game/)
- [Step Race: Collaboration vs. Misdirection Under Pressure](https://github.com/lechmazur/step_game/)

## Other benchmarks
- [Extended NYT Connections](https://github.com/lechmazur/nyt-connections/)
- [LLM Thematic Generalization Benchmark](https://github.com/lechmazur/generalization/)
- [LLM Creative Story-Writing Benchmark](https://github.com/lechmazur/writing/)
- [LLM Confabulation/Hallucination Benchmark](https://github.com/lechmazur/confabulations/)
- [LLM Deceptiveness and Gullibility](https://github.com/lechmazur/deception/)
- [LLM Divergent Thinking Creativity Benchmark](https://github.com/lechmazur/divergent/)
---
## 📫 Updates

- Follow [@lechmazur](https://x.com/LechMazur) on X (Twitter) for updates