https://github.com/seank021/LLM-Bias-Evaluation
A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs
- Host: GitHub
- URL: https://github.com/seank021/LLM-Bias-Evaluation
- Owner: seank021
- License: MIT
- Created: 2025-03-20T06:07:00.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-20T08:03:51.000Z (about 2 months ago)
- Last Synced: 2025-03-20T09:22:52.534Z (about 2 months ago)
- Language: Python
- Size: 181 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-hacking-lists - seank021/LLM-Bias-Evaluation - A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs (Python)
README
# A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs
This repository contains the dataset, evaluation scripts, and results for analyzing geopolitical and cultural biases in large language models (LLMs). The study is structured into two evaluation phases: factual QA (objective knowledge) and disputable QA (politically sensitive disputes). We explore how LLMs exhibit model bias (training-induced) and inference bias (query-language-induced) when answering questions in different languages.
## 1. Repository Structure
```
LLM-Bias-Evaluation/
│── disputable_qa/ # Data and evaluation scripts for disputable (subjective) topics
│ ├── dataset/
│ │ ├── dataset.json # JSON dataset for disputable QA (translated into four languages and manually verified)
│ │ ├── questions.csv # Raw questions for disputable QA (only Korean; before translation)
│ │ ├── translation.py # Translation script for multiple languages
│ │
│ ├── evaluation/
│ │ ├── choice_analysis.csv # Choice-type questions analysis
│ │ ├── combined_data.csv # Merged dataset of all evaluations
│ │ ├── eval.py # Main evaluation script
│ │ ├── inference_bias_analysis.csv # Inference bias analysis
│ │ ├── model_bias_analysis.csv # Model bias analysis
│ │ ├── open_analysis.csv # Open-ended-type responses analysis
│ │ ├── persona_analysis.csv # Persona-type questions analysis
│ │ ├── related_analysis.csv # Related/non-related nations analysis
│ │ ├── tf_analysis.csv # True/false-type question analysis
│ │ ├── topic_analysis.csv # Topic analysis
│
│ ├── model_response/ # Responses generated by models
│ │ ├── result/
│ │ │ ├── chatgpt.json # ChatGPT-generated responses
│ │ │ ├── cn.json # Responses from Chinese model
│ │ │ ├── jp.json # Responses from Japanese model
│ │ │ ├── kr.json # Responses from Korean model
│ │ │ ├── us.json # Responses from US-based model
│ │ ├── get_response.py # Script to fetch model responses
│
│── factual_qa/ # Data and evaluation scripts for factual (objective) topics
│ ├── dataset/
│ │ ├── dataset.json # JSON dataset for factual QA (translated into four languages and manually verified)
│ │ ├── questions.csv # Raw questions for factual QA (only Korean; before translation)
│ │ ├── translation.py # Translation script for multiple languages
│ │
│ ├── evaluation/
│ │ ├── human/ # Human evaluation
│ │ │ ├── chatgpt.py # Human evaluation of ChatGPT responses
│ │ │ ├── cn.py # Human evaluation of Chinese model responses
│ │ │ ├── jp.py # Human evaluation of Japanese model responses
│ │ │ ├── kr.py # Human evaluation of Korean model responses
│ │ │ ├── us.py # Human evaluation of US model responses
│ │ ├── model_based/ # Model-based evaluation
│ │ │ ├── chatgpt.py # Model-based evaluation of ChatGPT responses
│ │ │ ├── cn.py # Model-based evaluation of Chinese model responses
│ │ │ ├── jp.py # Model-based evaluation of Japanese model responses
│ │ │ ├── kr.py # Model-based evaluation of Korean model responses
│ │ │ ├── us.py # Model-based evaluation of US model responses
│ │
│ ├── model_response/ # Responses generated by models
│ │ ├── result/
│ │ │ ├── chatgpt.json # ChatGPT-generated responses
│ │ │ ├── cn.json # Responses from Chinese model
│ │ │ ├── jp.json # Responses from Japanese model
│ │ │ ├── kr.json # Responses from Korean model
│ │ │ ├── us.json # Responses from US-based model
│ │ ├── chatgpt.py # Script to fetch ChatGPT responses
│ │ ├── cn.py # Script to fetch Chinese model responses
│ │ ├── jp.py # Script to fetch Japanese model responses
│ │ ├── kr.py # Script to fetch Korean model responses
│ │ ├── us.py # Script to fetch US model responses
│
│── README.md # Documentation (this file)
```
## 2. Overview of the Study
This study investigates biases in LLMs through two phases:
1. **Factual QA Evaluation**: Measures how models handle objective knowledge in different languages.
2. **Disputable QA Evaluation**: Analyzes how models respond to geopolitical and historical disputes.

We define two bias types:
- **Model Bias**: The tendency to generate answers aligned with the model's primary training language.
- **Inference Bias**: The tendency to generate answers aligned with the language of the query.

## 3. Dataset Construction
The dataset includes:
- **Factual QA (Objective Knowledge)**: 70 factual questions covering country names, government structure, and official policies.
- **Disputable QA (Geopolitical Conflicts)**: 4 major disputes analyzed using open-ended, persona-based, true/false, and multiple-choice questions.

Translations were generated using GPT-4o and manually verified.
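The `translation.py` scripts in each `dataset/` folder implement this translation step; the repository's exact prompts and code are not reproduced here. As a minimal, illustrative sketch using the OpenAI Python client (the prompt wording, target-language list, and `translate` helper below are assumptions, not the repository's code):

```
from openai import OpenAI  # official OpenAI Python client (openai>=1.0)

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Source questions are in Korean; the target languages here are assumed.
TARGET_LANGUAGES = ["English", "Japanese", "Chinese"]

def translate(question: str, language: str) -> str:
    """Translate a single Korean question into the target language with GPT-4o."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Translate the following question into {language}. "
                        "Preserve the meaning exactly; output only the translation."},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content.strip()
```

Each machine translation is then manually verified before it enters `dataset.json`, as described above.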
## 4. Evaluation Approach
We evaluate responses using:
1. **Model-based Evaluation (GPT-4o)**: Determines whether responses match expected answers (a sketch follows this list).
2. **Human Evaluation**: Classifies responses based on national perspective, neutrality, or refusal to answer.
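As a rough sketch of step 1 for factual QA (the grading prompt and the `judge` helper are illustrative assumptions; the actual logic lives in `factual_qa/evaluation/model_based/*.py`):

```
from openai import OpenAI  # official OpenAI Python client (openai>=1.0)

client = OpenAI()

# Hypothetical grading prompt; the repository's scripts may phrase this differently.
JUDGE_PROMPT = (
    "You are grading a factual QA response.\n"
    "Question: {question}\n"
    "Expected answer: {expected}\n"
    "Model response: {response}\n"
    "Answer with exactly 'correct' if the response matches the expected answer, "
    "otherwise 'incorrect'."
)

def judge(question: str, expected: str, response: str) -> bool:
    """Ask GPT-4o whether a model response matches the expected answer."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  expected=expected,
                                                  response=response)}],
    )
    return completion.choices[0].message.content.strip().lower().startswith("correct")
```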
*Only human evaluation was used in phase 2, to avoid introducing additional model bias on sensitive topics.*

**Evaluation Metrics** (a computation sketch follows this list):
- **Model Bias Rate** = (# of model-language-aligned responses) / (total questions)
- **Inference Bias Rate** = (# of query-language-aligned responses) / (total questions)
- **Neutral Response Rate** = (# of responses that match no national stance) / (total questions)
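A minimal sketch of how these rates can be computed from per-question alignment labels (the label names and the `bias_rates` helper are illustrative assumptions, not the repository's actual schema):

```
def bias_rates(labels: list[str]) -> dict[str, float]:
    """Compute bias rates from one alignment label per question.

    Assumed labels:
      'model_language' - aligned with the model's primary training language
      'query_language' - aligned with the language of the query
      'neutral'        - matches no national stance
    """
    total = len(labels)
    return {
        "model_bias_rate": labels.count("model_language") / total,
        "inference_bias_rate": labels.count("query_language") / total,
        "neutral_response_rate": labels.count("neutral") / total,
    }

# Example with 10 labeled responses:
print(bias_rates(["model_language"] * 4 + ["query_language"] * 5 + ["neutral"]))
# {'model_bias_rate': 0.4, 'inference_bias_rate': 0.5, 'neutral_response_rate': 0.1}
```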
## 5. Key Findings
- Inference bias dominates factual QA: Responses tend to align with the query language rather than the training data.
- Model bias is stronger in political disputes: LLMs tend to align with national perspectives in subjective topics.
- ChatGPT & US models attempt neutrality but exhibit topic-dependent biases.
- Question structure influences responses: Open-ended questions lead to avoidance, while multiple-choice questions enforce clear biases.

## 6. How to Run This Code
### Generating Responses
For example, to generate responses from ChatGPT for phase 2:
```
cd disputable_qa/model_response
python get_response.py
```

### Running Evaluations
For example, to run evaluations for phase 2:
```
cd disputable_qa/evaluation
python eval.py
```

For example, to run the model-based evaluations for factual QA (phase 1):
```
cd factual_qa/evaluation/model_based
python chatgpt.py # (or cn.py, jp.py, kr.py, us.py)
```