https://github.com/humankernel/rag-eval
Create synthetic datasets for RAG evaluation
- Host: GitHub
- URL: https://github.com/humankernel/rag-eval
- Owner: humankernel
- Created: 2025-03-23T19:12:38.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-11-03T23:23:03.000Z (8 months ago)
- Last Synced: 2025-11-04T01:16:56.189Z (8 months ago)
- Topics: gradio, rag-metrics, ragas-evaluation, vllm
- Language: Python
- Homepage:
- Size: 1.66 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# WikiQA Dataset Creator

**WikiQA** is a tool for generating **synthetic question-answer datasets** using **Wikipedia** and **Large Language Models (LLMs)**.
It was developed to support the evaluation of **Retrieval-Augmented Generation (RAG)** systems, particularly [this RAG evaluator](https://github.com/humankernel/rag-revamped).
## Selected Wikipedia Topics
### Mathematics
* [Prime Numbers](https://en.wikipedia.org/wiki/Prime_number)
* [Linear Algebra](https://en.wikipedia.org/wiki/Linear_algebra)
* [Calculus](https://en.wikipedia.org/wiki/Calculus)
* [Probability](https://en.wikipedia.org/wiki/Probability)
### Computer Science
* [Algorithm](https://en.wikipedia.org/wiki/Algorithm)
* [Data Structure](https://en.wikipedia.org/wiki/Data_structure)
* [Artificial Intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence)
* [Computer Programming](https://en.wikipedia.org/wiki/Computer_programming)
### Biology
* [Cell (biology)](https://en.wikipedia.org/wiki/Cell_%28biology%29)
* [Genetics](https://en.wikipedia.org/wiki/Genetics)
* [Evolution](https://en.wikipedia.org/wiki/Evolution)
* [Ecology](https://en.wikipedia.org/wiki/Ecology)
### Physics
* [Classical Mechanics](https://en.wikipedia.org/wiki/Classical_mechanics)
* [Electromagnetism](https://en.wikipedia.org/wiki/Electromagnetism)
* [Quantum Mechanics](https://en.wikipedia.org/wiki/Quantum_mechanics)
* [Thermodynamics](https://en.wikipedia.org/wiki/Thermodynamics)
### General Topics
* [Batman](https://en.wikipedia.org/wiki/Batman)
* [Dachshund](https://en.wikipedia.org/wiki/Dachshund)
* [Conspiracy Theory](https://en.wikipedia.org/wiki/Conspiracy_theory)
* [Religion](https://en.wikipedia.org/wiki/Religion)
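The topic list above could be kept as a small registry that a generator iterates over. The sketch below is only an illustration: `TOPICS` and `wikipedia_url` are hypothetical names, not taken from the repository, and the titles are read off the URLs listed above.

```python
from urllib.parse import quote

# Hypothetical registry: category -> Wikipedia page titles (taken from the list above).
TOPICS = {
    "Mathematics": ["Prime number", "Linear algebra", "Calculus", "Probability"],
    "Computer Science": ["Algorithm", "Data structure",
                         "Artificial intelligence", "Computer programming"],
    "Biology": ["Cell (biology)", "Genetics", "Evolution", "Ecology"],
    "Physics": ["Classical mechanics", "Electromagnetism",
                "Quantum mechanics", "Thermodynamics"],
    "General Topics": ["Batman", "Dachshund", "Conspiracy theory", "Religion"],
}


def wikipedia_url(title: str) -> str:
    """Build an English Wikipedia article URL from a page title."""
    # Spaces become underscores; remaining reserved characters are percent-encoded.
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"), safe="")


# Flatten the registry into (category, title, url) rows.
ROWS = [(cat, t, wikipedia_url(t)) for cat, titles in TOPICS.items() for t in titles]
```

Keeping titles rather than full URLs in the registry makes it easy to add a topic without hand-encoding characters such as the parentheses in `Cell (biology)`.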
## Question Types
Each dataset entry belongs to one of several **cognitive and reasoning categories**, enabling targeted evaluation of RAG models:
1. **Factual**: objective, verifiable facts.
2. **Multi-Hop**: multi-step reasoning or combined facts.
3. **Semantic**: interpretation and meaning of concepts.
4. **Logical Reasoning**: applying formal rules or laws.
5. **Creative Thinking**: open-ended or hypothetical reasoning.
6. **Problem-Solving**: applying formulas or methods to compute results.
7. **Ethical & Philosophical**: moral or conceptual reflection on science.
Each question type is designed to stress different aspects of retrieval and generation in RAG systems.
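As a rough illustration of how a dataset entry could carry its question type, here is a minimal sketch. The README does not document the actual output schema, so `QAEntry`, its field names, and the enum values are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class QuestionType(str, Enum):
    """The seven categories listed above (string values are hypothetical)."""
    FACTUAL = "factual"
    MULTI_HOP = "multi_hop"
    SEMANTIC = "semantic"
    LOGICAL_REASONING = "logical_reasoning"
    CREATIVE_THINKING = "creative_thinking"
    PROBLEM_SOLVING = "problem_solving"
    ETHICAL_PHILOSOPHICAL = "ethical_philosophical"


@dataclass
class QAEntry:
    """One synthetic question-answer pair (assumed fields, not the repo's schema)."""
    question: str
    answer: str
    question_type: QuestionType
    source_title: str  # Wikipedia page the pair was generated from


entry = QAEntry(
    question="What is a prime number?",
    answer="A natural number greater than 1 whose only positive divisors are 1 and itself.",
    question_type=QuestionType.FACTUAL,
    source_title="Prime number",
)

# Because QuestionType subclasses str, it serializes as its plain string value.
record = json.dumps(asdict(entry))
```

Tagging each entry with its type lets an evaluator report scores per category instead of a single aggregate.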
## Evaluation Metrics
Although WikiQA only generates datasets, it is designed around **RAG evaluation metrics** (see [Key Metrics and Evaluation Methods for RAG](https://www.youtube.com/watch?v=cRz0BWkuwHg)).
### Retrieval Metrics
| Metric | Measures | Description |
| -------------------------------- | --------------------------- | ----------------------------------------------------- |
| **Precision** | Relevance of retrieved docs | Fraction of retrieved documents that are relevant |
| **Recall** | Coverage of relevant docs | Fraction of relevant documents that were retrieved |
| **Hit Rate**                     | Top-result success          | % of queries retrieving ≥1 relevant doc in top-k      |
| **MRR (Mean Reciprocal Rank)** | Top result position | Measures how high the first relevant doc ranks |
| **NDCG** | Ranking quality | Evaluates both relevance and order of retrieved docs |
| **MAP (Mean Average Precision)** | Overall retrieval accuracy | Averages precision over all relevant docs and queries |
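The retrieval metrics in the table can be sketched per query as follows. This is a minimal illustration under stated assumptions, not the evaluator's implementation: relevance is binary, `retrieved` is a ranked list of unique document IDs, and all function names are chosen for the example. MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries, and Hit Rate is the mean of `hit_at_k`.

```python
import math
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / len(top) if top else 0.0


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document is in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Example query: two relevant documents, found at ranks 2 and 4.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 4))  # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5
print(average_precision(retrieved, relevant))  # 0.5
```

Graded relevance (rather than binary) would only change the gain term in `ndcg_at_k`; the other metrics are defined on binary labels.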
### Generation Metrics
| Metric | Measures | Example |
| ---------------------- | ------------------------------------- | ---------------------------------------------------- |
| **Faithfulness**       | Factual consistency with context      | "Einstein was born in Germany on March 14, 1879."     |
| **Answer Relevance**   | How well the answer fits the question | Adds missing but relevant info, e.g. France → "Paris" |
| **Answer Correctness** | Alignment with ground truth | Matches true reference answer accurately |
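Faithfulness and answer relevance are usually scored with an LLM judge or embeddings, which is beyond a short sketch. Answer correctness, however, can be approximated lexically. The example below uses token-level F1 against the reference answer; this is a swapped-in proxy chosen for illustration, not the metric the evaluator or any particular library is known to use, and `tokenize` and `token_f1` are hypothetical names.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the reference answer."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Paris", "paris"))   # 1.0
print(token_f1("Berlin", "Paris"))  # 0.0
```

A lexical score like this penalizes correct paraphrases, so it is best read as a cheap lower bound alongside a semantic or judge-based score.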
## Example Use Case
This tool can be used to:
* Build **synthetic QA datasets** for RAG benchmark testing.
* Evaluate the **retrieval** and **generation** quality of LLM-based systems.
* Train or fine-tune **retrieval models** on domain-specific scientific content.
## Related Projects
* **RAG Evaluator:** [humankernel/rag-revamped](https://github.com/humankernel/rag-revamped)
* **Undergraduate Thesis:** [humankernel/thesis](https://humankernel.github.io/thesis/main.pdf)