https://github.com/humankernel/rag-eval

Create synthetic datasets for RAG evaluation

# 🧠 WikiQA Dataset Creator

![screenshot](paper/ui.png)

**WikiQA** is a tool for generating **synthetic question–answer datasets** using **Wikipedia** and **Large Language Models (LLMs)**.
It was developed to support the evaluation of **Retrieval-Augmented Generation (RAG)** systems, particularly [this RAG evaluator](https://github.com/humankernel/rag-revamped).

## 📚 Selected Wikipedia Topics

### 🧮 Mathematics

* [Prime Numbers](https://en.wikipedia.org/wiki/Prime_number)
* [Linear Algebra](https://en.wikipedia.org/wiki/Linear_algebra)
* [Calculus](https://en.wikipedia.org/wiki/Calculus)
* [Probability](https://en.wikipedia.org/wiki/Probability)

### 💻 Computer Science

* [Algorithm](https://en.wikipedia.org/wiki/Algorithm)
* [Data Structure](https://en.wikipedia.org/wiki/Data_structure)
* [Artificial Intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence)
* [Computer Programming](https://en.wikipedia.org/wiki/Computer_programming)

### 🧬 Biology

* [Cell (biology)](https://en.wikipedia.org/wiki/Cell_%28biology%29)
* [Genetics](https://en.wikipedia.org/wiki/Genetics)
* [Evolution](https://en.wikipedia.org/wiki/Evolution)
* [Ecology](https://en.wikipedia.org/wiki/Ecology)

### ⚛️ Physics

* [Classical Mechanics](https://en.wikipedia.org/wiki/Classical_mechanics)
* [Electromagnetism](https://en.wikipedia.org/wiki/Electromagnetism)
* [Quantum Mechanics](https://en.wikipedia.org/wiki/Quantum_mechanics)
* [Thermodynamics](https://en.wikipedia.org/wiki/Thermodynamics)

### 🌍 General Topics

* [Batman](https://en.wikipedia.org/wiki/Batman)
* [Dachshund](https://en.wikipedia.org/wiki/Dachshund)
* [Conspiracy Theory](https://en.wikipedia.org/wiki/Conspiracy_theory)
* [Religion](https://en.wikipedia.org/wiki/Religion)

## 🧩 Question Types

Each dataset entry belongs to one of several **cognitive and reasoning categories**, enabling targeted evaluation of RAG models:

1. ✅ **Factual** – objective, verifiable facts.
2. 🔗 **Multi-Hop** – multi-step reasoning or combined facts.
3. 🧠 **Semantic** – interpretation and meaning of concepts.
4. ⚙️ **Logical Reasoning** – applying formal rules or laws.
5. 💡 **Creative Thinking** – open-ended or hypothetical reasoning.
6. 📐 **Problem-Solving** – applying formulas or methods to compute results.
7. ⚖️ **Ethical & Philosophical** – moral or conceptual reflection on science.

Each question type is designed to stress different aspects of retrieval and generation in RAG systems.
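
As a rough illustration of how entries tagged with these categories could be represented, here is a minimal Python sketch. The `QuestionType` enum, the `QAEntry` dataclass, and its field names are hypothetical and chosen for this example; they are not taken from the WikiQA code base.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class QuestionType(str, Enum):
    """The seven question categories listed above (identifiers are hypothetical)."""
    FACTUAL = "factual"
    MULTI_HOP = "multi_hop"
    SEMANTIC = "semantic"
    LOGICAL_REASONING = "logical_reasoning"
    CREATIVE_THINKING = "creative_thinking"
    PROBLEM_SOLVING = "problem_solving"
    ETHICAL_PHILOSOPHICAL = "ethical_philosophical"


@dataclass
class QAEntry:
    """One synthetic question-answer pair (assumed schema, for illustration only)."""
    question: str
    answer: str
    question_type: QuestionType
    source_url: str  # Wikipedia article the entry was generated from

    def to_json(self) -> str:
        """Serialize the entry as one JSON line, storing the category as a plain string."""
        record = asdict(self)
        record["question_type"] = self.question_type.value
        return json.dumps(record, ensure_ascii=False)


entry = QAEntry(
    question="What is a prime number?",
    answer="A natural number greater than 1 whose only positive divisors are 1 and itself.",
    question_type=QuestionType.FACTUAL,
    source_url="https://en.wikipedia.org/wiki/Prime_number",
)
print(entry.to_json())
```

Storing the category as a plain string keeps each line easy to filter by question type when evaluating a RAG system on one category at a time.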

## 📊 Evaluation Metrics

Although WikiQA only generates datasets, it is designed around **RAG evaluation metrics** (see [Key Metrics and Evaluation Methods for RAG](https://www.youtube.com/watch?v=cRz0BWkuwHg)).

### 🔍 Retrieval Metrics

| Metric | Measures | Description |
| -------------------------------- | --------------------------- | ----------------------------------------------------- |
| **Precision** | Relevance of retrieved docs | Fraction of retrieved documents that are relevant |
| **Recall** | Coverage of relevant docs | Fraction of relevant documents that were retrieved |
| **Hit Rate** | Top-k success | % of queries retrieving ≥1 relevant doc in top-k |
| **MRR (Mean Reciprocal Rank)** | Top result position | Measures how high the first relevant doc ranks |
| **NDCG** | Ranking quality | Evaluates both relevance and order of retrieved docs |
| **MAP (Mean Average Precision)** | Overall retrieval accuracy | Averages precision over all relevant docs and queries |
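
The metrics in this table can be sketched per query in plain Python. This is a minimal sketch, not the evaluator's implementation: it assumes binary relevance, a ranked list of unique document IDs, and hypothetical function names. Averaging `hit_at_k`, `reciprocal_rank`, and `average_precision` over all queries gives Hit Rate, MRR, and MAP respectively.

```python
import math
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(doc in relevant for doc in top) / len(top)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document is in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain DCG of the top-k, normalized by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: document IDs are made up for the example.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=3))  # 1 of the top 3 is relevant
print(recall_at_k(retrieved, relevant, k=3))     # 1 of the 2 relevant docs is found
print(reciprocal_rank(retrieved, relevant))      # first relevant doc is at rank 2
```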

### ✍️ Generation Metrics

| Metric | Measures | Example |
| ---------------------- | ------------------------------------- | ---------------------------------------------------- |
| **Faithfulness** | Factual consistency with context | “Einstein was born in Germany on March 14, 1879.” |
| **Answer Relevance** | How well the answer fits the question | Adds missing but relevant info like France → “Paris” |
| **Answer Correctness** | Alignment with ground truth | Matches true reference answer accurately |
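
Two of these metrics can be approximated with a few lines of Python. The sketch below rests on stated assumptions: faithfulness is taken as the fraction of answer claims supported by the retrieved context, with the per-claim verdicts supplied by the caller (in practice they usually come from an LLM judge), and answer correctness is replaced by a simple token-overlap F1, which is only a lexical proxy and not how an LLM-based evaluator would score it. The function names are hypothetical. Answer relevance is omitted because it typically needs embeddings or an LLM.

```python
import re
from collections import Counter
from typing import List, Sequence


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def _tokens(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is discarded."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the reference answer."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# An answer with two claims, of which the context supports only one.
print(faithfulness([True, False]))
# A terse but correct answer scores low on lexical overlap.
print(token_f1("Paris", "The capital of France is Paris"))
```

The second call shows the limit of the lexical proxy: a short correct answer is penalized for low recall, which is one reason evaluators tend to rely on semantic or LLM-based scoring instead.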

## ⚙️ Example Use Case

The generated datasets can be used to:

* Build **synthetic QA datasets** for RAG benchmark testing.
* Evaluate the **retrieval** and **generation** quality of LLM-based systems.
* Train or fine-tune **retrieval models** on domain-specific scientific content.

## 🧠 Related Projects

* 🔗 **RAG Evaluator:** [humankernel/rag-revamped](https://github.com/humankernel/rag-revamped)
* 🧾 **Undergraduate Thesis:** [humankernel/thesis](https://humankernel.github.io/thesis/main.pdf)