https://github.com/teilomillet/kushim
eval creator
- Host: GitHub
- URL: https://github.com/teilomillet/kushim
- Owner: teilomillet
- Created: 2025-06-14T12:00:47.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-06-27T09:12:17.000Z (4 months ago)
- Last Synced: 2025-06-27T09:32:01.746Z (4 months ago)
- Topics: dataset, dataset-generation, eval, evaluation, evaluation-framework, llm, openai
- Language: Python
- Homepage:
- Size: 505 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# Kushim: A Framework for Verifiable, Self-Optimizing LLM Evaluation Datasets
Kushim is a framework for generating high-quality, verifiable Question & Answer datasets. In an era of generative models, creating reliable evaluation data is one of the biggest challenges. Kushim addresses this by providing an end-to-end workflow built on two core principles: **verifiability by design** and **self-optimizing quality**.
It's not just about generating data; it's about generating *trustworthy* data that gets *better* on its own.
## The Kushim Philosophy: Core Concepts
### 1. Verifiable by Design
The biggest risk with synthetic data is factual inconsistency. Kushim is built to mitigate this risk. Every single question-answer pair generated by the pipeline is subjected to a strict validation step. An LLM-based validator checks if the generated answer is factually and unambiguously supported by the original source text. If a pair fails this check, it's discarded.
This ensures that your final dataset isn't just a collection of plausible-sounding questions, but a set of verifiable facts grounded in a source of truth.
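The validation gate described above can be sketched as a simple filter. In this sketch, `llm_supports` is a hypothetical stand-in for Kushim's real LLM-based validator; a crude substring heuristic is used in its place so the example runs standalone:

```python
def llm_supports(source_text: str, answer: str) -> bool:
    """Stand-in for the LLM validator: is the answer unambiguously
    supported by the source text? (Here: a naive substring check.)"""
    return answer.lower() in source_text.lower()

def filter_verified(pairs, source_text):
    """Keep only Q&A pairs whose answer is grounded in the source chunk."""
    return [(q, a) for q, a in pairs if llm_supports(source_text, a)]

chunk = "Coffee cultivation began in Ethiopia before spreading to Yemen."
candidates = [
    ("Where did coffee cultivation begin?", "Ethiopia"),
    ("When was espresso invented?", "1901"),  # not supported by the chunk
]
verified = filter_verified(candidates, chunk)
print(verified)  # only the Ethiopia pair survives
```

The real pipeline replaces the heuristic with an LLM judgment, but the filtering logic is the same: pairs that fail the check never reach the final dataset.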
### 2. Self-Optimizing Quality with DSPy
A static, one-size-fits-all prompt is not optimal. The best way to phrase a question depends on the source material. Kushim leverages the power of **DSPy** to create a self-improving pipeline.
Instead of just running a prompt, Kushim can "compile" it. It uses DSPy's optimizers (`teleprompters`) to:
1. Generate a small training set from your source documents.
2. Test multiple variations of prompts to see which ones produce the highest-quality, most verifiable Q&A pairs *for your specific data*.
3. Save this "compiled" program, which contains the optimized, high-performance prompts.

This means Kushim learns from your data to improve its own performance, leading to a significantly higher-quality final dataset.
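The compile-once, reuse-later behavior can be sketched as a cache pattern. This is plain Python, not DSPy's actual API; `optimize_prompt` is a hypothetical placeholder for the teleprompter run:

```python
import json
import os

def optimize_prompt(train_set):
    """Hypothetical stand-in for a teleprompter run: pick the prompt
    variant that scores best on a small training set. In DSPy this
    would be driven by a metric over generated Q&A pairs; here we
    simply pretend the second variant wins."""
    candidates = [
        "Ask a factual question about: {chunk}",
        "Write one verifiable Q&A pair grounded in: {chunk}",
    ]
    return {"prompt": candidates[1], "score": 0.92}

def load_or_compile(path, train_set):
    """First run compiles and saves; later runs reuse the saved program."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    compiled = optimize_prompt(train_set)
    with open(path, "w") as f:
        json.dump(compiled, f)
    return compiled

program = load_or_compile("compiled_generator.json", train_set=["chunk one"])
print(program["prompt"])
```

The point of the pattern is that the expensive optimization happens once per dataset; every subsequent run loads the saved artifact and skips straight to generation.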
## How It Works: The Kushim Pipeline Workflow
The Kushim pipeline integrates these concepts into an efficient, streaming workflow that proceeds in the following stages:
1. **Source & Fetch**: The process begins by fetching raw documents from a designated `Source`, such as a Wikipedia article or a local file directory.
2. **Chunking**: The fetched documents are broken down into smaller, manageable text chunks. This is a standard practice in RAG-style pipelines and prepares the data for the generation models.
3. **Self-Optimization (A One-Time "Compile" Step)**: This is the heart of Kushim's quality assurance process. Instead of using a static prompt, the pipeline:
* **Generates a Training Set**: It takes a small sample of the chunks to create a temporary training set.
* **Optimizes Prompts**: It uses DSPy to test multiple prompt variations, identifying the one that produces the highest-quality, most verifiable Q&A pairs for *your specific data*.
* This "compiled" program, containing the optimized prompts, is saved and used for the main generation task.
4. **Generation**: Using the high-performance prompts from the compilation step, the pipeline generates question-answer pairs from all of the text chunks.
5. **Validation & Filtering**: Each generated Q&A pair is rigorously validated. An LLM checks if the answer is factually supported by its original source chunk. Pairs that pass validation proceed to the final dataset; those that fail are discarded.
This multi-stage process ensures that the final output is not only relevant but also verifiable and of the highest possible quality.
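The staged flow above can be sketched end to end with a few small functions. This is illustrative only: `generate_pair` and `is_supported` are hypothetical placeholders for the LLM calls that the real pipeline makes.

```python
def chunk(document: str, size: int = 60):
    """Stage 2: split a fetched document into fixed-size text chunks."""
    for i in range(0, len(document), size):
        yield document[i:i + size]

def generate_pair(text: str):
    """Stage 4 stand-in: a real pipeline would prompt an LLM here."""
    return (f"What does this passage state? ({text[:20]}...)", text.strip())

def is_supported(answer: str, source: str) -> bool:
    """Stage 5 stand-in for the LLM validator."""
    return answer in source

def run_pipeline(document: str):
    """Stream chunks through generation and validation, keeping survivors."""
    dataset = []
    for piece in chunk(document):
        question, answer = generate_pair(piece)
        if is_supported(answer, piece):
            dataset.append((question, answer))
    return dataset

doc = ("Coffee spread from Ethiopia to Yemen, then through the "
       "Ottoman Empire to Europe.")
print(run_pipeline(doc))
```

In the real pipeline each stage is backed by an LLM and the compiled prompts from stage 3, but the overall shape — chunk, generate, validate, keep or discard — is the same.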
## Getting Started
After installing Kushim (`uv add kushim` or `pip install kushim`), you can use its core components directly. The key is to instantiate the pipeline and run it. The optimization is handled for you—the first run compiles and saves the best prompts, and subsequent runs are fast.
```python
# A conceptual example of using the Kushim pipeline
from kushim import pipeline, config, source

# 1. Choose your source and model
data_source = source.WikipediaSource()
pipeline_config = config.KushimConfig(
    model_name="openrouter/openai/gpt-4.1",
    fetch_kwargs={"mode": "search", "query": "History of coffee"},
)

# 2. Instantiate the pipeline
kushim_pipeline = pipeline.KushimPipeline(
    source=data_source,
    config=pipeline_config,
)

# 3. Run it!
# This will automatically handle compiling and saving the optimized
# generator to a .json file for you on the first run.
validated_dataset, _ = kushim_pipeline.run(
    optimize=True,
    compiled_generator_path="compiled_coffee_generator.json",
)

print(validated_dataset)
```

For complete, runnable scripts demonstrating the full dataset creation lifecycle (merging, encryption, and pushing to the Hugging Face Hub), see the `examples/` directory in the [GitHub repository](https://github.com/teilomillet/kushim).