https://github.com/shaadclt/ragas-synthetic-test-data-generation
This project demonstrates how to generate synthetic test data for Retrieval Augmented Generation (RAG) using Ragas.
- Host: GitHub
- URL: https://github.com/shaadclt/ragas-synthetic-test-data-generation
- Owner: shaadclt
- License: MIT
- Created: 2024-08-17T16:38:38.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-17T16:56:08.000Z (about 1 year ago)
- Last Synced: 2025-04-10T01:11:22.105Z (6 months ago)
- Topics: ragas, synthetic-data-generation, test-data-generator
- Language: Jupyter Notebook
- Homepage:
- Size: 102 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Synthetic Test Data Generation using LangChain, Ragas, and Groq API
This project demonstrates how to generate synthetic test data for Retrieval Augmented Generation (RAG) using Ragas.
## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
- [Environment Setup](#environment-setup)
- [Load Documents from PubMed](#load-documents-from-pubmed)
- [Generate Test Sets](#generate-test-sets)
- [Output](#output)
- [References](#references)
- [License](#license)

## Installation
To get started, clone the repository and install the required dependencies.
```python
!pip install ragas langchain_community langchain_groq sentence_transformers xmltodict -q
```

## Usage
### Environment Setup
1. Import necessary libraries and set up environment variables.
2. Initialize the Groq API for data generation and critique models.
3. Set up the HuggingFace BGE embeddings for document processing, as sketched below.
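
A minimal sketch of this setup is shown below. The variable names match those used in the generation snippet later in this README; the specific Groq model (`llama3-70b-8192`) and embedding model (`BAAI/bge-small-en-v1.5`) are assumptions for illustration and may differ from the notebook.

```python
import os

from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# Groq API key (assumption: provided via an environment variable or placeholder)
os.environ["GROQ_API_KEY"] = "your-groq-api-key"

# Data generation and critique models (model name is illustrative)
data_generation_model = ChatGroq(model="llama3-70b-8192", temperature=0.7)
critic_model = ChatGroq(model="llama3-70b-8192", temperature=0)

# HuggingFace BGE embeddings for document processing (model name is illustrative)
embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```
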
### Load Documents from PubMed

Use the `PubMedLoader` from `langchain_community` to load documents related to a specific query (e.g., "cancer"). In this project, we load a maximum of 5 documents.

```python
from langchain_community.document_loaders import PubMedLoader

# Fetch up to 5 PubMed abstracts matching the query "cancer"
loader = PubMedLoader("cancer", load_max_docs=5)
documents = loader.load()
```

### Generate Test Sets
We use the `TestsetGenerator` from Ragas to generate test sets based on the loaded documents. The test set generation includes:

- Simple questions: 50%
- Multi-context questions: 40%
- Reasoning-based questions: 10%
The following code sets up the test set generation:

```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, multi_context, reasoning

# Build the generator from the LangChain models and embeddings set up earlier
generator = TestsetGenerator.from_langchain(
    data_generation_model,
    critic_model,
    embeddings,
)

# Question-type distribution: 50% simple, 40% multi-context, 10% reasoning
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1,
}

# Generate 5 test samples from the loaded PubMed documents
testset = generator.generate_with_langchain_docs(documents, 5, distributions)
test_df = testset.to_pandas()
```

### Output
The output is a pandas DataFrame containing the generated test set, which can be further analyzed or used for model evaluation.
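
As a quick follow-up (not part of the original notebook), the DataFrame can be previewed or saved for later evaluation runs; the filename below is illustrative:

```python
# Preview the first generated rows (questions, contexts, ground truths)
print(test_df.head())

# Persist the test set for later evaluation runs
test_df.to_csv("ragas_testset.csv", index=False)
```
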
## License

This project is licensed under the [MIT License](/LICENSE.txt).