# Synthetic Test Data Generation using LangChain, Ragas, and Groq API

This project demonstrates how to generate synthetic test data for Retrieval Augmented Generation (RAG) using Ragas.

## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
- [Environment Setup](#environment-setup)
- [Load Documents from PubMed](#load-documents-from-pubmed)
- [Generate Test Sets](#generate-test-sets)
- [Output](#output)
- [References](#references)
- [License](#license)

## Installation

To get started, clone the repository and install the required dependencies.

```bash
pip install ragas langchain_community langchain_groq sentence_transformers xmltodict -q
```

## Usage
### Environment Setup
1. Import the necessary libraries and set up environment variables (including the Groq API key).
2. Initialize the Groq chat models used for data generation and critique.
3. Set up the HuggingFace BGE embeddings for document processing.
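A minimal sketch of this setup is shown below; it assumes the Groq API key is supplied via the `GROQ_API_KEY` environment variable, and the specific model and embedding names are illustrative rather than prescribed by the project.

```python
import os

from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# Groq API key (assumed to be available; replace with your own key or set it beforehand)
os.environ["GROQ_API_KEY"] = "your-groq-api-key"

# Chat models for data generation and critique
# (the model name below is illustrative, not mandated by the project)
data_generation_model = ChatGroq(model="llama3-70b-8192", temperature=0)
critic_model = ChatGroq(model="llama3-70b-8192", temperature=0)

# HuggingFace BGE embeddings for document processing
embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```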

### Load Documents from PubMed
Use the `PubMedLoader` from `langchain_community` to load documents related to a specific query (e.g., "cancer"). In this project, we load a maximum of 5 documents.

```python
from langchain_community.document_loaders import PubMedLoader

# Load up to 5 PubMed documents matching the query "cancer"
loader = PubMedLoader("cancer", load_max_docs=5)
documents = loader.load()
```

### Generate Test Sets
We use the `TestsetGenerator` from Ragas to generate test sets based on the loaded documents. The test set generation mixes three question types:

- Simple questions: 50%
- Multi-context questions: 40%
- Reasoning-based questions: 10%

The following code sets up the test set generation:

```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, multi_context, reasoning

# Build the generator from the Groq chat models and the BGE embeddings
generator = TestsetGenerator.from_langchain(
    data_generation_model,
    critic_model,
    embeddings
)

# Question-type distribution: 50% simple, 40% multi-context, 10% reasoning
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

# Generate a test set of 5 samples from the loaded documents
testset = generator.generate_with_langchain_docs(documents, 5, distributions)
test_df = testset.to_pandas()
```

### Output
The output is a Pandas DataFrame containing the generated test sets, which can be further analyzed or used for model evaluation.
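For example, the DataFrame can be inspected or exported for later evaluation runs; the column names mentioned in the comments reflect the usual Ragas test set layout and may vary by version.

```python
# Inspect the first few generated samples
print(test_df.head())

# Typical columns include question, contexts, ground_truth and evolution_type
# (exact names may vary by Ragas version)
print(test_df.columns.tolist())

# Persist the test set for later analysis or evaluation
test_df.to_csv("synthetic_testset.csv", index=False)
```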

## License
This project is licensed under the [MIT License](/LICENSE.txt).