https://github.com/gametimesf/gt_llm_evaluator

Regression Testing for our Chatbot
https://github.com/gametimesf/gt_llm_evaluator

Last synced: 4 months ago
JSON representation

Regression Testing for our Chatbot

Host: GitHub
URL: https://github.com/gametimesf/gt_llm_evaluator
Owner: gametimesf
Created: 2025-05-21T22:45:39.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2025-06-18T20:37:37.000Z (12 months ago)
Last Synced: 2026-02-05T20:29:40.043Z (4 months ago)
Language: Python
Size: 177 KB
Stars: 0
Watchers: 5
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# LLM Evaluator Service

A modular service for static LLM evaluation of LLM outputs.

## Requirements

- Python 3.12 or higher
- Dependencies are managed through `pyproject.toml`

## Setup

1. Install UV (if not already installed):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

2. Create and activate a virtual environment using UV:

```bash
uv venv
source .venv/bin/activate # On Unix/macOS
# or
.venv\Scripts\activate # On Windows
```

3. Install dependencies using UV:

```bash
uv pip install -e .
```

## Available Scripts

### Chatbot Evaluation Scripts

Located in `scripts/chatbot/`:

- `simulate_convo.py` - Creates simulated conversations for testing and evaluation purposes
- Usage: `uv run scripts/chatbot/simulate_convo.py`

- `pre_merge_check.py` - Runs validation checks before merging code changes
- Usage: `uv run scripts/chatbot/pre_merge_check.py`

- `nightly_report.py` - Generates daily evaluation reports
- Usage: `uv run scripts/chatbot/nightly_report.py`

### Main Evaluation Script

`convo_eval.py`

- Core evaluation script for analyzing conversations
- Usage: `uv run convo_eval.py`

### FAQ Generator Scripts

Located in `scripts/faq_generator/`:

- `faq_eval.py` - Evaluates FAQ content using DeepEval metrics
- Usage: `uv run scripts/faq_generator/faq_eval.py --input "Your prompt" --content "Generated FAQ content" --context "Reference material"`
- Required arguments:
- `--input`: The input prompt text used to generate the FAQ
- `--content`: The generated FAQ content to evaluate
- `--context`: The reference material or ground truth to check against
- Output: Generates a CSV file in `deepeval_results/faq_eval/` with evaluation metrics including:
- Hallucination score
- Evaluation reasoning
- Cost metrics
- Requirements:
- DEEPEVAL_API_KEY environment variable must be set
- Python 3.12 or higher
- DeepEval package installed

## Project Structure

- `src/` - Source code directory
- `scripts/` - Utility scripts for various tasks
- `chatbot/` - Chatbot evaluation and testing scripts
- `faq_generator/` - FAQ generation scripts
- `mock_data/` - Sample data for testing
- `deepeval_results/` - Output directory for evaluation results

## Dependencies

Main dependencies include:

- deepeval (>=2.7.6)
- deepteam (>=0.0.9)
- Google API Client Libraries
- python-dotenv

## Environment Variables

The project uses environment variables for configuration. Create a `.env` file in the root directory with necessary credentials and settings.

## Contributing

1. Follow the existing code structure and style
2. Update documentation as needed

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/gametimesf/gt_llm_evaluator

Awesome Lists containing this project

README