https://github.com/artefactop/promptdev
A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
- Host: GitHub
- URL: https://github.com/artefactop/promptdev
- Owner: artefactop
- License: mit
- Created: 2025-09-05T16:52:09.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-09-22T19:55:14.000Z (7 months ago)
- Last Synced: 2025-09-30T21:12:14.711Z (6 months ago)
- Topics: ci-cd, evaluation-framework, llm, llm-eval, llm-evaluation, llm-evaluation-framework, prompt, prompt-engineering, prompt-toolkit, red-team, testing
- Language: Python
- Size: 1.03 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Promptdev
[Python](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[Ruff](https://github.com/astral-sh/ruff)
[CI](https://github.com/artefactop/promptdev/actions/workflows/ci.yml)
[Coverage](https://codecov.io/gh/artefactop/promptdev)
[Security](https://github.com/artefactop/promptdev/actions/workflows/security.yml)
`promptdev` is a prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.

> [!WARNING]
>
> promptdev is in preview and is not ready for production use.
>
> We're working hard to make it stable and feature-complete, but until then, expect to encounter bugs,
> missing features, and fatal errors.
## Features
- **Type Safe** - Full Pydantic validation for inputs, outputs, and configurations
- **PydanticAI Integration** - Native support for PydanticAI agents (in progress) and its [evaluation framework](https://ai.pydantic.dev/evals/)
- **Multi-Provider Testing** - Test across OpenAI, Together.ai, Ollama, Bedrock, and [more](https://ai.pydantic.dev/models/overview/)
- **Performance Optimized** - File-based caching with TTL for faster repeated evaluations
- **Rich Reporting** - Beautiful console output with detailed failure analysis and provider comparisons
- **Promptfoo Compatible** - Works with (some) existing promptfoo YAML configs and datasets
- **Comprehensive Assertions** - Built-in evaluators plus custom Python assertion support
## Quick Start
### Installation
#### From PyPI (alpha version)
```bash
pip install promptdev --pre
```
#### From Source
```bash
git clone https://github.com/artefactop/promptdev.git
cd promptdev
pip install -e .
```
#### For Development
```bash
git clone https://github.com/artefactop/promptdev.git
cd promptdev
uv sync
uv run promptdev --help
```
### Basic Usage
#### If installed via pip:
```bash
# Run evaluation (simple demo)
promptdev eval examples/demo/config.yaml
# Run evaluation (advanced example)
promptdev eval examples/calendar_event_summary/config.yaml
# Disable caching for a run
promptdev eval examples/demo/config.yaml --no-cache
# Export results
promptdev eval examples/demo/config.yaml --output json
promptdev eval examples/demo/config.yaml --output html
# Validate configuration
promptdev validate examples/demo/config.yaml
# Cache management
promptdev cache stats
promptdev cache clear
```
#### If running from source:
Prefix the commands above with `uv run`, for example:
```bash
uv run promptdev --help
```
## Assertion Types
Promptdev supports a comprehensive set of evaluators for different testing scenarios:
| Type | Description |
|------|-------------|
| `equals` | Checks that the output exactly equals the provided value |
| `contains` | Checks that the output contains the expected value |
| `is_instance` | Checks that the output is an instance of the type with the given name |
| `max_duration` | Checks that the execution time is under the specified maximum |
| `is_json` | Checks that the output is a valid JSON string (with optional JSON schema validation) |
| `contains_json` | Checks that the output contains valid JSON (with optional JSON schema validation) |
| `python` | Runs a custom Python function to validate the LLM output ([promptfoo compatible](https://www.promptfoo.dev/docs/configuration/expected-outputs/python/#external-py)) |
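For example, several assertion types can be combined on a single test in a promptfoo-style `assert` block. This is a sketch only: the `vars` values are illustrative, the schema path is hypothetical, and the `max_duration` unit is an assumption.

```yaml
tests:
  - vars:
      city: "Berlin"
    assert:
      - type: contains
        value: "Berlin"
      - type: max_duration
        value: 5                        # assumed to be seconds
      - type: is_json
        value: schemas/weather.json     # optional JSON schema validation
```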
## Configuration
Promptdev uses YAML configuration files compatible with the [Promptfoo](https://www.promptfoo.dev/docs/configuration/reference/) format, though only a subset is supported for now.
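A minimal config might look like the following sketch; the prompt text, model IDs, and test values are illustrative, and the keys follow the promptfoo convention:

```yaml
# config.yaml -- minimal illustrative example
prompts:
  - "Answer briefly: {{question}}"

providers:
  - openai:gpt-4o-mini
  - ollama:llama3.2

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
```

Run it with `promptdev eval config.yaml`.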
### Promptfoo Compatibility
Promptdev maintains compatibility with promptfoo configurations to ease migration:
> To migrate, if your provider IDs use the format `provider:chat|completion:model`, remove the middle part so they become `provider:model`; promptdev only supports chat.
>
> Some provider names have changed; for example, `togetherai` is now `together`. Refer to the [pydantic_ai models](https://ai.pydantic.dev/models/overview/) overview for the full list.
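For instance, a promptfoo provider list and its promptdev equivalent might look like this (model names are illustrative):

```yaml
# promptfoo
providers:
  - openai:chat:gpt-4o-mini
  - togetherai:chat:meta-llama/Llama-3-8b-chat-hf

# promptdev
providers:
  - openai:gpt-4o-mini
  - together:meta-llama/Llama-3-8b-chat-hf
```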
- **YAML configs** - Most promptfoo YAML configs work with minimal changes
- **JSONL datasets** - Existing test datasets are fully supported
- **Python assertions** - Custom `get_assert` functions work without modification
- **JSON schemas** - Schema validation uses the same format
> [!WARNING]
> Promptdev can run custom Python assertions. While powerful,
> running arbitrary Python code always comes with [security issues](https://github.com/pydantic/pydantic-ai/pull/2808).
> Use this feature only with code you trust.
Example of a Python assertion:
```python
# tests/data/python_assert.py
from typing import Any

def get_assert(output: str, context: dict) -> bool | float | dict[str, Any]:
    """Test assertion that checks if output contains 'success'."""
    return "success" in str(output).lower()
```
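Per the return annotation, `get_assert` can also return a float score or a dict with details. The sketch below assumes promptfoo's grading-result keys (`pass`, `score`, `reason`) carry over to promptdev; the file name is hypothetical.

```python
# tests/data/python_assert_scored.py -- hypothetical variant
from typing import Any

def get_assert(output: str, context: dict) -> bool | float | dict[str, Any]:
    """Return a graded result instead of a plain bool."""
    ok = "success" in str(output).lower()
    return {
        "pass": ok,                      # did the assertion pass?
        "score": 1.0 if ok else 0.0,     # score in [0, 1]
        "reason": "'success' found" if ok else "'success' missing from output",
    }
```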
## Development
```bash
# Setup development environment
uv sync
# Run tests
uv run pytest
# Format and lint code
uv run ruff check . --fix
uv run ruff format .
# Type checking
uv run ty check
```
## Roadmap
- [x] Core evaluation engine with PydanticAI integration
- [x] Multi-provider support for major AI platforms
- [x] YAML configuration loading with promptfoo compatibility
- [x] Comprehensive assertion types (JSON schema, Python, LLM-based)
- [x] File-based caching system with TTL support
- [x] Rich console reporting with failure analysis
- [x] Simple file disk cache
- [x] Better integration with PydanticAI (don't reinvent the wheel)
- [x] Concurrent execution using PydanticAI natively, for faster large-scale evaluations
- [ ] Code cleanup
- [ ] Testing
- [ ] Testing promptfoo files
- [ ] Native support for PydanticAI agents
- [ ] Add support to run multiple config files with one command
- [ ] CI/CD integration helpers with change detection
- [ ] SQLite persistence for evaluation history and analytics
- [ ] Performance benchmarking and regression detection
## Contributing
We welcome contributions! Here's how to get started:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Install development dependencies: `uv sync`
4. Make your changes and add tests
5. Run tests: `uv run pytest`
6. Commit your changes: `git commit -m 'Add amazing feature'`
7. Push to the branch: `git push origin feature/amazing-feature`
8. Open a Pull Request
### Code Style
We use `ruff` for code formatting and linting, `ty` for type checking, and `pytest` for testing. Please ensure your code follows these standards:
```bash
uv run ruff check . # Lint code
uv run ruff format . # Format code
uv run ty check # Type checking
uv run pytest # Run tests
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Built on [PydanticAI](https://ai.pydantic.dev/) for type-safe AI agent development
- Inspired by [promptfoo](https://github.com/promptfoo/promptfoo) for evaluation concepts
- Uses [Rich](https://github.com/Textualize/rich) for beautiful console output