{"id":50396209,"url":"https://github.com/Aysnc-Labs/llm-eval","last_synced_at":"2026-06-16T13:00:38.386Z","repository":{"id":337608742,"uuid":"1131636156","full_name":"Aysnc-Labs/llm-eval","owner":"Aysnc-Labs","description":"A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correctly.","archived":false,"fork":false,"pushed_at":"2026-02-10T14:08:24.000Z","size":231,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-10T22:58:06.840Z","etag":null,"topics":["llm","llm-eval","llm-evaluation","llm-evaluation-framework","llm-evaluation-toolkit","php"],"latest_commit_sha":null,"homepage":"https://packagist.org/packages/aysnc/llm-eval","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Aysnc-Labs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-10T12:05:01.000Z","updated_at":"2026-02-10T14:10:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Aysnc-Labs/llm-eval","commit_stats":null,"previous_names":["aysnc-labs/llm-eval"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Aysnc-Labs/llm-eval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aysnc-Labs%2Fllm-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aysnc-Labs%2Fllm-eval/tags","releases_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aysnc-Labs%2Fllm-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aysnc-Labs%2Fllm-eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Aysnc-Labs","download_url":"https://codeload.github.com/Aysnc-Labs/llm-eval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aysnc-Labs%2Fllm-eval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34406824,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","llm-eval","llm-evaluation","llm-evaluation-framework","llm-evaluation-toolkit","php"],"created_at":"2026-05-30T21:01:21.423Z","updated_at":"2026-06-16T13:00:38.374Z","avatar_url":"https://github.com/Aysnc-Labs.png","language":"PHP","funding_links":[],"categories":["Browse The Shelves"],"sub_categories":["Agent evals"],"readme":"# LLM-Eval\n\n![GitHub Actions](https://github.com/Aysnc-Labs/llm-eval/actions/workflows/test.yml/badge.svg)\n![Maintenance](https://img.shields.io/badge/Actively%20Maintained-yes-green.svg)\n\nA PHP package for evaluating LLM outputs. 
Test your prompts, validate responses, and ensure your AI features work correctly.\n\n## Installation\n\n```bash\ncomposer require aysnc/llm-eval\n```\n\n## Configuration\n\nCreate `llm-eval.php` in your project root:\n\n```php\n\u003c?php\n\nuse Aysnc\\AI\\LlmEval\\Providers\\AnthropicProvider;\n\nreturn [\n    'provider'    =\u003e new AnthropicProvider(getenv('ANTHROPIC_API_KEY')),\n    'directory'   =\u003e __DIR__ . '/evals',\n    'cache'       =\u003e true,\n    'cacheTtl'    =\u003e 0,\n    'parallel'    =\u003e false,\n    'concurrency' =\u003e 5,\n];\n```\n\n| Option | Type | Default | Description |\n|---|---|---|---|\n| `provider` | `ProviderInterface` | — | The LLM provider shared across all eval files |\n| `directory` | `string` | `'evals'` | Directory containing your eval files |\n| `cache` | `bool\\|string` | `false` | `true` uses `.llm-cache/`, or pass a custom path |\n| `cacheTtl` | `int` | `0` | Cache lifetime in seconds (`0` = forever) |\n| `parallel` | `bool` | `false` | Run evals in parallel by default |\n| `concurrency` | `int` | `0` | Max concurrent requests when parallel (`0` = unlimited) |\n\n## Quick Start\n\n**1. Create an eval file** in your evals directory. Each file **returns** an `LlmEval` instance:\n\n```php\n// evals/simple.php\n\u003c?php\n\nuse Aysnc\\AI\\LlmEval\\Dataset\\Dataset;\nuse Aysnc\\AI\\LlmEval\\LlmEval;\n\n$dataset = Dataset::fromArray([\n    ['prompt' =\u003e 'What is 2+2? Reply with just the number.', 'expected' =\u003e '4'],\n    ['prompt' =\u003e 'What is the capital of France? Reply with just the city name.', 'expected' =\u003e 'Paris'],\n    ['prompt' =\u003e 'Is the sky blue? Reply with just yes or no.', 'expected' =\u003e 'yes'],\n]);\n\nreturn LlmEval::create('quick-start')\n    -\u003edataset($dataset)\n    -\u003eassertions(function ($expect, $testCase): void {\n        $expect-\u003econtains($testCase-\u003egetExpected(), caseSensitive: false);\n    });\n```\n\n**2. 
Run it:**\n\n```bash\nvendor/bin/llm-eval run\n```\n\n```\nLLM-Eval Runner\n===============\n\n  PASS quick-start - Case 0\n  PASS quick-start - Case 1\n  PASS quick-start - Case 2\n\nSummary\n-------\n  Total       3\n  Passed      3\n  Failed      0\n  Pass Rate   100.0%\n  Duration    1.24s\n```\n\nThe config provides the LLM provider, the eval file defines what to test — no `-\u003eprovider()` or `-\u003erunAll()` needed in the file.\n\n## Core Concepts\n\nAn evaluation has three parts: a **provider** (which LLM to call), a **dataset** (prompts + expected answers), and **assertions** (how to check the response).\n\n### Datasets\n\nA dataset is a collection of test cases. Each test case has a `prompt` and an optional `expected` value.\n\n```php\n// Inline array\n$dataset = Dataset::fromArray([\n    ['prompt' =\u003e 'What is 2+2?', 'expected' =\u003e '4'],\n]);\n\n// CSV file (columns: prompt, expected)\n$dataset = Dataset::fromCsv(__DIR__ . '/data/capitals.csv');\n\n// JSON file (array of objects with prompt + expected keys)\n$dataset = Dataset::fromJson(__DIR__ . '/data/questions.json');\n```\n\nThe `expected` key can be a single value or multiple named values:\n\n```php\n// Single — accessed via $testCase-\u003egetExpected()\n['prompt' =\u003e 'What is 2+2?', 'expected' =\u003e '4']\n\n// Multiple — accessed via $testCase-\u003egetExpected('name'), $testCase-\u003egetExpected('age')\n['prompt' =\u003e 'Return JSON with name and age.', 'expected' =\u003e ['name' =\u003e 'Alice', 'age' =\u003e '30']]\n```\n\nCSV files use column prefixes for multiple values: `expected_name`, `expected_age`.\n\nAny keys that aren't `prompt` or `expected` become metadata, accessible via `$testCase-\u003egetData('key')`.\n\n### Assertions\n\nAssertions define what \"correct\" means for a response. 
You chain them inside the `assertions()` callback.\n\n**Text**\n\n```php\n$expect-\u003econtains('Paris');\n$expect-\u003econtains('paris', caseSensitive: false);\n$expect-\u003enotContains('London');\n$expect-\u003ematchesRegex('/\\d{4}-\\d{2}-\\d{2}/');\n$expect-\u003eminLength(10);\n$expect-\u003emaxLength(500);\n```\n\n**JSON**\n\n```php\n$expect-\u003eisJson();\n```\n\n**Custom**\n\n```php\n$expect-\u003eassert(new MyCustomAssertion());\n```\n\nThere are also assertions for [tool calls](#tool-call-testing), [multi-turn conversations](#multi-turn-conversations), and [LLM-as-judge](#llm-as-judge) — covered in the sections below.\n\n## Testing Scenarios\n\n### Structured Output\n\nValidate that the LLM returns well-formed JSON with the right content. Combine `isJson()` with `contains()` or multiple expected values.\n\n```php\n// evals/json-output.php\n$dataset = Dataset::fromArray([\n    [\n        'prompt' =\u003e 'Return a JSON object with keys \"name\" and \"age\". Use name \"Alice\" and age 30. Only output JSON.',\n        'expected' =\u003e ['name' =\u003e 'Alice', 'age' =\u003e '30'],\n    ],\n    [\n        'prompt' =\u003e 'Return a JSON array of three colors: red, green, blue. Only output JSON.',\n        'expected' =\u003e 'red',\n    ],\n]);\n\nreturn LlmEval::create('json-output')\n    -\u003edataset($dataset)\n    -\u003eassertions(function ($expect, $testCase): void {\n        $expect-\u003eisJson()\n            -\u003econtains($testCase-\u003egetExpected())\n            -\u003econtains($testCase-\u003egetExpected('name'));\n    });\n```\n\n### Tool Call Testing\n\nTest that your LLM calls tools with the right parameters — without executing a full conversation loop. 
This uses `LlmEval::create()` (not `createConversation`) since you're only checking the first response.\n\n```php\n// evals/tool-test.php\n$tools = [\n    [\n        'name' =\u003e 'get_weather',\n        'description' =\u003e 'Get weather for a location',\n        'input_schema' =\u003e [\n            'type' =\u003e 'object',\n            'properties' =\u003e [\n                'location' =\u003e ['type' =\u003e 'string'],\n            ],\n            'required' =\u003e ['location'],\n        ],\n    ],\n];\n\n// Illustrative dataset: a prompt that should trigger a get_weather call for Paris\n$dataset = Dataset::fromArray([\n    ['prompt' =\u003e 'What is the weather in Paris?'],\n]);\n\nreturn LlmEval::create('tool-test')\n    -\u003eoption('tools', $tools)\n    -\u003edataset($dataset)\n    -\u003eassertions(function ($expect): void {\n        $expect-\u003ecalledTool('get_weather');\n        $expect-\u003etoolCallHasParam('get_weather', 'location', 'Paris');\n    });\n```\n\n**Available tool call assertions:**\n\n```php\n$expect-\u003ecalledTool('get_weather');\n$expect-\u003ecalledTool('get_weather', times: 2);\n$expect-\u003etoolCallHasParam('get_weather', 'location');\n$expect-\u003etoolCallHasParam('get_weather', 'location', 'Paris');\n$expect-\u003ecalledToolCount(3);\n$expect-\u003edidNotCallTool('dangerous_function');\n```\n\n### Multi-Turn Conversations\n\nTest agentic workflows where the LLM calls tools, receives results, and continues reasoning. 
Use `LlmEval::createConversation()` with a tool executor that returns simulated results.\n\n```php\n// evals/math-agent.php\nuse Aysnc\\AI\\LlmEval\\Dataset\\Dataset;\nuse Aysnc\\AI\\LlmEval\\LlmEval;\nuse Aysnc\\AI\\LlmEval\\Providers\\CallableToolExecutor;\nuse Aysnc\\AI\\LlmEval\\Providers\\ToolCall;\nuse Aysnc\\AI\\LlmEval\\Providers\\ToolResult;\n\n$tools = [\n    [\n        'name' =\u003e 'calculate',\n        'description' =\u003e 'Evaluate a math expression',\n        'input_schema' =\u003e [\n            'type' =\u003e 'object',\n            'properties' =\u003e [\n                'expression' =\u003e ['type' =\u003e 'string'],\n            ],\n            'required' =\u003e ['expression'],\n        ],\n    ],\n];\n\n$executor = new CallableToolExecutor([\n    'calculate' =\u003e function (ToolCall $tc): ToolResult {\n        $expr = $tc-\u003egetParam('expression');\n        $result = match ($expr) {\n            '6 * 7', '6*7' =\u003e '42',\n            default =\u003e 'unknown',\n        };\n\n        return new ToolResult($tc-\u003eid, $result);\n    },\n]);\n\n$dataset = Dataset::fromArray([\n    ['prompt' =\u003e 'Use the calculate tool to compute 6 * 7.', 'expected' =\u003e '42'],\n]);\n\nreturn LlmEval::createConversation('math-agent')\n    -\u003ewithTools($tools)\n    -\u003eexecutor($executor)\n    -\u003edataset($dataset)\n    -\u003eassertions(function ($expect, $testCase): void {\n        $expect-\u003econtains($testCase-\u003egetExpected())\n            -\u003eusedTool('calculate')\n            -\u003eturnCount(2);\n    });\n```\n\n**Available conversation assertions:**\n\n```php\n$expect-\u003eturnCount(2);\n$expect-\u003eusedTool('calculate');\n$expect-\u003econversationContains('42');\n```\n\n#### Multi-Turn Datasets\n\nUse a `turns` array to test conversations with multiple user messages. Each turn has its own `prompt` and optional `expected` values for per-turn assertions. 
Use `getTurn()` to access the 1-indexed turn number.\n\n```php\n$dataset = Dataset::fromArray([\n    [\n        'turns' =\u003e [\n            ['prompt' =\u003e 'What is the weather in Paris?', 'expected' =\u003e '22'],\n            ['prompt' =\u003e 'Now check Tokyo', 'expected' =\u003e '18'],\n            ['prompt' =\u003e 'Which city was warmer?', 'expected' =\u003e 'Paris'],\n        ],\n    ],\n]);\n\nreturn LlmEval::createConversation('multi-turn')\n    -\u003ewithTools($tools)\n    -\u003eexecutor($executor)\n    -\u003edataset($dataset)\n    -\u003eassertions(function ($expect, $testCase): void {\n        $expect-\u003econtains($testCase-\u003egetExpected());\n\n        if ($testCase-\u003egetTurn() \u003c= 2) {\n            $expect-\u003eusedTool('get_weather');\n        }\n    });\n```\n\n### LLM-as-Judge\n\nUse one LLM to evaluate another's response quality. Instead of checking for exact strings, you describe what \"good\" looks like and a judge model scores the response 0-100%.\n\n```php\n// evals/quality-check.php\nuse Aysnc\\AI\\LlmEval\\Providers\\AnthropicProvider;\n\n$judge = new AnthropicProvider(getenv('ANTHROPIC_API_KEY'));\n\nreturn LlmEval::create('quality-check')\n    -\u003edataset($dataset)\n    -\u003eassertions(function ($expect) use ($judge): void {\n        $expect-\u003ejudgedBy(\n            judge: $judge,\n            criteria: 'Is this response helpful, accurate, and concise?',\n            threshold: 0.8,\n        );\n    });\n```\n\n#### Judging Conversations\n\nFor multi-turn conversations, you can use `judgedBy()` inside `assertions()` to judge per-turn, or use `-\u003ejudge()` on the eval to run a single evaluation after all turns complete. 
The judge receives the full conversation history — all messages, tool calls, and results.\n\n```php\nreturn LlmEval::createConversation('multi-turn')\n    -\u003ewithTools($tools)\n    -\u003eexecutor($executor)\n    -\u003edataset($dataset)\n    -\u003eassertions(function ($expect, $testCase): void {\n        $expect-\u003econtains($testCase-\u003egetExpected());\n    })\n    -\u003ejudge($judge, 'Did the model correctly identify the warmer city based on the earlier temperatures?');\n```\n\n## CLI Runner\n\n```bash\n# Run all eval files in the evals directory\nvendor/bin/llm-eval run\n\n# Run a specific eval file\nvendor/bin/llm-eval run my-test\n\n# Run in parallel\nvendor/bin/llm-eval run --parallel --concurrency=10\n\n# Verbose mode — shows judge reasoning and tool calls for passing tests\nvendor/bin/llm-eval run -v\n\n# JSON output\nvendor/bin/llm-eval run --format=json\n\n# Clear response cache\nvendor/bin/llm-eval cache:clear\n\n# Scaffold a new eval file\nvendor/bin/llm-eval init\n```\n\n### Output\n\n```\nLLM-Eval Runner\n===============\n\nRunning evaluations...\n\n  PASS simple - Case 0\n  PASS simple - Case 1\n  FAIL simple - Case 2\n       Got: \"The sky appears blue due to Rayleigh scattering...\"\n       → Text does not contain \"yes\"\n  PASS conversation-json - compare-two-cities - Turn 1\n  PASS conversation-json - compare-two-cities - Turn 2\n  PASS conversation-json - compare-two-cities - Turn 3\n       → Score: 100% (threshold: 70%) - The response correctly identifies Paris as the warmer city.\n  PASS llm-judge - photosynthesis\n       → Score: 95% (threshold: 70%) - Clear, accurate explanation mentioning plants and sunlight.\n\nSummary\n-------\n  Total       7\n  Passed      6\n  Failed      1\n  Pass Rate   85.7%\n  Duration    4.32s\n```\n\nWith `-v`, passing tests also show judge scores and tool call details.\n\n## Providers\n\n### Anthropic Claude\n\nDirect API access. 
Get your key at [console.anthropic.com](https://console.anthropic.com).\n\n```php\n$provider = new AnthropicProvider(\n    apiKey: getenv('ANTHROPIC_API_KEY'),\n);\n```\n\nDefault model: `claude-sonnet-4-20250514`\n\n### AWS Bedrock\n\nUses the Converse API — works with Claude, Titan, Llama, Mistral, and other Bedrock models. Requires `composer require aws/aws-sdk-php`. See [AWS Bedrock docs](https://docs.aws.amazon.com/bedrock/).\n\n```php\nuse Aysnc\\AI\\LlmEval\\Providers\\BedrockProvider;\n\n// Explicit credentials\n$provider = new BedrockProvider(\n    region: 'us-east-1',\n    accessKeyId: 'AKIA...',\n    secretAccessKey: 'secret...',\n);\n\n// Or default credential chain (env vars, ~/.aws/credentials, IAM role)\n$provider = new BedrockProvider(region: 'us-east-1');\n```\n\nDefault model: `anthropic.claude-3-5-sonnet-20241022-v2:0`\n\n### Changing the Model\n\nUse `-\u003emodel()` to override the default model for any provider:\n\n```php\nreturn LlmEval::create('eval-name')\n    -\u003emodel('claude-opus-4-20250514')\n    -\u003edataset($dataset)\n    -\u003eassertions($assertions);\n```\n\nThis works with both `AnthropicProvider` (Anthropic model IDs) and `BedrockProvider` (Bedrock model IDs).\n\nYou can also set `-\u003emaxTokens(2048)` to override the default max tokens (1024).\n\n## Programmatic API\n\nIf you need to run evals from PHP code — inside a test suite, a CI script, or anywhere you want to work with the results directly — use `-\u003eprovider()` and `-\u003erunAll()`:\n\n```php\n$provider = new AnthropicProvider(getenv('ANTHROPIC_API_KEY'));\n\n$results = LlmEval::create('quick-start')\n    -\u003eprovider($provider)\n    -\u003edataset($dataset)\n    -\u003eassertions(function ($expect, $testCase): void {\n        $expect-\u003econtains($testCase-\u003egetExpected());\n    })\n    -\u003erunAll();\n\necho \"Pass rate: {$results-\u003epassRatePercent()}\\n\";\n// Pass rate: 100.0%\n```\n\n## Requirements\n\n- PHP 8.3+\n- `guzzlehttp/guzzle` 
^7.10\n- `aws/aws-sdk-php` ^3.0 (optional, for Bedrock)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAysnc-Labs%2Fllm-eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAysnc-Labs%2Fllm-eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAysnc-Labs%2Fllm-eval/lists"}