{"id":48291024,"url":"https://github.com/mustafaautomation/llm-testing-toolkit","last_synced_at":"2026-04-04T23:06:10.668Z","repository":{"id":343044989,"uuid":"1164955244","full_name":"mustafaautomation/llm-testing-toolkit","owner":"mustafaautomation","description":"Provider-agnostic LLM testing framework — regression, hallucination, quality, and toxicity evaluators for OpenAI, Anthropic, and custom APIs","archived":false,"fork":false,"pushed_at":"2026-03-08T15:34:12.000Z","size":359,"stargazers_count":0,"open_issues_count":10,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-08T19:33:01.028Z","etag":null,"topics":["ai","ai-testing","anthropic","ci-cd","evaluation","hallucination-detection","llm","nlp","openai","qa-automation","regression-testing","test-automation","testing","toxicity-detection","typescript"],"latest_commit_sha":null,"homepage":"https://quvantic.com","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mustafaautomation.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-23T17:06:35.000Z","updated_at":"2026-03-08T15:32:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mustafaautomation/llm-testing-toolkit","commit_stats":null,"previous_names":["mustafaautomation/llm-testing-toolkit"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/mustafaautomation/llm-testing-toolkit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mustafaautomation%2Fllm-testing-toolkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mustafaautomation%2Fllm-testing-toolkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mustafaautomation%2Fllm-testing-toolkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mustafaautomation%2Fllm-testing-toolkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mustafaautomation","download_url":"https://codeload.github.com/mustafaautomation/llm-testing-toolkit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mustafaautomation%2Fllm-testing-toolkit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31418288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T20:09:54.854Z","status":"ssl_error","status_checked_at":"2026-04-04T20:09:44.350Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-testing","anthropic","ci-cd","evaluation","hallucination-detection","llm","nlp","openai","qa-automation","regression-testing","test-automation","testing","toxicity-detection","typescript"],"created_at":"2026-04-04T23:06:09.802Z","updated_at":"2026-04-04T23:06:10.651Z","avatar_url":"https://github.com/mustafaautomation.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llm-testing-toolkit\n\n[![CI](https://github.com/mustafaautomation/llm-testing-toolkit/actions/workflows/llm-tests.yml/badge.svg)](https://github.com/mustafaautomation/llm-testing-toolkit/actions)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n[![Node.js](https://img.shields.io/badge/Node.js-18+-339933.svg?logo=node.js\u0026logoColor=white)](https://nodejs.org)\n[![TypeScript](https://img.shields.io/badge/TypeScript-strict-3178c6.svg?logo=typescript\u0026logoColor=white)](https://www.typescriptlang.org)\n[![Docker](https://img.shields.io/badge/Docker-Ready-2496ED.svg?logo=docker\u0026logoColor=white)](Dockerfile)\n\nA provider-agnostic LLM testing framework for evaluating AI model outputs with regression, hallucination, quality, and toxicity checks. Built with TypeScript, designed for CI/CD pipelines.\n\n---\n\n## Table of Contents\n\n- [Why?](#why)\n- [Demo](#demo)\n- [Quick Start](#quick-start)\n- [Architecture](#architecture)\n- [Features](#features)\n- [CLI](#cli)\n- [Programmatic API](#programmatic-api)\n- [CI/CD Integration](#cicd-integration)\n- [Project Structure](#project-structure)\n- [Development](#development)\n\n---\n\n## Why?\n\nLLM outputs are non-deterministic. Traditional testing approaches don't work. This toolkit provides structured evaluation strategies to catch:\n\n- **Regressions** -- responses drifting from expected baselines after model updates\n- **Hallucinations** -- fabricated facts not grounded in source material\n- **Quality issues** -- irrelevant, incoherent, or poorly formatted responses\n- **Safety violations** -- toxic content, PII leakage, blocked terms\n\n---\n\n## Demo\n\n```\n$ npm test\n\n ✓ tests/unit/similarity.test.ts (21 tests) 4ms\n ✓ tests/unit/regression.test.ts (7 tests) 3ms\n ✓ tests/unit/hallucination.test.ts (4 tests) 3ms\n ✓ tests/unit/toxicity.test.ts (10 tests) 3ms\n ✓ tests/unit/quality.test.ts (11 tests) 3ms\n\n Test Files  5 passed (5)\n      Tests  53 passed (53)\n   Duration  339ms\n```\n\n\u003e **53 unit tests** covering all 4 evaluators and similarity utilities. Tests run in under 400ms.\n\n---\n\n## Quick Start\n\n```bash\n# Install\nnpm install llm-testing-toolkit\n\n# Initialize config\nnpx llm-test init\n\n# Set API keys (see .env.example)\nexport OPENAI_API_KEY=sk-...\n\n# Run tests\nnpx llm-test run\n```\n\n---\n\n## Architecture\n\n```\n┌─────────────────────────────────────────────────────┐\n│                    CLI / API                         │\n├─────────────────────────────────────────────────────┤\n│                  Test Runner                         │\n│          (parallel execution, retries)               │\n├──────────┬──────────┬──────────┬────────────────────┤\n│ Regression│Hallucin. │ Quality  │    Toxicity        │\n│ Evaluator │Evaluator │Evaluator │    Evaluator       │\n├──────────┴──────────┴──────────┴────────────────────┤\n│              Provider Layer (fetch-based)             │\n│         OpenAI  │  Anthropic  │  Custom HTTP         │\n├─────────────────────────────────────────────────────┤\n│              Reporters                               │\n│         Console  │  JSON  │  HTML                    │\n└─────────────────────────────────────────────────────┘\n```\n\n---\n\n## Features\n\n### Provider-Agnostic\n\nTest any LLM through a unified interface. No SDK lock-in -- uses raw `fetch`.\n\n```typescript\nimport { ToolkitConfig } from 'llm-testing-toolkit';\n\nconst config: ToolkitConfig = {\n  defaultProvider: 'openai',\n  providers: {\n    openai: {\n      type: 'openai',\n      apiKey: '$OPENAI_API_KEY',\n      model: 'gpt-4o-mini',\n    },\n    anthropic: {\n      type: 'anthropic',\n      apiKey: '$ANTHROPIC_API_KEY',\n      model: 'claude-sonnet-4-6',\n    },\n    custom: {\n      type: 'custom',\n      endpoint: 'https://your-api.com/v1/chat',\n      headers: { Authorization: 'Bearer $CUSTOM_API_KEY' },\n    },\n  },\n  suites: [],\n  reporters: [{ type: 'console' }, { type: 'html' }],\n};\n```\n\n### Regression Testing\n\nCompare LLM responses against saved baselines using semantic similarity.\n\n```typescript\n{\n  name: 'greeting-consistency',\n  prompt: 'Greet the user in a friendly way.',\n  evaluators: [{\n    type: 'regression',\n    options: {\n      similarityThreshold: 0.85,\n      keyPhraseThreshold: 0.7,\n      mode: 'combined', // 'exact' | 'semantic' | 'combined'\n    },\n  }],\n  baseline: 'Hello! How can I help you today?',\n}\n```\n\nUpdate baselines automatically:\n```bash\nnpx llm-test run --update-baselines\n```\n\n### Hallucination Detection\n\nVerify responses stay grounded in source material.\n\n```typescript\n{\n  name: 'grounded-summary',\n  prompt: 'Summarize the provided context.',\n  evaluators: [{\n    type: 'hallucination',\n    options: { groundingThreshold: 0.7 },\n  }],\n  context: 'TypeScript is developed by Microsoft...',\n}\n```\n\nEvaluates:\n- Claim extraction from response\n- Per-claim grounding score against source documents\n- Contradiction detection\n\n### Quality Evaluation\n\nMulti-dimensional response quality scoring.\n\n```typescript\n{\n  name: 'json-output',\n  prompt: 'Return a JSON user profile.',\n  evaluators: [{\n    type: 'quality',\n    options: {\n      expectedFormat: 'json',\n      jsonSchema: { required: ['name', 'email'] },\n      relevanceThreshold: 0.6,\n    },\n  }],\n}\n```\n\nScores: relevance, coherence, format compliance, completeness.\n\n### Toxicity \u0026 Safety\n\nDetect harmful content and PII leakage.\n\n```typescript\n{\n  name: 'safe-response',\n  prompt: 'Explain password security.',\n  evaluators: [{\n    type: 'toxicity',\n    options: {\n      sensitivity: 'high',\n      checkPII: true,\n      customBlocklist: ['company-secret'],\n    },\n  }],\n}\n```\n\nDetects: blocked terms, email addresses, phone numbers, SSNs, credit card numbers, IP addresses.\n\n---\n\n## CLI\n\n```bash\n# Run all test suites\nnpx llm-test run\n\n# Run specific suite\nnpx llm-test run --suite regression\n\n# Multiple reporters\nnpx llm-test run --reporter console,html,json\n\n# Custom config path\nnpx llm-test run --config ./my-config.ts\n\n# Update regression baselines\nnpx llm-test run --update-baselines\n\n# Verbose output\nnpx llm-test run -v\n\n# Initialize project\nnpx llm-test init\n```\n\n---\n\n## Programmatic API\n\n```typescript\nimport {\n  RegressionEvaluator,\n  HallucinationEvaluator,\n  QualityEvaluator,\n  ToxicityEvaluator,\n} from 'llm-testing-toolkit';\n\n// Use evaluators directly\nconst regression = new RegressionEvaluator({ similarityThreshold: 0.8 });\nconst result = regression.evaluate(actualResponse, baselineResponse);\nconsole.log(result.passed, result.score);\n\n// Full test runner\nimport { TestRunner, loadConfig } from 'llm-testing-toolkit';\nconst config = await loadConfig();\nconst runner = new TestRunner(config);\nconst results = await runner.run();\n```\n\n---\n\n## CI/CD Integration\n\nThe included GitHub Actions workflow:\n\n1. Runs lint, format, type check on Node 18 \u0026 20\n2. Executes unit tests with full validation\n3. Builds the package to verify publishability\n\nAdd your API keys as repository secrets for LLM evaluation:\n- `OPENAI_API_KEY`\n- `ANTHROPIC_API_KEY`\n\n---\n\n## Reports\n\n### Console\nColorized terminal output with pass/fail indicators and score breakdowns.\n\n### HTML\nSingle-file visual report with expandable test details, scores, and prompt/response diffs. Dark theme, zero external dependencies.\n\n### JSON\nMachine-readable output for programmatic analysis and dashboards.\n\n---\n\n## Project Structure\n\n```\nllm-testing-toolkit/\n├── .github/\n│   ├── workflows/llm-tests.yml    # CI pipeline (Node 18/20 matrix)\n│   ├── dependabot.yml             # Automated dependency updates\n│   ├── CODEOWNERS                 # Review ownership\n│   └── pull_request_template.md   # PR checklist\n├── src/\n│   ├── providers/                 # LLM provider adapters\n│   │   ├── base.provider.ts       # Abstract base with timedCall\n│   │   ├── openai.provider.ts     # OpenAI chat completions\n│   │   ├── anthropic.provider.ts  # Anthropic messages API\n│   │   └── custom.provider.ts     # Any HTTP-based LLM\n│   ├── evaluators/                # Evaluation strategies\n│   │   ├── regression.evaluator.ts    # Baseline comparison\n│   │   ├── hallucination.evaluator.ts # Grounding verification\n│   │   ├── quality.evaluator.ts       # Multi-dimensional scoring\n│   │   └── toxicity.evaluator.ts      # Safety \u0026 PII detection\n│   ├── reporters/                 # Output formatters\n│   │   ├── console.reporter.ts    # Colorized terminal output\n│   │   ├── html.reporter.ts       # Dark theme HTML report\n│   │   └── json.reporter.ts       # Machine-readable JSON\n│   ├── core/                      # Framework core\n│   │   ├── runner.ts              # Test runner (parallel, retries)\n│   │   ├── config.ts              # Config loader + env resolution\n│   │   └── suite.ts               # Type definitions\n│   ├── utils/                     # Shared utilities\n│   │   ├── similarity.ts          # Cosine, Levenshtein, Dice\n│   │   └── logger.ts              # Colored structured logging\n│   ├── cli.ts                     # Command-line interface\n│   └── index.ts                   # Public API exports\n├── tests/unit/                    # 53 unit tests\n├── examples/                      # Example test suite configs\n├── CONTRIBUTING.md\n├── SECURITY.md\n├── Dockerfile\n└── docker-compose.yml\n```\n\n---\n\n## Development\n\n```bash\ngit clone https://github.com/mustafaautomation/llm-testing-toolkit.git\ncd llm-testing-toolkit\nnpm install\nnpm test              # Run unit tests\nnpm run typecheck     # Type checking\nnpm run lint          # ESLint\nnpm run format:check  # Prettier\nnpm run build         # Compile TypeScript\n```\n\n---\n\n## License\n\nMIT\n\n---\n\nBuilt by [Quvantic](https://quvantic.com)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmustafaautomation%2Fllm-testing-toolkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmustafaautomation%2Fllm-testing-toolkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmustafaautomation%2Fllm-testing-toolkit/lists"}