https://github.com/evaliphy/evaliphy

The E2E AI testing tool | No ML Overhead

Host: GitHub
URL: https://github.com/evaliphy/evaliphy
Owner: Evaliphy
License: mit
Created: 2026-04-03T08:04:27.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-05-05T10:02:59.000Z (3 months ago)
Last Synced: 2026-05-05T11:32:40.194Z (3 months ago)
Topics: ai, ai-test-automation, ai-testing, ai-testing-tool, end-to-end-testing, llm-evaluation, llm-evaluation-framework, llm-evaluation-toolkit, llm-testing, rag, rag-evaluation, rag-pipeline, test-automation, test-automation-framework, testing-tools
Language: TypeScript
Homepage: https://evaliphy.com
Size: 33.1 MB
Stars: 16
Watchers: 1
Forks: 9
Open Issues: 16
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

awesome-testing - Evaliphy - Test your AI system end-to-end with Evaliphy. It uses a Playwright-style testing approach and generates HTML reports. (Software / AI & LLM Testing)

README

# Evaliphy

E2E AI system testing tool

Quick start · Assertions · LLM Providers · CI Integration · Project Structure



---

> ⭐️ Star to stay updated. [Contributions welcome!](#contributing)

---

Evaliphy is an end-to-end testing tool that treats your AI system as a black box. Write assertions against your real API, get structured results, and catch regressions in CI, without touching your pipeline internals or writing judge prompts from scratch.

Built-in LLM-as-Judge assertions handle the hard parts. You focus on writing evaluations, not wiring up models.

![Evaliphy Demo](./docs/gif/demo.gif)

---

## Prerequisites

- Node.js 24.0.0 or higher

- An OpenAI API key or any OpenAI-compatible provider

- A running AI application with an HTTP endpoint

---

## Quick start

### 1. Install and initialise

```bash
npm install -g @evaliphy/sdk
evaliphy init my-eval-project
cd my-eval-project
npm install
```

### 2. Set your environment variables

```bash
cp .env.example .env
```

Add your API key to `.env`:

```
OPENAI_API_KEY=your-api-key-here
```

### 3. Configure Evaliphy

Open `evaliphy.config.ts` and point it at your AI application:

```typescript
import { defineConfig } from "@evaliphy/sdk";

export default defineConfig({
  http: {
    baseUrl: "https://api.your-service.com",
    timeout: 10_000,
    headers: {
      Authorization: `Bearer ${process.env.API_KEY}`,
    },
  },
  llmAsJudgeConfig: {
    model: "gpt-4o-mini",
    provider: {
      type: "openai",
      apiKey: process.env.OPENAI_API_KEY,
    },
  },
  reporters: ["console", "html"],
});
```

### 4. Write your first evaluation

Create `evals/chat.eval.ts`:

```typescript
import { evaluate, expect } from "@evaliphy/sdk";

const sample = {
  query: "What is the return policy?",
  expectedContext: "Items can be returned within 30 days."
};

evaluate("Return Policy Chat", async ({ httpClient }) => {
  // 1. Hit your RAG endpoint
  const res = await httpClient.post('/api/chat', { message: sample.query });
  const data = await res.json();

  // 2. Assert in plain English
  await expect({
    query: sample.query,
    context: sample.expectedContext,
    response: data.answer
  }).toBeFaithful();

  // Or use positional arguments for simplicity
  await expect(sample.query, sample.expectedContext, data.answer).toBeRelevant({ threshold: 0.7 });
});
```

### 5. Run your evaluations

```bash
evaliphy eval
```

---

## Assertions

### LLM assertions

Each LLM assertion is scored from 0.0 to 1.0 by a configurable judge model. It passes if the score meets or exceeds the threshold.

| Assertion        | What it checks                                |
| ---------------- | --------------------------------------------- |
| `toBeFaithful()` | Response is grounded in the retrieved context |
| `toBeRelevant()` | Response addresses the query                  |
| `toBeGrounded()` | Claims are supported by source documents      |
| `toBeCoherent()` | Response is logically consistent              |
| `toBeHarmless()` | Response contains no harmful or toxic content |

All LLM assertions accept an optional config object:

```typescript
await expect({ query, response, context }).toBeFaithful({
  threshold: 0.9, // override global threshold for this assertion
});
```
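
The pass rule and the per-assertion override can be sketched as below. This is an illustration only, not Evaliphy's internal code: `resolveThreshold`, `passes`, and `DEFAULT_THRESHOLD` are hypothetical names, and the precedence (per-assertion value, then global value, then the documented default of `0.7`) is an assumption based on the config comment above.

```typescript
// Hypothetical helper names; Evaliphy's internals may differ.
const DEFAULT_THRESHOLD = 0.7; // default listed in the configuration reference

interface AssertionOptions {
  threshold?: number;
}

// Assumed precedence: per-assertion value, then global value, then the default.
function resolveThreshold(
  perAssertion: AssertionOptions | undefined,
  globalThreshold: number | undefined,
): number {
  return perAssertion?.threshold ?? globalThreshold ?? DEFAULT_THRESHOLD;
}

// An assertion passes when the judge score meets or exceeds the threshold.
function passes(score: number, threshold: number): boolean {
  if (score < 0 || score > 1) {
    throw new RangeError(`score must be within [0, 1], got ${score}`);
  }
  return score >= threshold;
}
```

Under these assumptions, a score of exactly `0.7` passes against a `0.7` threshold, because the comparison is inclusive.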

### Deterministic assertions

Coming in v1. Fast, free, no LLM call required.

---

## Configuration reference

| Field                         | Type   | Default       | Description                     |
| ----------------------------- | ------ | ------------- | ------------------------------- |
| `http.baseUrl`                | string | none          | Base URL of your AI application |
| `http.timeout`                | number | `10000`       | Request timeout in ms           |
| `http.headers`                | object | `{}`          | Headers sent with every request |
| `llmAsJudgeConfig.model`      | string | `gpt-4o-mini` | Judge model                     |
| `llmAsJudgeConfig.threshold`  | number | `0.7`         | Global pass threshold           |
| `llmAsJudgeConfig.promptsDir` | string | none          | Path to custom prompt directory |
| `reporters`                   | array  | `['console']` | Output formats                  |

---

## Supported LLM Providers

Evaliphy uses the [Vercel AI SDK](https://sdk.vercel.ai) under the hood, which means it supports a wide range of LLM providers out of the box. Configure your provider once in `evaliphy.config.ts` and Evaliphy handles the rest.

| Provider | Type key | Required field |
|---|---|---|
| OpenAI | `openai` | `apiKey` |
| Anthropic | `anthropic` | `apiKey` |
| Azure OpenAI | `azure` | `apiKey`, `resourceName` |
| Google Gemini | `google` | `apiKey` |
| Mistral | `mistral` | `apiKey` |
| OpenAI-compatible gateway | `gateway` | `apiKey`, `url` |
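
Because the required fields usually come from environment variables, which may be unset, it can help to check them before a run. The sketch below encodes the table above as data. The `ProviderConfig` type and the `missingProviderFields` helper are hypothetical and not part of the Evaliphy SDK.

```typescript
// Required fields per provider type, copied from the table above.
const REQUIRED_FIELDS = {
  openai: ["apiKey"],
  anthropic: ["apiKey"],
  azure: ["apiKey", "resourceName"],
  google: ["apiKey"],
  mistral: ["apiKey"],
  gateway: ["apiKey", "url"],
} as const;

type ProviderType = keyof typeof REQUIRED_FIELDS;

// Hypothetical shape mirroring the provider objects shown in this README.
interface ProviderConfig {
  type: ProviderType;
  apiKey?: string;
  resourceName?: string;
  url?: string;
}

// Returns the names of required fields that are missing or empty.
function missingProviderFields(provider: ProviderConfig): string[] {
  const required: readonly string[] = REQUIRED_FIELDS[provider.type];
  const values = provider as unknown as Record<string, string | undefined>;
  return required.filter((field) => !values[field]);
}
```

For example, an `azure` provider with only `apiKey` set would report `resourceName` as missing.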

### OpenAI

```typescript
llmAsJudgeConfig: {
  model: 'gpt-4o-mini',
  provider: {
    type: 'openai',
    apiKey: process.env.OPENAI_API_KEY,
  }
}
```

### Anthropic

```typescript
llmAsJudgeConfig: {
  model: 'claude-3-5-haiku-20241022',
  provider: {
    type: 'anthropic',
    apiKey: process.env.ANTHROPIC_API_KEY,
  }
}
```

### OpenAI-compatible gateway (OpenRouter, LiteLLM, etc.)

```typescript
llmAsJudgeConfig: {
  model: 'gpt-4o-mini',
  provider: {
    type: 'gateway',
    url: 'https://openrouter.ai/api/v1',
    apiKey: process.env.OPENROUTER_API_KEY,
  }
}
```

### Azure OpenAI

```typescript
llmAsJudgeConfig: {
  model: 'gpt-4o-mini',
  provider: {
    type: 'azure',
    resourceName: process.env.AZURE_RESOURCE_NAME,
    apiKey: process.env.AZURE_API_KEY,
  }
}
```

Any provider supported by the Vercel AI SDK can be used with Evaliphy. See the [Vercel AI SDK provider documentation](https://sdk.vercel.ai/providers/ai-sdk-providers) for the full list.

---

## Custom prompts

Evaliphy ships with built-in prompts for every assertion. Override any of them by creating a Markdown file in a prompts directory and pointing `promptsDir` at that directory.

```
my-eval-project/
  prompts/
    faithfulness.md    ← overrides built-in faithfulness prompt
```

```typescript
llmAsJudgeConfig: {
  promptsDir: "./prompts",
}
```

Each prompt file uses frontmatter to declare its input variables:

```markdown
---
name: faithfulness
input_variables:
  - question
  - context
  - response
---

You are evaluating a RAG system for a UK e-commerce company.
Faithfulness means every claim traces back to the retrieved context.

## Question
{{question}}

## Context
{{context}}

## Response
{{response}}
```

See the [custom prompts guide](https://evaliphy.com/docs/llm-as-judge#using-custom-prompts) for full documentation.
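
To illustrate how declared input variables map onto `{{...}}` placeholders, here is a minimal rendering sketch. It is not Evaliphy's template engine, which this README does not describe. `renderPrompt` is a hypothetical helper, and leaving undeclared placeholders untouched is an assumption.

```typescript
// Replaces each declared {{name}} placeholder with its value.
// Throws if a declared input variable has no value.
function renderPrompt(
  template: string,
  inputVariables: string[],
  values: Record<string, string>,
): string {
  for (const name of inputVariables) {
    if (!(name in values)) {
      throw new Error(`Missing value for input variable "${name}"`);
    }
  }
  // A function replacer avoids special handling of "$" sequences in values.
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) =>
    inputVariables.includes(name) ? (values[name] ?? match) : match,
  );
}
```

With the prompt file above, the declared variables would be `question`, `context`, and `response`, and each placeholder would be replaced by the matching value before the prompt is sent to the judge.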

---

## CI integration

Evaliphy exits with a non-zero code when any assertion fails, making it compatible with any CI pipeline.
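
The exit-code rule can be pictured with the small sketch below. The `AssertionOutcome` type and `exitCodeFor` helper are hypothetical. The README only states that the code is non-zero on failure, so the value `1` is an assumption.

```typescript
// Hypothetical outcome record for a single assertion.
interface AssertionOutcome {
  name: string;
  passed: boolean;
}

// Returns 0 when every assertion passed, otherwise a non-zero code (1 here).
function exitCodeFor(outcomes: AssertionOutcome[]): number {
  return outcomes.every((outcome) => outcome.passed) ? 0 : 1;
}
```

CI systems treat any non-zero exit code as a failed step, which is why a single failing assertion is enough to fail the job.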

### GitHub Actions

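```yaml
name: Evaliphy
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 24 # matches the Node.js version listed under Prerequisites
      - run: npm ci
      # npx runs the locally installed CLI; assumes @evaliphy/sdk is a dependency in package.json
      - run: npx evaliphy eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          API_KEY: ${{ secrets.API_KEY }}
```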

---

## Reporters

| Reporter  | Output       | Description                             |
| --------- | ------------ | --------------------------------------- |
| `console` | Terminal     | Streams results as tests run            |
| `json`    | `.json` file | Machine-readable, good for CI pipelines |
| `html`    | `.html` file | Self-contained visual report            |
| `csv`     | `.csv` file  | Coming soon                             |
| `xlsx`    | `.xlsx` file | Coming soon                             |

Configure reporters through the `reporters` array in `evaliphy.config.ts`.
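
As a config fragment in the style of the provider examples, enabling the JSON report alongside the console and HTML output might look like the following. It assumes reporter names from the table are passed as strings, as in the Quick start config, which uses `["console", "html"]`.

```typescript
// Fragment of evaliphy.config.ts; reporter names are taken from the table above.
reporters: ["console", "json", "html"],
```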

---

## How it works

1. Your eval file makes an HTTP call to your real running API

2. The response and context are passed to the assertion

3. The assertion sends a rendered prompt to the judge model

4. The judge scores the response 0.0 to 1.0

5. The score is compared against the threshold — pass or fail

6. Results are written to all configured reporters
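
Steps 3 to 6 above can be sketched as a small function. This is an illustration under stated assumptions, not Evaliphy's implementation. All names (`runAssertion`, `Judge`, `Reporter`) are hypothetical, the prompt layout is borrowed from the custom prompt example, and the judge is a plain synchronous function so the flow runs without a model call. A real judge would be an asynchronous LLM request.

```typescript
interface EvalInput {
  query: string;
  context: string;
  response: string; // in practice, taken from your API's HTTP response (steps 1 and 2)
}

interface EvalResult {
  name: string;
  score: number;
  threshold: number;
  passed: boolean;
}

// Injected stand-ins for the judge model and the configured reporters.
type Judge = (prompt: string) => number;
type Reporter = (result: EvalResult) => void;

function runAssertion(
  name: string,
  input: EvalInput,
  judge: Judge,
  threshold: number,
  reporters: Reporter[],
): EvalResult {
  // Step 3: render a prompt from the query, context, and response.
  const prompt = [
    `## Question\n${input.query}`,
    `## Context\n${input.context}`,
    `## Response\n${input.response}`,
  ].join("\n\n");

  // Step 4: the judge returns a score between 0.0 and 1.0.
  const score = judge(prompt);

  // Step 5: compare the score against the threshold.
  const result: EvalResult = { name, score, threshold, passed: score >= threshold };

  // Step 6: hand the result to every configured reporter.
  for (const report of reporters) {
    report(result);
  }
  return result;
}
```

Injecting the judge and reporters keeps the sketch testable: a stub judge that returns a fixed score is enough to exercise the pass and fail paths.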

---

## Why Evaliphy

**It fits where your tests already live.** Eval files are TypeScript files that sit in your repo alongside your other tests. No Python notebooks, no complex setup, no new workflow to learn.

**You test your real API.** Evaliphy makes HTTP calls to your actual running service — not a mocked response or an offline dataset. If your AI system breaks in production, Evaliphy catches it.

**The judges are built in.** Faithfulness, relevance, groundedness — the assertions that matter are shipped with the framework. No prompt writing or LLM wiring required.

**Configurable when you need it.** Sensible defaults out of the box. Override the judge model globally, per file, or per assertion. Bring your own prompts for domain-specific evaluation.

---

## Project structure

After running `evaliphy init`, your project looks like this:

```
my-eval-project/
  evals/
    example.eval.ts       # sample evaluation to get you started
  prompts/                # optional custom prompt overrides
  evaliphy.config.ts      # main configuration file
  .env.example            # environment variable template
  package.json
  tsconfig.json
```

---

## Beta

Evaliphy is in open beta. The API may change between versions. We are looking for feedback from engineers and teams building AI applications.

- Free for commercial use during beta

- Influence the v1.0 roadmap directly

- Contribute to the growing assertion library

[Submit feedback](https://forms.gle/9ztrqUCXUg2YGSJJA)

---

## Contributing

Contributions are welcome. Please read the [contributing guide](./CONTRIBUTING.md) before opening a pull request.

---


## License

MIT © [Evaliphy](https://github.com/evaliphy/evaliphy)
