https://github.com/hamelsmu/evals-skills

Skills for AI Evals to complement the course: AI Evals For Engineers & PMs

Host: GitHub
URL: https://github.com/hamelsmu/evals-skills
Owner: hamelsmu
License: mit
Created: 2026-03-01T21:20:09.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-03-03T23:15:08.000Z (5 months ago)
Last Synced: 2026-03-28T01:16:05.733Z (4 months ago)
Homepage: https://maven.com/parlance-labs/evals?promoCode=evals-info-url
Size: 56.6 KB
Stars: 999
Watchers: 9
Forks: 104
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-agent-skills - evals-skills - Evaluation skills for testing AI systems, reviewing outputs, and designing repeatable eval workflows. (Testing and QA Skills)
awesome-claude-code - **evals-skills** - (1.4k ⭐) - Skills for AI evaluation workflows and the AI Evals for Engineers course. (🧠 Agent Skills)

README

# Eval Skills for AI Coding Agents

Skills that guide AI coding agents to help you build LLM evaluations.

These skills guard against common mistakes I've seen while helping 50+ companies and teaching students in our [AI Evals course](https://maven.com/parlance-labs/evals?promoCode=evals-info-url). If you're new to evals, see [questions.md](questions.md) for free resources on the fundamentals.

## New to Evals? Start Here

If you are new to evals, start with the `eval-audit` skill. Give your coding agent these instructions:

> Install the eval skills plugin from https://github.com/hamelsmu/evals-skills, then run /evals-skills:eval-audit on my eval pipeline. Investigate each diagnostic area using a separate subagent in parallel, then synthesize the findings into a single report. Use other skills in the plugin as recommended by the audit.

The audit isn't a complete solution, but it will catch common problems we've seen in evals. It will also recommend other skills to use to fix the problems.

## Installation

In Claude Code, run these two commands:

```bash
# Step 1: Register the plugin repository
/plugin marketplace add hamelsmu/evals-skills

# Step 2: Install the plugin
/plugin install evals-skills@hamelsmu-evals-skills
```

To upgrade:

```bash
/plugin update evals-skills@hamelsmu-evals-skills
```

After installation, restart Claude Code. The skills will appear under the `/evals-skills:` prefix, e.g., `/evals-skills:eval-audit`.

## Installation (npx skills)

If you use the open Skills CLI, install from this repo with:

```bash
npx skills add https://github.com/hamelsmu/evals-skills
```

Install one skill only:

```bash
npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit
```

Check for updates and apply them:

```bash
npx skills check
npx skills update
```

## Available Skills

| Skill | What it does |
|-------|-------------|
| eval-audit | Audit an eval pipeline and surface problems with prioritized severity |
| error-analysis | Guide the user through reading traces and categorizing failures |
| generate-synthetic-data | Create diverse synthetic test inputs using dimension-based tuple generation |
| write-judge-prompt | Design LLM-as-Judge evaluators for subjective quality criteria |
| validate-evaluator | Calibrate LLM judges against human labels using data splits, TPR/TNR, and bias correction |
| evaluate-rag | Evaluate retrieval and generation quality in RAG pipelines |
| build-review-interface | Build custom annotation interfaces for human trace review |

Invoke a skill with `/evals-skills:skill-name`, e.g., `/evals-skills:error-analysis`.
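
To give a feel for what "dimension-based tuple generation" in `generate-synthetic-data` can mean, here is a minimal Python sketch. It is not the skill's implementation: the dimension names, their values, and the helpers `generate_tuples` and `tuple_to_prompt` are all hypothetical and only illustrate the general idea of crossing a few dimensions and turning each combination into a prompt for writing a test input.

```python
import random
from itertools import product

# Hypothetical dimensions for a customer-support assistant (an assumption,
# not taken from the skill). Replace with dimensions from your own product.
DIMENSIONS = {
    "persona": ["new user", "power user"],
    "intent": ["refund request", "billing question", "bug report"],
    "tone": ["polite", "frustrated"],
}


def generate_tuples(dimensions, sample_size=None, seed=0):
    """Return one dict per combination of dimension values.

    If sample_size is given and smaller than the number of combinations,
    return a reproducible random sample instead of the full cross product.
    """
    names = list(dimensions)
    combos = [dict(zip(names, values)) for values in product(*dimensions.values())]
    if sample_size is not None and sample_size < len(combos):
        combos = random.Random(seed).sample(combos, sample_size)
    return combos


def tuple_to_prompt(t):
    """Turn one tuple into an instruction asking an LLM to write a test input."""
    return (
        f"Write a message from a {t['persona']} with a {t['tone']} tone "
        f"whose goal is: {t['intent']}."
    )


if __name__ == "__main__":
    for t in generate_tuples(DIMENSIONS, sample_size=3):
        print(tuple_to_prompt(t))
```

Sampling from the cross product rather than asking a model for "diverse inputs" directly keeps coverage explicit: you can see which combinations exist and which ones you chose to test.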

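The TPR/TNR and bias-correction idea behind `validate-evaluator` can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's code: the function names are hypothetical, labels are assumed to be booleans where `True` means "pass", and the correction shown is the standard misclassification adjustment, which may differ from what the skill actually applies.

```python
def tpr_tnr(human, judge):
    """Compute the judge's true positive and true negative rates.

    human, judge: equal-length lists of bools, True meaning "pass".
    Human labels are treated as ground truth.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    pos = sum(human)
    neg = len(human) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one human pass and one human fail")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    return tp / pos, tn / neg


def corrected_pass_rate(observed, tpr, tnr):
    """Adjust a judge-reported pass rate for the judge's known error rates.

    Solves observed = theta * tpr + (1 - theta) * (1 - tnr) for theta,
    then clips the result to [0, 1].
    """
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction is undefined")
    estimate = (observed + tnr - 1) / denom
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical held-out labels, made up for illustration only.
    human = [True, True, True, True, False, False, False, False]
    judge = [True, True, True, False, False, False, False, True]
    tpr, tnr = tpr_tnr(human, judge)
    # 0.6 is a made-up pass rate the judge reported on unlabeled traces.
    print(tpr, tnr, corrected_pass_rate(0.6, tpr, tnr))
```

Measure TPR and TNR on a labeled split that was not used while writing the judge prompt; otherwise the rates will look better than they are and the correction will be too small.
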
## Write Your Own Skills

These skills are a starting point and only encode common mistakes that generalize across projects. Skills grounded in your stack, your domain, and your data will outperform them. Start here, then write your own.

The [meta-skill](meta-skill.md) can help you ground custom skills.

## Beyond These Skills

These skills handle the parts of eval work that generalize across projects. Much of the process doesn't: production monitoring, CI/CD integration, data analysis, and much more. The [course](https://maven.com/parlance-labs/evals?promoCode=evals-info-url) covers all of it.
