https://github.com/sayedihashimi/copilot-skill-eval
- Host: GitHub
- URL: https://github.com/sayedihashimi/copilot-skill-eval
- Owner: sayedihashimi
- License: mit
- Created: 2026-03-17T18:40:27.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-26T17:24:56.000Z (2 months ago)
- Last Synced: 2026-03-27T07:20:20.165Z (2 months ago)
- Language: Python
- Size: 48.6 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Copilot Skill Evaluation Framework
Evaluate how **GitHub Copilot custom skills** impact code generation quality. Generate the same apps under different skill configurations, verify they build and run, then produce a weighted comparative analysis — for **any tech stack**.
## Install
```bash
pipx install git+https://github.com/sayedihashimi/copilot-skill-eval
```
> **Prerequisites:** Python 3.10+, the [GitHub Copilot CLI](https://docs.github.com/en/copilot) (`copilot` command), and your tech stack's build tools (e.g., `dotnet`, `npm`, `go`).
>
> Don't have pipx? Install it first:
> ```bash
> pip install pipx
> pipx ensurepath
> ```
> Then restart your terminal.
## How It Works
You define what to evaluate in two files:
| File | Purpose |
|------|---------|
| **`eval.yaml`** | Scenarios (apps to generate), configurations (skill sets to compare), verification commands, and analysis dimensions |
| **`skill-sources.yaml`** | Where to get skills/plugins — local paths or git repos (cloned and cached automatically) |
The framework then runs a three-stage pipeline:
1. **Generate** — Builds the same apps with each skill configuration using Copilot CLI
2. **Verify** — Compiles, format-checks, security-scans, and runs each generated project
3. **Analyze** — Scores quality across weighted dimensions and produces a comparative report
## Quick Start
### 1. Initialize a project
```bash
skill-eval init
```
The interactive wizard walks you through creating `eval.yaml`, `skill-sources.yaml`, and starter scenario prompt files. It asks about your tech stack, scenarios, and skill sources (local or remote git repos).
### 2. Edit your scenario prompts
Scenario prompts in `prompts/scenarios/` describe the apps Copilot should generate. The more specific the prompt, the more meaningful the comparison between configurations. A good prompt includes:
| Section | Purpose |
|---------|---------|
| **Overview** | What the app does in 2–3 sentences |
| **Technology Stack** | Exact frameworks and versions |
| **Entities** | Domain models with fields and relationships |
| **Business Rules** | Non-trivial constraints and logic |
| **Endpoints** | API routes or pages to generate |
See the [Scenario Authoring Guide](docs/authoring-guide.md) for full details and examples.
### 3. Run the evaluation
```bash
skill-eval run
```
Reports land in `reports/`. The main output is `reports/analysis.md` with per-dimension scores and a weighted comparison across all configurations.
## Configuration
### `eval.yaml`
```yaml
name: "My Skill Evaluation"
description: "Evaluate my React skills"
runs: 3 # runs per configuration (increases reliability)
generation_model: "claude-opus-4.6" # model for code generation
analysis_model: "gpt-5.3-codex" # model for comparative analysis
scenarios:
  - name: Dashboard
    prompt: prompts/scenarios/dashboard.prompt.md
    description: "Admin dashboard with charts and data tables"
  - name: Blog
    prompt: prompts/scenarios/blog.prompt.md
    description: "Blog with comments and user authentication"
configurations:
  - name: no-skills
    label: "Baseline (no skills)"
    skills: []
    plugins: []
  - name: my-react-skill
    label: "My React Skill"
    skills:
      - source: react-components # references skill-sources.yaml
verification:
  build:
    command: "npm run build"
    success_pattern: "compiled successfully"
  format:
    command: "npm run lint"
  security:
    vulnerability_scan: true
  run:
    command: "npm start"
    timeout_seconds: 10
    health_check:
      url: "http://localhost:3000"
      expected_status: 200
dimensions:
  - name: "Component Architecture"
    description: "Component composition and reusability"
    what_to_look_for: "Check for atomic design, custom hooks, compound components."
    why_it_matters: "Good architecture enables reuse and simplifies testing."
    tier: high # critical (3×), high (2×), medium (1×), low (0.5×)
    evaluation_method: llm # llm, automated, or hybrid
```
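The `verification.build` entry above pairs a `command` with a `success_pattern`. As a rough illustration of what such a check could look like, here is a minimal Python sketch. The function name `check_build`, the regex interpretation of `success_pattern`, and the default timeout are assumptions made for this example, not the framework's actual implementation.

```python
import re
import shlex
import subprocess
import sys


def check_build(command, success_pattern=None, timeout_seconds=120):
    """Run a build command and report whether it looks successful.

    Assumption: success means exit code 0 and, when a pattern is given,
    the pattern (treated as a regex) appears in the combined output.
    """
    try:
        result = subprocess.run(
            shlex.split(command),  # POSIX-style splitting of the command string
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command not found, not executable, or ran past the timeout.
        return False
    if result.returncode != 0:
        return False
    if success_pattern is None:
        return True
    return re.search(success_pattern, result.stdout + result.stderr) is not None


# Stand-in "build" that only prints the marker text, so the demo needs no npm.
demo_command = f'"{sys.executable}" -c "print(\'compiled successfully\')"'
print(check_build(demo_command, "compiled successfully"))
```

A real build would pass something like `"npm run build"` as `command`. Matching on output in addition to the exit code guards against tools that exit 0 even when the build did not complete.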
Configurations can reference skills/plugins in two ways:
- **Source reference** — `{source: my-source}` or `{source: my-source, path: subfolder}` (looks up `skill-sources.yaml`)
- **Local path** — `skills/my-skill` (plain string, relative to project root)
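
To make the two reference forms concrete, the sketch below shows one way an entry in `skills:` might be turned into a directory. The helper `resolve_skill_ref` and the shape of the `sources` mapping (source name to an already-resolved base directory) are hypothetical and exist only for illustration; the framework's own resolver may work differently.

```python
from pathlib import Path


def resolve_skill_ref(ref, sources, project_root):
    """Map one `skills:` entry to a filesystem path.

    ref          -- a plain string (local path) or a dict with `source`
                    and an optional `path`
    sources      -- hypothetical mapping: source name -> base directory
    project_root -- directory that local paths are relative to
    """
    if isinstance(ref, str):
        # Local path form: relative to the project root.
        return Path(project_root) / ref
    if isinstance(ref, dict) and "source" in ref:
        name = ref["source"]
        if name not in sources:
            raise KeyError(f"unknown skill source: {name}")
        # Source reference form: optional subfolder inside the source.
        return Path(sources[name]) / ref.get("path", ".")
    raise ValueError(f"unsupported skill reference: {ref!r}")


sources = {"my-source": "/cache/my-source"}
print(resolve_skill_ref("skills/my-skill", sources, "/proj"))
print(resolve_skill_ref({"source": "my-source", "path": "subfolder"}, sources, "/proj"))
```

Failing loudly on an unknown source name mirrors what `skill-eval validate-config` is described as checking ("sources resolve").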
### `skill-sources.yaml`
Define where to get your skills and plugins:
```yaml
sources:
  - name: react-components
    type: git
    url: https://github.com/myorg/react-copilot-skill
    ref: main # optional: branch or tag
    path: "." # optional: subfolder within repo
  - name: my-local-skills
    type: local
    path: ./my-skills # relative to project root
```
Git sources are cloned into `~/.skill-eval/cache/` on first use and updated on subsequent runs. Override the cache location with `--cache-dir` or set `cache_dir` in `skill-sources.yaml`.
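
The clone-on-first-use, update-afterwards behavior can be pictured with a small planning function. This is a sketch under stated assumptions: the helper `plan_git_sync`, the one-directory-per-source-name cache layout, and the exact git commands are all hypothetical, since the README does not describe how the cache is organized or refreshed.

```python
from pathlib import Path


def plan_git_sync(source, cache_dir="~/.skill-eval/cache"):
    """Return (checkout_dir, git commands) for one `type: git` source.

    Assumed layout: each source lives in <cache_dir>/<source name>.
    The commands are only planned here, not executed.
    """
    checkout = Path(cache_dir).expanduser() / source["name"]
    if not checkout.exists():
        # First use: clone, pinning the branch or tag when `ref` is set.
        command = ["git", "clone"]
        if source.get("ref"):
            command += ["--branch", source["ref"]]
        command += [source["url"], str(checkout)]
        return checkout, [command]
    # Later runs: fast-forward the existing checkout. This suits branch
    # refs; a tag ref would need a fetch plus checkout instead.
    return checkout, [["git", "-C", str(checkout), "pull", "--ff-only"]]


source = {
    "name": "react-components",
    "url": "https://github.com/myorg/react-copilot-skill",
    "ref": "main",
}
checkout_dir, commands = plan_git_sync(source)
for command in commands:
    print(" ".join(command))
```

Returning the commands instead of running them keeps the decision (clone or update) easy to inspect before anything touches the network.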
### Analysis Dimensions
Dimensions are the quality criteria used to score and compare generated code:
| Field | Purpose |
|-------|---------|
| `name` | Short label (e.g., "Error Handling") |
| `description` | What this dimension measures |
| `what_to_look_for` | Concrete things to check in the code |
| `why_it_matters` | Why this matters for production quality |
| `tier` | Weight tier: `critical` (3×), `high` (2×), `medium` (1×), `low` (0.5×) |
| `evaluation_method` | How it's evaluated: `llm`, `automated`, `hybrid` |
**Tips:**
- Start with 5–10 dimensions — you can always add more
- Be specific in `what_to_look_for` — vague criteria produce vague scores
- `skill-eval init` suggests stack-specific dimensions automatically
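
To see how the tier multipliers could shape a final number, here is a minimal sketch that combines per-dimension scores into a weighted mean. Only the multipliers (3×, 2×, 1×, 0.5×) come from the table above; the function `weighted_score`, the use of a weighted mean, the dimension names, and the score scale are assumptions, because the README does not spell out the aggregation formula.

```python
# Tier multipliers as documented in the dimensions table.
TIER_WEIGHTS = {"critical": 3.0, "high": 2.0, "medium": 1.0, "low": 0.5}


def weighted_score(scores, tiers):
    """Weighted mean of per-dimension scores (assumed aggregation).

    scores -- dimension name -> numeric score (scale is an assumption)
    tiers  -- dimension name -> tier name from TIER_WEIGHTS
    """
    if not scores:
        raise ValueError("no dimensions to score")
    total_weight = sum(TIER_WEIGHTS[tiers[name]] for name in scores)
    weighted_sum = sum(
        score * TIER_WEIGHTS[tiers[name]] for name, score in scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical dimensions and scores for one configuration.
scores = {"Error Handling": 4, "Naming": 2}
tiers = {"Error Handling": "critical", "Naming": "medium"}
print(weighted_score(scores, tiers))  # (4*3 + 2*1) / (3 + 1) = 3.5
```

Under this reading, a one-point gain on a `critical` dimension moves the result six times as much as the same gain on a `low` dimension, which is why tier assignment deserves as much care as the scoring criteria.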
## CLI Reference
### Pipeline commands
```bash
skill-eval run # Full pipeline: generate → verify → analyze
skill-eval run --runs 5 # 5 runs per config for higher reliability
skill-eval run --skip-generate # Re-verify + re-analyze existing output
skill-eval run --analyze-only # Re-analyze only
skill-eval run -c my-skill -c baseline # Only run specific configurations
skill-eval run --generation-model claude-sonnet-4 # Override generation model
skill-eval run -m gpt-5.3-codex # Override analysis model
skill-eval generate # Generate code only
skill-eval generate --resume # Skip runs where output already exists
skill-eval generate --generation-model claude-opus-4.6 # Specify generation model
skill-eval verify # Verify builds only
skill-eval analyze # Analyze only
skill-eval analyze -m gpt-5.3-codex # Specify analysis model
```
### Setup & validation
```bash
skill-eval init # Interactive project setup
skill-eval validate-config # Check config is valid, sources resolve
skill-eval ci-setup # Generate GitHub Actions workflow
skill-eval ci-setup --schedule "0 6 * * 1" # With weekly schedule
```
### Global options
```bash
skill-eval --config path/to/eval.yaml run # Custom config path
skill-eval --skill-sources path/to/skill-sources.yaml run # Custom sources path
skill-eval --cache-dir /tmp/skill-cache run # Custom cache directory
skill-eval --output-dir ./my-output run # Custom output directory
skill-eval --reports-dir ./my-reports run # Custom reports directory
skill-eval --project-root /path/to/project run # Custom project root
```
## CI / GitHub Actions
Generate a workflow that runs evaluations in CI:
```bash
skill-eval ci-setup
```
This creates `.github/workflows/skill-eval.yml` with:
- Manual trigger via `workflow_dispatch` (with skip/analyze-only options)
- GitHub Copilot CLI installation and verification
- Python + skill-eval installation
- Config validation → pipeline execution
- Report and output artifact upload
**Required secret:** The workflow expects a `COPILOT_TOKEN` repository secret — a [fine-grained personal access token](https://github.com/settings/personal-access-tokens/new) with the **"Copilot Requests"** permission. The default `GITHUB_TOKEN` does not have Copilot access.
Options: `--runs-on`, `--python-version`, `--schedule`, `--timeout`.
## Using the Agent
If you have the Copilot agent feature, you can use `@skill-eval` for a conversational workflow:
```
@skill-eval I wrote a custom skill for ASP.NET Core Web APIs.
I want to see if it actually improves code generation quality.
```
The agent handles setup, execution, and interpretation. See the [Agent Guide](docs/agent-guide.md) for details.
## Examples
The `examples/` directory contains complete evaluation projects you can run immediately to see the framework in action.
| Example | Scenarios | Configurations | Dimensions |
|---------|-----------|---------------|------------|
| **[aspnet-webapi](examples/aspnet-webapi/)** | Fitness Studio API, Library API, Vet Clinic API | 5 configs (baseline + 4 skill sets) | 24 dimensions across 4 tiers |
| **[aspnet-razor-pages](examples/aspnet-razor-pages/)** | Event Registration, Property Management, Employee Directory | 5 configs (baseline + 4 skill sets) | 23 dimensions across 4 tiers |
Each example includes `eval.yaml`, `skill-sources.yaml` (with git references to external repos), and detailed scenario prompts.
### Running an example
```bash
# Clone the repo
git clone https://github.com/sayedihashimi/copilot-skill-eval
cd copilot-skill-eval
# Install the tool
pipx install .
# Navigate to an example and validate the config
# (this also clones the skill repos into ~/.skill-eval/cache/)
cd examples/aspnet-webapi
skill-eval validate-config
# Run the full pipeline
skill-eval run
# Or run individual steps
skill-eval generate # Generate code only
skill-eval generate -c dotnet-webapi # Generate just one configuration
skill-eval verify # Verify builds only
skill-eval analyze # Analyze only
# Run with more passes for higher reliability
skill-eval run --runs 5
```
Reports are written to `reports/`. The main output is `reports/analysis.md`.
## Further Reading
- [Scenario Authoring Guide](docs/authoring-guide.md) — how to write effective scenario prompts
- [Architecture & Internals](docs/architecture.md) — pipeline details, session tracing, source resolution
- [Agent Guide](docs/agent-guide.md) — using the conversational `@skill-eval` agent
## License
See [LICENSE](LICENSE) for details.