{"id":45082887,"url":"https://github.com/huggingface/upskill","last_synced_at":"2026-03-04T11:00:27.424Z","repository":{"id":333926971,"uuid":"1138991978","full_name":"huggingface/upskill","owner":"huggingface","description":"Generate and evaluate agent skills for code agents like Claude Code, Open Code, OpenAI Codex","archived":false,"fork":false,"pushed_at":"2026-02-18T16:08:30.000Z","size":298,"stargazers_count":334,"open_issues_count":9,"forks_count":40,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-02-18T17:53:59.007Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-21T11:32:04.000Z","updated_at":"2026-02-18T14:08:03.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/huggingface/upskill","commit_stats":null,"previous_names":["huggingface/upskill"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/upskill","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fupskill","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fupskill/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fupskill/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/huggingface%2Fupskill/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/upskill/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fupskill/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30078397,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T08:01:56.766Z","status":"ssl_error","status_checked_at":"2026-03-04T08:00:42.919Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-19T15:00:33.006Z","updated_at":"2026-03-04T11:00:27.400Z","avatar_url":"https://github.com/huggingface.png","language":"Python","funding_links":[],"categories":["Phase 4: Benchmarks and Research","Skill Builders"],"sub_categories":["Benchmarks and Evaluation"],"readme":"\u003cimg width=\"1920\" height=\"1080\" alt=\"upskill_banner\" src=\"https://github.com/user-attachments/assets/b71fd417-7d23-4f5d-aa89-06ea6b284d1b\" /\u003e\n\n# UPskill\n\nGenerate and evaluate agent skills based on traces with agents. 
Create skills with teacher models (expensive/slow) that student models (cheap/fast) can use to perform harder tasks reliably.\n\n## Quick Start\n\nInstall upskill:\n\n```bash\npip install upskill\n# or just use uv\nuvx upskill\n```\n\nCreate a new skill:\n\n```bash\nupskill generate \"write good git commit messages\"\n# or based on previous agent traces\nupskill generate \"document the pattern\" --from ./trace.md\n# Skills are saved to ./skills/{skill-name}/ by default\n```\n\nGenerate a skill with a teacher model and evaluate it on a student model.\n\n```bash\nupskill generate \"write good git commit messages\" --model sonnet --eval-model haiku\n```\n\nBenchmark a set of models against a skill.\n\n```bash\nupskill eval ./skills/git-commit-messages/ -m haiku -m sonnet\n# logs pretty printed to the terminal\n```\n\nView the results later.\n\n```bash\nupskill runs --skill git-commit-messages\n```\n\n## Model Handling Overview\n\nupskill uses distinct phases with explicit model roles:\n\n- **Skill generation**: create/refine `SKILL.md`\n- **Test generation**: create synthetic evaluation cases\n- **Evaluation**: run tests against evaluator model(s)\n- **Benchmark**: repeated evaluation across multiple runs/models\n\nModel flags by command:\n\n| Command | Flag | Meaning |\n|---|---|---|\n| `generate` | `--model` | Skill generation/refinement model |\n| `generate` | `--test-gen-model` | Test generation model override |\n| `generate` | `--eval-model` | Optional extra cross-model eval pass |\n| `eval` | `-m/--model` | Evaluation model(s) (repeatable) |\n| `eval` | `--test-gen-model` | Test generation model override (when tests are generated) |\n| `benchmark` | `-m/--model` | Evaluation model(s) to benchmark |\n| `benchmark` | `--test-gen-model` | Test generation model override (when tests are generated) |\n| `runs` / `plot` | `-m/--model` | Historical results filter only |\n\n`upskill eval` enters **benchmark mode** whenever you pass multiple `-m` values or `--runs \u003e 
1`.\nIn benchmark mode, baseline comparison is always off; `--no-baseline` is redundant.\n\n## Commands\n\n### `upskill generate`\n\nGenerate a skill from a task description with automatic evaluation and refinement.\n\n```bash\nupskill generate TASK [OPTIONS]\n```\n\n**Arguments:**\n- `TASK` - Description of what the skill should teach\n\n**Options:**\n- `-e, --example` - Input -\u003e output example (can be repeated)\n- `--tool` - Generate from MCP tool schema (path#tool_name)\n- `-f, --from PATH` - Improve from existing skill dir or agent trace file (auto-detected)\n- `-m, --model MODEL` - Skill generation model (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514')\n- `--test-gen-model MODEL` - Override test generation model for this run\n- `-o, --output PATH` - Output directory for skill\n- `--no-eval` - Skip evaluation and refinement\n- `--eval-model MODEL` - Different model to evaluate skill on\n- `--runs-dir PATH` - Directory for run logs (default: ./runs)\n- `--log-runs / --no-log-runs` - Log run data (default: enabled)\n\n**Examples:**\n\n```bash\n# Basic usage\nupskill generate \"parse JSON Schema files\"\n\n# Make and evaluate skills for less powerful models\nupskill generate \"write git commits\" --model sonnet --eval-model haiku\n\n# Improve an existing skill (auto-detected as directory)\nupskill generate \"add more error handling examples\" --from ./skills/api-errors/\n\n# Generate from an agent trace file (auto-detected as file)\nupskill generate \"document the pattern\" --from ./trace.json\n\n# Skip evaluation during generation (evaluate separately with upskill eval)\nupskill generate \"parse YAML\" --no-eval\n```\n\n**Output:**\n\n```\nGenerating skill with sonnet...\nGenerating test cases...\nEvaluating on sonnet... 
(attempt 1)\n  60% -\u003e 100% (+40%) OK\n\n  git-commit-messages\n  Write clear, conventional commit messages that follow best practices.\n\n  SKILL.md              ~450 tokens\n\n  baseline   ████████████░░░░░░░░   60%\n  with skill ████████████████████  100%  (+40%)\n\n  tokens: 1200 → 800  (-33%)\n\nSaved to ./skills/git-commit-messages\n```\n\n### `upskill eval`\n\nEvaluate an existing skill against test cases. Supports single-model evaluation with baseline comparison, or multi-model benchmarking.\n\n```bash\nupskill eval SKILL_PATH [OPTIONS]\n```\n\n**Arguments:**\n- `SKILL_PATH` - Path to skill directory containing SKILL.md\n\n**Options:**\n- `-t, --tests PATH` - Test cases JSON file\n- `-m, --model MODEL` - Model(s) to evaluate against (repeatable for multi-model benchmarking)\n- `--test-gen-model MODEL` - Override test generation model when tests must be generated\n- `--runs N` - Number of runs per model (default: 1)\n- `--no-baseline` - Skip baseline comparison (simple eval mode only; ignored in benchmark mode)\n- `-v, --verbose` - Show per-test results\n- `--log-runs / --no-log-runs` - Log run data (default: enabled)\n- `--runs-dir PATH` - Directory for run logs\n\n**Examples:**\n\n```bash\n# Basic evaluation with baseline comparison\nupskill eval ./skills/my-skill/\n\n# With verbose output\nupskill eval ./skills/my-skill/ -v\n\n# Custom test cases\nupskill eval ./skills/my-skill/ --tests ./tests.json\n\n# Evaluate on specific model\nupskill eval ./skills/my-skill/ -m haiku\n\n# Multi-model benchmarking (compare models)\nupskill eval ./skills/my-skill/ -m haiku -m sonnet\n\n# Multiple runs per model for statistical significance\nupskill eval ./skills/my-skill/ -m haiku -m sonnet --runs 5\n\n# Evaluate a local model configured in fast-agent\nupskill eval ./skills/my-skill/ -m generic.my-model\n\n# Skip baseline (just test with skill)\nupskill eval ./skills/my-skill/ --no-baseline\n\n# Benchmark mode is triggered by multiple models OR --runs \u003e 
1\nupskill eval ./skills/my-skill/ -m haiku --runs 5\n\n# Disable run logging\nupskill eval ./skills/my-skill/ --no-log-runs\n```\n\n**Benchmark output:**\n\n```\nEvaluating my-skill across 2 model(s)\n  3 test case(s), 5 run(s) per model\n\nhaiku\n  Pass rate: 4/5 (80%)  Avg assertions: 2.8/3\n\nsonnet\n  Pass rate: 5/5 (100%)  Avg assertions: 3.0/3\n\n┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ Model  ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃\n┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ haiku  │ 4/5       │ 2.8/3          │ 1250       │\n│ sonnet │ 5/5       │ 3.0/3          │ 1890       │\n└────────┴───────────┴────────────────┴────────────┘\n```\n\n**Test cases JSON format:**\n\n```json\n[\n  {\"input\": \"Write a commit for adding login\", \"expected\": {\"contains\": [\"feat\", \"login\"]}},\n  {\"input\": \"Fix the null pointer bug\", \"expected\": {\"contains\": [\"fix\", \"bug\"]}}\n]\n```\n\n### `upskill list`\n\nList all generated skills in a tree view.\n\n```bash\nupskill list [OPTIONS]\n```\n\n**Options:**\n- `-d, --dir PATH` - Skills directory to list\n- `-v, --verbose` - Show skill contents preview\n\n**Examples:**\n\n```bash\n# List skills in default directory\nupskill list\n\n# List from custom directory\nupskill list -d ./my-skills/\n\n# Show preview of skill contents\nupskill list -v\n```\n\n**Output:**\n\n```\n./skills\n├── git-commit-messages\n│   ├── Write clear, conventional commit messages...\n│   └── files\n│       └── SKILL.md\n├── api-error-handling\n│   ├── Handle API errors gracefully with proper logging...\n│   └── files\n│       ├── SKILL.md\n│       └── references/error-codes.md\n└── yaml-parsing\n    ├── Parse YAML files safely with schema validation...\n    └── files\n        ├── SKILL.md\n        └── scripts/validate.py\n```\n\n### `upskill runs`\n\nView run results as a plot, or export to CSV. 
By default, shows a visual comparison of baseline vs with-skill performance.\n\n```bash\nupskill runs [OPTIONS]\n```\n\n**Options:**\n- `-d, --dir PATH` - Runs directory\n- `-s, --skill TEXT` - Filter by skill name(s) (repeatable)\n- `-m, --model TEXT` - Filter historical run data by model(s) (repeatable)\n- `--metric [success|tokens]` - Metric to display (default: success)\n- `--csv PATH` - Export to CSV instead of plot\n\n**Examples:**\n\n```bash\n# View results plot (default)\nupskill runs\n\n# Filter by skill and models\nupskill runs -s my-skill -m haiku -m sonnet\n\n# Show token usage instead of success rate\nupskill runs --metric tokens\n\n# Export to CSV\nupskill runs --csv ./results.csv\n\n# Custom runs directory\nupskill runs -d ./my-runs/\n```\n\n**Plot output:**\n\n```\nskill: git-commit-messages\n\nhaiku\n  baseline   ████████████░░░░░░░░   60%\n  with skill ████████████████░░░░   80%  (+20%)\n\nsonnet\n  baseline   ████████████░░░░░░░░   60%\n  with skill ████████████████████  100%  (+40%)\n```\n\n**Matrix view (multiple skills and models):**\n\n```\n┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓\n┃ skill               ┃ haiku        ┃ sonnet       ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩\n│ git-commit-messages │ 60%→80%      │ 60%→100%     │\n│ api-error-handling  │ 40%→70%      │ 50%→90%      │\n│ yaml-parsing        │ 70%→90%      │ 80%→100%     │\n└─────────────────────┴──────────────┴──────────────┘\n```\n\n## Skill Output Format\n\nSkills are saved in a standard directory format:\n\n```\n./skills/{skill-name}/\n├── SKILL.md          # Main skill instructions\n├── references/       # Supporting documents (optional)\n└── scripts/          # Executable scripts (optional)\n```\n\n**Example SKILL.md:**\n\n```markdown\n# git-commit-messages\n\nWrite clear, conventional commit messages that follow best practices.\n\n## Instructions\n\nThis skill teaches how to write effective git commit messages\nfollowing the Conventional Commits 
specification.\n\n## Format\n\nCommit messages should follow this structure:\n\n\u003ctype\u003e(\u003cscope\u003e): \u003csubject\u003e\n\n\u003cbody\u003e\n\n\u003cfooter\u003e\n\n## Types\n\n- `feat`: New feature\n- `fix`: Bug fix\n- `docs`: Documentation changes\n...\n\n## Examples\n\n### Simple feature commit\nfeat(auth): add password reset functionality\n\n### Bug fix with explanation\nfix(api): handle null response from user service\n\nThe user service can return null when not found.\nAdded proper null checking to prevent crashes.\n\nCloses #123\n```\n\n## Run Logging\n\nBy default, upskill logs all runs to `./runs/`. Each run creates:\n\n```\n./runs/\n├── 2025_01_21_15_30/           # Batch folder (timestamp)\n│   ├── run_1/\n│   │   ├── run_metadata.json   # Model, task, timing\n│   │   └── run_result.json     # Pass/fail, assertions, tokens\n│   ├── run_2/\n│   │   └── ...\n│   └── batch_summary.json      # Aggregate results\n└── results.csv                 # Summary CSV (after `upskill runs`)\n```\n\nDisable with `--no-log-runs`.\n\n## Configuration\n\n### upskill config (`./upskill.config.yaml`)\n\n```yaml\nskill_generation_model: sonnet   # Default skill generation model\neval_model: haiku               # Default evaluation model (optional)\ntest_gen_model: null            # Optional test generation model\nskills_dir: ./skills            # Where to save skills\nruns_dir: ./runs                # Where to save run logs\nmax_refine_attempts: 3          # Refinement iterations\n```\n\n`test_gen_model` fallback behavior:\n\n- CLI `--test-gen-model` overrides config for a single run.\n- If set, test generation uses `test_gen_model`.\n- If unset, test generation falls back to `skill_generation_model`.\n- For `eval`/`benchmark`, this intentionally uses `skill_generation_model` (not `eval_model`) so generated tests stay\n  stable when sweeping multiple evaluation models.\n\nBackward compatibility: `model` is still accepted in config files as a legacy alias 
for\n`skill_generation_model`.\n\nConfig lookup order:\n\n1. `UPSKILL_CONFIG` environment variable (path)\n2. `./upskill.config.yaml` (project local)\n3. `~/.config/upskill/config.yaml` (legacy fallback)\n\n### FastAgent config (`fastagent.config.yaml`)\n\nPlace in your project directory to customize FastAgent settings:\n\n```yaml\ndefault_model: sonnet\n\nlogger:\n  progress_display: true\n  show_chat: false\n  streaming: markdown\n\n# MCP servers (optional)\nmcp:\n  servers:\n    fetch:\n      command: \"uvx\"\n      args: [\"mcp-server-fetch\"]\n```\n\n## Environment Variables\n\n```bash\n# Required for Anthropic models\nANTHROPIC_API_KEY=sk-ant-...\n\n# Required for OpenAI models\nOPENAI_API_KEY=sk-...\n\n# Optional: custom endpoints\nANTHROPIC_BASE_URL=http://localhost:8080\nOPENAI_API_BASE=http://localhost:11434/v1\n\n# For local models (generic provider)\nGENERIC_BASE_URL=http://localhost:8080/v1\nGENERIC_API_KEY=local  # Optional, defaults to \"local\"\n```\n\n## Python API\n\n```python\nfrom upskill import (\n    generate_skill,\n    generate_tests,\n    evaluate_skill,\n    refine_skill,\n    Config,\n)\n\n# Load configuration\nconfig = Config.load()\n\n# Generate a skill\nskill = await generate_skill(\n    \"parse JSON Schema files\",\n    model=\"sonnet\",\n    config=config,\n)\n\n# Generate test cases\ntests = await generate_tests(\"parse JSON Schema files\")\n\n# Evaluate the skill\nresults = await evaluate_skill(\n    skill,\n    tests,\n    model=\"haiku\",\n    config=config,\n)\n\nprint(f\"Skill lift: {results.skill_lift:.0%}\")\nprint(f\"Token savings: {results.token_savings:.0%}\")\nprint(f\"Is beneficial: {results.is_beneficial}\")\n\n# Refine based on failures\nif not results.is_beneficial:\n    from upskill.evaluate import get_failure_descriptions\n    failures = get_failure_descriptions(results)\n    improved_skill = await refine_skill(skill, failures)\n```\n\n## Model Format\n\nupskill uses FastAgent model 
format:\n\n```\n\u003cprovider\u003e.\u003cmodel\u003e.\u003creasoning_effort?\u003e\n```\n\n**Examples:**\n- `sonnet` - Anthropic Claude Sonnet (alias)\n- `haiku` - Anthropic Claude Haiku (alias)\n- `opus` - Anthropic Claude Opus (alias)\n- `anthropic.claude-sonnet-4-20250514` - Full model name\n- `openai.gpt-4.1` - OpenAI GPT-4.1\n- `openai.o3-mini.low` - OpenAI o3-mini with low reasoning effort\n- `generic.llama3.2:latest` - Local model via Ollama\n- `generic.my-model` - Local model via llama.cpp or other OpenAI-compatible server\n\n## Local Models\n\nupskill supports local models through any OpenAI-compatible endpoint (Ollama, llama.cpp, vLLM, etc.).\n\n**Quick start with Ollama:**\n\n```bash\n# Start Ollama (default port 11434)\nollama serve\n\n# Configure endpoint via fast-agent config/env, then evaluate\nupskill eval ./skills/my-skill/ --model generic.llama3.2:latest\n```\n\n**With llama.cpp server:**\n\n```bash\n# Start llama.cpp server\n./llama-server -m model.gguf --port 8080\n\n# Configure endpoint via fast-agent config/env, then evaluate\nupskill eval ./skills/my-skill/ --model generic.my-model\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fupskill","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Fupskill","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fupskill/lists"}