{"id":50970769,"url":"https://github.com/evalstate/birch-html","last_synced_at":"2026-06-19T02:02:23.381Z","repository":{"id":360056840,"uuid":"1247569875","full_name":"evalstate/birch-html","owner":"evalstate","description":"SKILL.md, Benchmark, GEPA Loop and Evaluation Data for an HTML generation skill","archived":false,"fork":false,"pushed_at":"2026-05-24T20:12:04.000Z","size":2246,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-24T21:06:42.821Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/evalstate.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-23T13:47:01.000Z","updated_at":"2026-05-24T20:12:08.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/evalstate/birch-html","commit_stats":null,"previous_names":["evalstate/birch-html"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/evalstate/birch-html","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evalstate%2Fbirch-html","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evalstate%2Fbirch-html/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evalstate%2Fbirch-html/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e
valstate%2Fbirch-html/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/evalstate","download_url":"https://codeload.github.com/evalstate/birch-html/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evalstate%2Fbirch-html/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34514285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-19T02:00:06.005Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-19T02:02:22.580Z","updated_at":"2026-06-19T02:02:23.374Z","avatar_url":"https://github.com/evalstate.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Birch HTML Skill\n\nA skill for producing self-contained HTML Artifacts in a consistent visual style. 
This specific style, and the idea of producing fully self-contained HTML outputs, are from:\n\nhttps://thariqs.github.io/html-effectiveness/\nhttps://github.com/ThariqS/html-effectiveness/\n\n## Screenshots\n\n### VLM checker\n\n\u003cimg src=\"vlm-checker.png\" alt=\"VLM checker report screenshot\" width=\"800\"\u003e\n\n### Comparison Report\n\n\u003cimg src=\"comparison-report.png\" alt=\"Benchmark comparison report screenshot\" width=\"800\"\u003e\n\n### Sample Skill Output\n\n\u003cimg src=\"sample-output.png\" alt=\"Sample Birch HTML skill output screenshot\" width=\"800\"\u003e\n\n## What's Here\n\n- The `birch-html` skill itself.\n- An evaluation benchmark that runs both deterministic and visual tests on the generated outputs. The outputs from the released run are included.\n- A GEPA loop for candidate experimentation.\n\n## Why does this exist?\n\nI loved the clean `birchline` style from the \"HTML Effectiveness\" blog and examples, and wanted to use it myself. It was pretty straightforward to build a stylesheet and a set of rough templates and instructions. The examples shared by @thariqs all had slightly different stylesheets, so I did a bit of merging.\n\nI was using this as a plugin in `fast-agent`, using my session model (usually GPT-5.5) to do the data processing and crunching in a look-ahead turn, and then sending it over to Codex Spark for rendering, pointing at the stylesheet. That was quick, efficient and consistent.\n\nBut... what about producing good, consistent *standalone* artifacts in a sophisticated common visual style? Turns out to be harder than it looks... 
\n\n## What does the Skill do?\n\nProduces visually consistent standalone HTML artifacts, with some standard presentation recipes.\n\nSee for yourself: \u003chttps://evalstate-birch-html.hf.space/analysis/report.html\u003e.\n\nThis has been benchmarked with 13 different models to see how well they manage to follow the instructions and produce a visually consistent output.\n\nThe SKILL.md recommends the model use an HTML template and inject the target CSS rather than reproducing it via inference. The skill also recommends, and gives access to, the deterministic quality checker the benchmark uses.\n\nA few different approaches have been used while benchmarking - the published set does not include additional guidance or help for the model beyond the skill. This has a negative impact on some models that otherwise performed adequately. I decided to leave this as is (although the failures are easily recoverable) as it's a good demonstration of the difficulties of skill engineering and portability.\n\n## How are the scores generated?\n\nBoth deterministic and vision model assessments are conducted on the outputs. GPT-5.5 is used as the Vision Model for the checker.\n\n## Should I use this?\n\nThere's no reason not to, but I'd say in practice:\n - This is an excellent starting point for your own recipes.\n - Link to the existing stylesheet in the template; this includes a \"cheat\" to do a string replacement in the finished artifact to make the model's job a little easier.\n - The plugin version described above performs better - having gone through this exercise I'll tidy it up with some of the learnings.\n\nIn its current form this is more of an art project than a practical tool. The benchmark results _are_ interesting though.\n\n## The GEPA loop\n\nThe GEPA loop is good for candidate generation and testing, but I wouldn't treat it as a full optimization path.\n\nThis Skill was mainly developed with GPT-5.5. 
One example of using the loop was to exert pressure on the length of the `SKILL.md` to keep it \u003c~220 lines, specifically for GPT-5.5. (OpenAI, please please fix this).\n\nIt's also useful for leaning in to model assumptions; and answering questions like; \n - Is the `SKILL.md` simply saying \"use tailwind classes and this colour theme\" more effective? \n - Is `.mdx` going to mog this approach completely (assumption is yes).\n - What happens if I optimize for a specific model with 20 loops?\n - Can I improve VLM prompting to pick up other defects consistently (e.g. train a VLM assessor against arbitrary inputs) \n - After manually inspecting different model outputs, can I use the VLM to optimise for my preferences? \n\n\u003e GPT-5.5 does respond correctly to being told the length of the file early in the `SKILL.md` and reads further chunks :)\n\n### Running GEPA\n\nThe active skill loop is:\n\n```bash\nuv run scripts/optimize_birch_skill_with_gepa.py --help\n```\n\nIt mutates only the LLM-facing skill files:\n\n- `skill/SKILL.md`\n- selected files under `skill/recipes/`\n\nEach candidate is copied into an isolated directory under:\n\n```text\neval-runs/gepa/\u003crun-name\u003e/candidate-XXX/\n```\n\nso failed generations or bad candidates do not modify the working tree.\n\nQuick smoke score of the current seed skill, without proposing changes:\n\n```bash\nBIRCH_VISION_REVIEW_CMD=off \\\nuv run scripts/optimize_birch_skill_with_gepa.py \\\n  --run-name smoke-seed \\\n  --task-model codexspark \\\n  --reflection-model codexresponses.gpt-5.5 \\\n  --evaluate-only \\\n  --eval-jobs 2 \\\n  --generation-timeout 900 \\\n  --eval-timeout 1800\n```\n\nSmall GEPA search:\n\n```bash\nBIRCH_VISION_REVIEW_CMD=off \\\nuv run scripts/optimize_birch_skill_with_gepa.py \\\n  --run-name birch-skill-codexspark-p4 \\\n  --task-model codexspark \\\n  --reflection-model codexresponses.gpt-5.5 \\\n  --proposals 4 \\\n  --eval-jobs 2 \\\n  --generation-timeout 900 \\\n  
--eval-timeout 2400\n```\n\nUse a previous best candidate as the seed:\n\n```bash\nuv run scripts/optimize_birch_skill_with_gepa.py \\\n  --run-name continue-from-best \\\n  --seed-skill-dir eval-runs/gepa/birch-skill-codexspark-p4/best/skills/birch-html \\\n  --task-model codexspark \\\n  --reflection-model codexresponses.gpt-5.5 \\\n  --proposals 4\n```\n\nThe script expects GEPA sources at `~/source/gepa/src` by default. Override with:\n\n```bash\n--gepa-src /path/to/gepa/src\n```\n\n### GEPA ASI / feedback sources\n\nGEPA scoring and feedback currently use two evidence streams:\n\n1. **Deterministic ASI** — always on. Each candidate runs\n   `scripts/run_skill_evals.py`, which generates the five eval artifacts and\n   runs `scripts/check_birch_renderings.py` across desktop/mobile/deep\n   viewports. Failures and warnings feed `score.json` and actionable feedback.\n2. **VLM ASI** — optional screenshot smoke review. By default the GEPA evaluator\n   calls `scripts/review_birch_screenshots_with_vision.py` after deterministic\n   screenshots exist. VLM findings are surfaced in `score.json` and feedback as\n   visual smoke evidence.\n\nDisable VLM review for cheaper deterministic-only loops:\n\n```bash\nBIRCH_VISION_REVIEW_CMD=off uv run scripts/optimize_birch_skill_with_gepa.py ...\n```\n\nUse the default VLM reviewer:\n\n```bash\nuv run scripts/optimize_birch_skill_with_gepa.py ...\n```\n\nOr provide a custom reviewer command. 
It must accept:\n\n```text\n\u003ccandidate_dir\u003e \u003creports_dir\u003e\n```\n\nand write:\n\n```text\n\u003creports_dir\u003e/vision-findings.json\n```\n\nin the same shape as `scripts/review_birch_screenshots_with_vision.py`.\n\nCandidate outputs to inspect:\n\n```text\neval-runs/gepa/\u003crun-name\u003e/candidate-XXX/artifacts/\neval-runs/gepa/\u003crun-name\u003e/candidate-XXX/reports/\neval-runs/gepa/\u003crun-name\u003e/candidate-XXX/score.json\neval-runs/gepa/\u003crun-name\u003e/best/\n```\n\n\n## Active layout\n\n| Path | Purpose |\n|---|---|\n| `styles/birch-system.css` | Canonical Birch CSS tokens, layout primitives, and semantic components. |\n| `docs/birch-llm-style-guide.md` | LLM-facing Birch generation contract. |\n| `docs/birch-recipes/` | Recipe guidance for common artifact types. |\n| `scripts/birch-copy.js` | Optional browser enhancer for copyable code/command blocks. |\n| `scripts/birch_mpl.py` | Matplotlib helpers for Birch-styled chart generation. |\n| `scripts/check_birch_renderings.py` | Static/browser rendering checker for Birch artifacts. |\n| `scripts/run_skill_evals.py` | Single-model skill benchmark runner. |\n| `scripts/run_multimodel_skill_evals.py` | Multi-model benchmark runner. |\n| `scripts/optimize_birch_skill_with_gepa.py` | GEPA loop for skill/recipe candidate experiments. |\n| `evals/` | Prompt fixtures, sources, and rubrics for the eval harness. |\n| `eval-runs/` | Committed baseline and comparison artifacts. 
|\n\n\n## Running checks\n\nRun one benchmark model:\n\n```bash\nuv run scripts/run_skill_evals.py \\\n  --model codexspark \\\n  --label smoke-codexspark \\\n  --jobs 2\n```\n\nRun the multi-model benchmark:\n\n```bash\nuv run scripts/run_multimodel_skill_evals.py \\\n  --experiment my-run \\\n  --model codexspark \\\n  --model codexresponses.gpt-5.5 \\\n  --vision \\\n  --jobs 2\n```\n\nRegenerate and publish the browsing site to Hugging Face:\n\n```bash\nhf auth login\nscripts/publish_hf_space.sh\n```\n\nThe publish helper rebuilds `results/clean-final` and `analysis/report.html`,\nsyncs the static payload to `hf://buckets/\u003chf-user\u003e/birch-html`, and uploads a\nsmall Docker Space that mounts the bucket read-only and serves the report.\n\nCurrent published URL:\n\n```text\nhttps://evalstate-birch-html.hf.space/analysis/report.html\n```\n\nUseful overrides:\n\n```bash\nHF_NAMESPACE=evalstate scripts/publish_hf_space.sh\nLABEL_SUFFIX=publish-run scripts/publish_hf_space.sh\nDRY_RUN=1 scripts/publish_hf_space.sh\n```\n\nRun the Birch rendering checker directly against generated artifacts:\n\n```bash\nuv run  scripts/check_birch_renderings.py \\\n  --artifact eval-runs/skill-baseline-gpt55/numeric-data.html \\\n  --out reports/birch-rendering-check.json \\\n  --markdown reports/birch-rendering-check.md\n```\n\nUse `--pair reference.html:candidate.html` only when there is a meaningful\nreference artifact; eval-generated artifacts use candidate-only visual smoke\nchecks rather than same:same screenshot comparisons.\n\nThe checker writes generated reports/screenshots to `reports/` by default. 
That\nfolder is intentionally not part of the clean top-level baseline; historical\noutputs live in `archive/generated-reports/`.\n\n## Attribution and license\n\nThis project includes code and materials derived from the original \"HTML Effectiveness\"\nproject by **Thariq Shihipar / Anthropic PBC**.\n\nThe original project is licensed under the Apache License, Version 2.0. This\nrepository is also distributed under the Apache License, Version 2.0. See\n[`LICENSE`](LICENSE) for details.\n\nModifications in this repository include:\n\n- `SKILL.md` and associated evaluation and harness scripts;\n- benchmark result packaging;\n- deterministic/VLM report consolidation;\n- generated analysis tables;\n- SVG figures;\n- static HTML report microsite;\n- publication-oriented documentation.\n\nWhere source files retain original copyright or license notices, those notices\nhave been preserved.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevalstate%2Fbirch-html","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevalstate%2Fbirch-html","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevalstate%2Fbirch-html/lists"}