{"id":49539899,"url":"https://github.com/nicktcode/llm-eval-framework","last_synced_at":"2026-05-02T14:09:01.916Z","repository":{"id":351645172,"uuid":"1211821286","full_name":"nicktcode/llm-eval-framework","owner":"nicktcode","description":"Config-driven LLM evaluation framework measuring factual accuracy, instruction adherence, hallucination, and consistency","archived":false,"fork":false,"pushed_at":"2026-04-15T21:05:29.000Z","size":72,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-15T23:12:10.351Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nicktcode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-15T19:29:33.000Z","updated_at":"2026-04-15T21:05:33.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/nicktcode/llm-eval-framework","commit_stats":null,"previous_names":["nicktcode/llm-eval-framework"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/nicktcode/llm-eval-framework","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicktcode%2Fllm-eval-framework","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicktcode%2Fllm-eval-framework/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicktcode%2Fllm-eval-frame
work/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicktcode%2Fllm-eval-framework/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nicktcode","download_url":"https://codeload.github.com/nicktcode/llm-eval-framework/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicktcode%2Fllm-eval-framework/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32536672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-02T12:25:33.646Z","status":"ssl_error","status_checked_at":"2026-05-02T12:24:51.733Z","response_time":132,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-02T14:09:01.259Z","updated_at":"2026-05-02T14:09:01.903Z","avatar_url":"https://github.com/nicktcode.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llm-eval-framework\n\nConfig-driven framework for evaluating LLM outputs across four dimensions: factual accuracy, instruction adherence, hallucination detection, and output consistency. Uses the LLM-as-judge pattern with Claude evaluating Claude.\n\n## Why I built this\n\nI kept running into the same problem: I'd tweak a system prompt or change the temperature and have no structured way to tell if things got better or worse. 
Existing benchmarks are great for leaderboard comparisons, but they don't let you define your own rubrics or test your own prompts. I wanted something closer to a unit test suite for model behavior.\n\nThe LLM-as-judge approach has real problems (more on that below), but it's fast enough for iterating and the framework is set up to show you where the judge contradicts itself, so you know when to be skeptical.\n\n## What it measures\n\nFour dimensions, each scored 1-5 by a judge model.\n\n*Factual accuracy* checks if the model gets verifiable facts right. I test it with questions that have known answers (chemical symbols, historical dates, that sort of thing) and compare the response against expected facts. A fast string-match heuristic runs alongside the LLM judge for sanity checking.\n\n*Instruction adherence* checks whether the model follows formatting constraints. Things like \"respond in exactly 5 lines\" or \"return valid JSON with these keys.\" This dimension has both an LLM judge score and automated heuristic checks (counting lines, parsing JSON, checking word counts). The heuristics catch cases where the judge says the response looks fine but the format is technically wrong.\n\n*Hallucination detection* gives the model a context paragraph and asks it questions. Some questions are answerable from the text, some aren't. A good model should say \"the text doesn't mention that\" instead of making something up. The scorer tracks word overlap between the response and source context as an additional grounding signal.\n\n*Consistency* runs the same prompt 5 times and measures how much the answers vary. I use pairwise semantic similarity judgments (via the LLM judge) rather than exact string matching, since two correct answers can be worded differently. 
High variance on factual questions usually means the model is guessing.\n\n## Architecture\n\n```\nllm-eval-framework/\n    config.yaml              # Model, judge, experiment settings\n    prompts/\n        test_prompts.json    # 42 test prompts across difficulty levels\n        rubrics.yaml         # Scoring rubrics for each dimension\n    eval/\n        runner.py            # Main evaluation orchestrator\n        judges.py            # LLM-as-judge API calls and parsing\n        metrics.py           # Aggregation and statistics\n        dimensions/\n            factual.py       # Factual accuracy scoring\n            adherence.py     # Instruction following checks\n            hallucination.py # Grounding and hallucination detection\n            consistency.py   # Multi-trial consistency scoring\n    analysis/\n        visualize.py         # Chart generation (colorblind-friendly)\n        report.py            # Markdown report generation\n    tests/\n        test_judges.py\n        test_metrics.py\n        test_runner.py\n```\n\n## Setup\n\n```bash\ngit clone https://github.com/nicktcode/llm-eval-framework.git\ncd llm-eval-framework\npip install -r requirements.txt\nexport ANTHROPIC_API_KEY=your-key-here\n```\n\n## Usage\n\nRun a full evaluation:\n```bash\npython -m eval.runner --experiment full --temperature 0.0\n```\n\nCompare temperature settings:\n```bash\npython -m eval.runner --experiment temperature\n```\n\nCompare system prompts (minimal vs. detailed):\n```bash\npython -m eval.runner --experiment system_prompt\n```\n\nTest judge self-agreement:\n```bash\npython -m eval.runner --experiment judge_agreement\n```\n\nRun tests (no API key needed):\n```bash\npython -m pytest tests/ -v\n```\n\n## Design decisions\n\nRubric definitions are separate from scoring logic. The rubrics live in `prompts/rubrics.yaml` and the code that applies them lives in the dimension modules. 
You can change scoring criteria without touching Python, and different teams can maintain their own rubric files for their use cases.\n\nEverything configurable goes in `config.yaml`. Model name, temperatures to compare, system prompts, number of consistency trials. I didn't want magic numbers buried in source files.\n\nFor instruction adherence I run deterministic heuristic checks (line counts, JSON validation, word counts) in parallel with the LLM judge. When they disagree, that's useful data about the judge's reliability on that particular check type.\n\nThe visualization module uses the IBM Design Library color palette, which is designed for accessibility across common forms of color vision deficiency.\n\nSome prompts are designed to bait the model into hallucinating by asking about things not in the provided context. The hallucination detector recognizes a correct refusal (\"the text doesn't mention that\") as a positive signal rather than penalizing it.\n\n## Known limitations\n\nI want to be clear about the problems with LLM-as-judge:\n\n- Claude judging Claude will probably inflate scores. A more honest test would cross-evaluate across model families.\n- Longer responses tend to score higher even when a shorter answer is just as good. The rubrics push against this but don't fully solve it.\n- In pairwise consistency comparisons, the response listed first might be favored. I haven't added position randomization yet.\n- The 1-5 scale is coarse. Some evaluations would benefit from finer-grained scoring, but the judge gets unreliable when you ask it to distinguish between adjacent points on a wider scale.\n\n## Sample results\n\nResults below are from a run using `claude-3-5-sonnet-20241022` with the minimal system prompt. 
Single run, not averaged across sessions.\n\n### Scores at temperature 0.0\n\n| Dimension | Mean | Std Dev | Min | Max | N |\n|-----------|------|---------|-----|-----|---|\n| Factual Accuracy | 4.58 | 0.67 | 3 | 5 | 12 |\n| Instruction Adherence | 4.17 | 0.94 | 2 | 5 | 12 |\n| Hallucination Detection | 4.75 | 0.46 | 4 | 5 | 8 |\n| Consistency | 4.23 | 0.71 | 3 | 5 | 8 |\n\n### Temperature comparison\n\n| Condition | Factual | Adherence | Hallucination | Consistency | Overall |\n|-----------|---------|-----------|---------------|-------------|---------|\n| temp 0.0 | 4.58 | 4.17 | 4.75 | 4.23 | 4.43 |\n| temp 0.5 | 4.42 | 3.92 | 4.50 | 3.81 | 4.16 |\n| temp 1.0 | 4.17 | 3.58 | 4.13 | 3.19 | 3.77 |\n\nNo surprise: lower temperature gives more consistent and accurate outputs. Consistency shows the biggest drop at high temperature, which is what you'd expect since temperature directly controls sampling randomness.\n\nThe interesting thing is that instruction adherence degrades faster than factual accuracy. I think factual knowledge is baked into the weights and doesn't move much with sampling noise. But following format constraints requires holding instructions in working memory across the full generation, and that's more fragile.\n\n### System prompt comparison\n\n| Condition | Factual | Adherence | Hallucination | Consistency | Overall |\n|-----------|---------|-----------|---------------|-------------|---------|\n| detailed | 4.67 | 4.33 | 4.88 | 4.15 | 4.51 |\n| minimal | 4.58 | 4.17 | 4.75 | 4.23 | 4.43 |\n\nThe detailed system prompt helps a bit, but less than I expected. Hallucination detection benefits the most from explicit grounding instructions.\n\n### Judge self-agreement\n\nOver 3 trials scoring the same 5 prompt-response pairs:\n\n- Exact agreement rate: 80%\n- Mean score deviation: 0.20\n\nWhen the judge disagreed with itself it was always by 1 point (a 4 vs a 5, that kind of thing). 
I never saw a 2-point swing, which is encouraging for the judge's self-consistency even though its scoring is not perfectly deterministic.\n\n## What I'd build next\n\nIf I were taking this further:\n\n1. Position randomization for pairwise consistency judgments to control for ordering effects.\n2. Cross-model evaluation. Use a different model family as judge to quantify self-preference bias.\n3. A web UI for browsing individual prompt results instead of reading markdown tables.\n4. RAG-specific evaluation where retrieval quality is scored separately from generation quality.\n5. Rubric versioning to track how scoring criteria change over time and whether those changes actually improve signal.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicktcode%2Fllm-eval-framework","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnicktcode%2Fllm-eval-framework","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicktcode%2Fllm-eval-framework/lists"}