{"id":28712525,"url":"https://github.com/rootly-ai-labs/efcb","last_synced_at":"2026-02-21T04:04:17.324Z","repository":{"id":295837751,"uuid":"990150350","full_name":"Rootly-AI-Labs/efcb","owner":"Rootly-AI-Labs","description":"The Environment-Free Coding Benchmark (EFCB) suite","archived":false,"fork":false,"pushed_at":"2025-06-10T16:09:38.000Z","size":4141,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-20T05:58:47.629Z","etag":null,"topics":["benchmark","coding","llm"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rootly-AI-Labs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-25T15:53:14.000Z","updated_at":"2025-06-10T18:12:37.000Z","dependencies_parsed_at":"2025-05-27T16:52:21.489Z","dependency_job_id":null,"html_url":"https://github.com/Rootly-AI-Labs/efcb","commit_stats":null,"previous_names":["rootly-ai-labs/efcb"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Rootly-AI-Labs/efcb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rootly-AI-Labs%2Fefcb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rootly-AI-Labs%2Fefcb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rootly-AI-Labs%2Fefcb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rootly-AI-Labs%2Fefcb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/owners/Rootly-AI-Labs","download_url":"https://codeload.github.com/Rootly-AI-Labs/efcb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rootly-AI-Labs%2Fefcb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29672786,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-21T03:11:15.450Z","status":"ssl_error","status_checked_at":"2026-02-21T03:10:34.920Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","coding","llm"],"created_at":"2025-06-14T23:06:22.368Z","updated_at":"2026-02-21T04:04:17.307Z","avatar_url":"https://github.com/Rootly-AI-Labs.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eEnvironment-Free Coding Benchmark ⚗️\u003c/h1\u003e\n\nThis coding benchmark offers robust, challenging tasks for evaluating\nlanguage models without requiring a coding environment, which\nsimplifies the benchmark setup.\n\n| Prototype Ready? | Production-Ready? 
| Name | Categories | Eval | Description |\n| --- | --- | --- | --- | --- | --- |\n| Yes ✅| Not yet | GMCQ-easy | understanding | mcq | choose the correct code diff that closes a PR |\n| Yes ✅| Not yet | GMCQ-hard | understanding | mcq | choose the correct code diff that closes a PR, with additions and removals |\n| Yes ✅| Not yet| Reverse-QA | text generation | LLM-as-a-judge | generate an issue title and body given a code diff |\n| Yes ✅ | Not yet | MPR-Gen | code generation | LLM-as-a-judge | given a masked section of a code diff, generate the code |\n| Yes ✅ | Not yet | Reverse-QA-Hallu | hallucination detection | LLM-as-a-judge | uses an LLM-as-a-judge to determine whether the model hallucinated |\n\n## Overall Results (v0.3) 🏆\n\n| Model Name                   | GMCQ-Easy | GMCQ-Hard | MPR-Gen | Reverse-QA | Reverse-QA-Hallu | EFCB Score |\n|------------------------------|----------|--------|-------------|--------|-----------|--------|\n| together/llama-3.3-70b-turbo |    0.803 |  0.476 |   3.74 |  7.23   | 0.965 |   0.482 |\n| openai/o4-mini               |    0.892 |  0.868 |    3.67 |   7.41 |    0.946 |    0.584 |\n\n\n## Detailed Results (v0.3) 📊\n\n### GMCQ-Easy (v0.3)\n\n| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |\n|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|\n| together/llama-3.3-70b-turbo |    0.912 |  0.699 |       0.845 |    0.8 |     0.759 |  0.801 |              0.803 |\n| openai/o4-mini               |    0.975 |  0.869 |       0.872 |  0.893 |     0.824 |  0.918 |              0.892 |\n| anthropic/claude-4-sonnet    |     0.95 |  0.801 |       0.851 |  0.856 |     0.786 |  0.862 |              0.851 |\n\n### GMCQ-Hard (v0.3)\n\n| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average 
|\n|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|\n| together/llama-3.3-70b-turbo |    0.452 |  0.392 |       0.574 |   0.47 |     0.465 |    0.5 |              0.476 |\n| openai/o4-mini               |    0.883 |  0.824 |       0.919 |  0.879 |     0.834 |  0.867 |              0.868 |\n\n### MPR-Gen (v0.3)\n\n| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |\n|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|\n| together/llama-3.3-70b-turbo |     7.48 |    6.9 |        6.88 |   7.18 |       7.5 |   7.45 |               7.23 |\n| openai/o4-mini               |      7.5 |   7.35 |        7.49 |   7.29 |      7.73 |   7.11 |               7.41 |\n\n### Reverse-QA (v0.3)\n\n| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |\n|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|\n| together/llama-3.3-70b-turbo |     3.94 |    3.4 |        4.01 |   3.97 |      3.84 |   3.28 |              3.740 |\n| openai/o4-mini               |     4.26 |   3.36 |        4.07 |    3.4 |      3.74 |   3.18 |              3.668 |\n\n### Reverse-QA-Hallu (v0.3)\n\n| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |\n|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|\n| together/llama-3.3-70b-turbo |    0.941 |  0.983 |       0.959 |  0.967 |     0.984 |  0.954 |              0.965 |\n| openai/o4-mini               |    0.946 |  0.949 |       0.966 |  0.944 |     0.957 |  0.913 |              0.946 
|\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frootly-ai-labs%2Fefcb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frootly-ai-labs%2Fefcb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frootly-ai-labs%2Fefcb/lists"}