https://github.com/rootly-ai-labs/efcb

The Environment-Free Coding Benchmark (EFCB) suite
https://github.com/rootly-ai-labs/efcb

benchmark coding llm

Last synced: 4 months ago
JSON representation

The Environment-Free Coding Benchmark (EFCB) suite

Host: GitHub
URL: https://github.com/rootly-ai-labs/efcb
Owner: Rootly-AI-Labs
Created: 2025-05-25T15:53:14.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-10T16:09:38.000Z (about 1 year ago)
Last Synced: 2025-10-20T05:58:47.629Z (8 months ago)
Topics: benchmark, coding, llm
Homepage:
Size: 3.95 MB
Stars: 3
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          
Environment-Free Coding Benchmark ⚗️


This coding benchmark aims to offer robust and challenging tasks to evaluate

language models on, without requiring the need to use a coding environment to

simplify the benchmark setup.

| Prototype Ready? | Production-Ready? | Name | Categories | Eval | Description |

| --- | --- | --- | --- | --- | --- |

| Yes ✅| Not yet | GMCQ-easy | understanding | mcq | choose the correct code diff that closes a PR |

| Yes ✅| Not yet | GMCQ-hard | understanding | mcq | choose the correct code diff that closes a PR, with additions and removals |

| Yes ✅| Not yet| Reverse-QA | text generation | LLM-as-a-judge | generate an issue title and body given code diff |

| Yes ✅ | Not yet | MPR-Gen | code generation | LLM-as-a-judge | given a maksed section of a code diff, generate the code |

| Yes ✅ | Not yet | Reverse-QA-Hallu | hallucination detection | LLM-as-a-judge | uses an LLM-as-a-judge to determine whether the model hallucinated |

## Overall Results (v0.3) 🏆

| Model Name                   | GMCQ-Easy | GMCQ-Hard | MPR-Gen | Reverse-QA | Reverse-QA-Hallu | EFCB Score |

|------------------------------|----------|--------|-------------|--------|-----------|--------|

| together/llama-3.3-70b-turbo |    0.803 |  0.476 |   3.74 |  7.23   | 0.965 |   0.482 |

| openai/o4-mini               |    0.892 |  0.868 |    3.67 |   7.41 |    0.946 |    0.584 |

## Detailed Results (v0.3) 📊

### GMCQ-Easy (v0.3)

| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |

|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|

| together/llama-3.3-70b-turbo |    0.912 |  0.699 |       0.845 |    0.8 |     0.759 |  0.801 |              0.803 |

| openai/o4-mini               |    0.975 |  0.869 |       0.872 |  0.893 |     0.824 |  0.918 |              0.892 |

| anthropic/claude-4-sonnet    |     0.95 |  0.801 |       0.851 |  0.856 |     0.786 |  0.862 |              0.851 |

### GMCQ-Hard (v0.3)

| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |

|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|

| together/llama-3.3-70b-turbo |    0.452 |  0.392 |       0.574 |   0.47 |     0.465 |    0.5 |              0.476 |

| openai/o4-mini               |    0.883 |  0.824 |       0.919 |  0.879 |     0.834 |  0.867 |              0.868 |

### MPR-Gen (v0.3)

| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |

|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|

| together/llama-3.3-70b-turbo |     7.48 |    6.9 |        6.88 |   7.18 |       7.5 |   7.45 |               7.23 |

| openai/o4-mini               |      7.5 |   7.35 |        7.49 |   7.29 |      7.73 |   7.11 |               7.41 |

### Reverse-QA (v0.3)

| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |

|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|

| together/llama-3.3-70b-turbo |     3.94 |    3.4 |        4.01 |   3.97 |      3.84 |   3.28 |              3.740 |

| openai/o4-mini               |     4.26 |   3.36 |        4.07 |    3.4 |      3.74 |   3.18 |              3.668 |

### Reverse-QA-Hallu (v0.3)

| Model Name                   | mastodon | indigo | cloudflared | duckdb | tailscale | chroma | unweighted average |

|------------------------------|----------|--------|-------------|--------|-----------|--------|--------------------|

| together/llama-3.3-70b-turbo |    0.941 |  0.983 |       0.959 |  0.967 |     0.984 |  0.954 |              0.965 |

| openai/o4-mini               |    0.946 |  0.949 |       0.966 |  0.944 |     0.957 |  0.913 |              0.946 |

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/rootly-ai-labs/efcb

Awesome Lists containing this project

README

Environment-Free Coding Benchmark ⚗️