https://github.com/twangodev/devpost-hacks
Can LLMs judge hackathons?
- Host: GitHub
- URL: https://github.com/twangodev/devpost-hacks
- Owner: twangodev
- Created: 2026-04-23T02:41:02.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-05T01:58:34.000Z (about 1 month ago)
- Last Synced: 2026-05-05T03:37:53.933Z (about 1 month ago)
- Language: Python
- Size: 121 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Can LLMs Judge Hackathons?
[Hugging Face collection](https://huggingface.co/collections/twangodev/devpost-hacks) · [Report (PDF)](report.pdf)
This repo evaluates whether LLMs can judge hackathons in agreement with human judges.
Scraping lives in a separate repo; this one covers enrichment, pairwise
judging, Bradley-Terry ranking, and dataset publishing.
We compare five judges:
1. **Qwen3.5-27B** — large reasoning teacher
2. **Qwen3.5-4B** — small open-weights reasoning peer
3. **Qwen3-4B-Instruct-2507** — the actual base under our LoRA
4. **twangodev/devpost-hacks-qwen3-4b-judge** — our 4B finetune distilled from the 27B teacher
5. **gpt-oss-20b** — cross-family open-weights baseline
Pairwise verdicts feed a Bradley-Terry model, which turns them into a ranking.
The primary evaluation is top-1 agreement with the human winners on Devpost.
The secondary evaluation is percentile correlation against the private MadHacks
Plackett-Luce ranking.
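To make the Bradley-Terry step concrete, here is a minimal sketch of fitting strengths from `(winner, loser)` verdicts with the standard minorization-maximization update, followed by a top-1 agreement check. It is an illustration, not this repo's implementation: the function names, the `smoothing` pseudo-count, and the toy verdicts are all assumptions.

```python
from collections import defaultdict


def fit_bradley_terry(verdicts, iters=500, tol=1e-10, smoothing=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update. `smoothing` is an assumed pseudo-count that
    keeps projects with zero wins at a small positive strength.
    """
    wins = defaultdict(float)
    counts = defaultdict(lambda: defaultdict(float))  # counts[i][j]: comparisons of i vs j
    for winner, loser in verdicts:
        wins[winner] += 1.0
        counts[winner][loser] += 1.0
        counts[loser][winner] += 1.0

    items = sorted(counts)
    p = {i: 1.0 / len(items) for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            denom = sum(c / (p[i] + p[j]) for j, c in counts[i].items())
            new_p[i] = (wins[i] + smoothing) / denom
        total = sum(new_p.values())
        new_p = {i: v / total for i, v in new_p.items()}  # normalize to sum to 1
        delta = max(abs(new_p[i] - p[i]) for i in items)
        p = new_p
        if delta < tol:
            break
    return p


def top1_agreement(strengths_by_event, human_winner_by_event):
    """Fraction of events where the top-ranked project is the human winner."""
    hits = sum(
        max(strengths, key=strengths.get) == human_winner_by_event[event]
        for event, strengths in strengths_by_event.items()
    )
    return hits / len(strengths_by_event)


# Toy verdicts (hypothetical project ids): a mostly beats b, b mostly beats c.
verdicts = [("a", "b")] * 3 + [("b", "a")] + [("b", "c")] * 3 + [("c", "b")] + [("a", "c")] * 2
strengths = fit_bradley_terry(verdicts)
ranking = sorted(strengths, key=strengths.get, reverse=True)
agreement = top1_agreement({"demo-hackathon": strengths}, {"demo-hackathon": "a"})
```

Sorting the fitted strengths gives the ranking; the agreement score is then averaged over hackathons.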
## Datasets
- [`twangodev/devpost-hacks`](https://huggingface.co/datasets/twangodev/devpost-hacks)
— 2,222 projects across 9 hackathons + GitHub READMEs
- [`twangodev/devpost-hacks-judgments`](https://huggingface.co/datasets/twangodev/devpost-hacks-judgments)
— SFT-format pairwise judgments from each of the five judges above; row's
`model` field disambiguates the source
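Because all five judges share one judgments dataset, a consumer will usually want to split rows by the `model` field first. The sketch below does that on in-memory rows. Only the `model` field is documented above; the example values and the `messages` field are hypothetical placeholders.

```python
from collections import defaultdict


def split_by_judge(rows):
    """Group judgment rows by the value of their `model` field."""
    by_judge = defaultdict(list)
    for row in rows:
        by_judge[row["model"]].append(row)
    return dict(by_judge)


# Hypothetical rows: the real schema and `model` values may differ.
rows = [
    {"model": "judge-a", "messages": []},
    {"model": "judge-b", "messages": []},
    {"model": "judge-a", "messages": []},
]
grouped = split_by_judge(rows)
```

With the Hugging Face `datasets` library, the rows would likely come from `load_dataset("twangodev/devpost-hacks-judgments")`. The split names are not stated here, so check the dataset card before indexing into the result.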
## Pipeline
```bash
uv sync
devpost enrich -s .json # GitHub READMEs → projects parquet
devpost pairs -c --n 500 # sample pairs from the HF dataset
docker compose up -d sglang-qwen3.5-27b # one of: sglang-qwen3.5-{27b,4b}, sglang-qwen3-4b-{2507,judge}, sglang-gpt-oss-20b
devpost judge --pairs data//pairs.jsonl --model Qwen/Qwen3.5-27B # resumable; matches the running service
devpost rank -c # BT ranking + recall@K to stdout
devpost summary # cross-judge eval w/ bootstrap CIs and pairwise stats
devpost export projects # or: devpost export judgments; stage for HF upload
```
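The `rank` and `summary` steps report recall@K and bootstrap confidence intervals. As a rough illustration of what those metrics mean, here is a minimal sketch. It assumes recall@K is the fraction of human winners found in a judge's top K and uses a percentile bootstrap over per-hackathon scores. The CLI's actual definitions may differ, and every function name below is hypothetical.

```python
import random


def recall_at_k(ranking, winners, k):
    """Fraction of human winners that appear in the first k ranked projects."""
    if not winners:
        return 0.0
    top_k = set(ranking[:k])
    return sum(w in top_k for w in winners) / len(winners)


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical data: one judge's ranking for one event, and per-event top-1 hits.
recall = recall_at_k(["a", "b", "c", "d"], {"a", "d"}, k=2)
top1_hits = [1, 0, 1, 1, 0, 1, 1, 1]
ci_low, ci_high = bootstrap_ci(top1_hits)
```

Resampling at the hackathon level treats each event as one observation, which is a conservative choice when only a handful of hackathons are available.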
External dependencies: the `gh` CLI (for README fetches) and Docker (for SGLang).
## Authors
Manas Joshi · Oliver Ohrt · Andrew Lou · Mari Garey · George Sukhotin · Ethan Feiges · Ben Friedl · James Ding