{"id":50505552,"url":"https://github.com/kudosscience/cognitive-ability-benchmark","last_synced_at":"2026-06-02T15:31:28.653Z","repository":{"id":351536510,"uuid":"1185510219","full_name":"kudosscience/cognitive-ability-benchmark","owner":"kudosscience","description":null,"archived":false,"fork":false,"pushed_at":"2026-04-15T13:27:21.000Z","size":171,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-15T13:35:12.934Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kudosscience.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-18T16:54:38.000Z","updated_at":"2026-04-15T13:27:55.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/kudosscience/cognitive-ability-benchmark","commit_stats":null,"previous_names":["kudosscience/cognitive-ability-benchmark"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/kudosscience/cognitive-ability-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kudosscience%2Fcognitive-ability-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kudosscience%2Fcognitive-ability-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kudosscience%2Fcognitive-ability-benchmark/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kudosscience%2Fcognitive-ability-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kudosscience","download_url":"https://codeload.github.com/kudosscience/cognitive-ability-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kudosscience%2Fcognitive-ability-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33829340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-02T15:31:27.387Z","updated_at":"2026-06-02T15:31:28.647Z","avatar_url":"https://github.com/kudosscience.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CogniFlex: Executive Functions Benchmark for AGI Evaluation\n\n## Kaggle Writeup\n\n### Project Name\n\nCogniFlex: Habit Override, Rule Shift, and Conflict Planning\n\n### Your Team\n\n- Team: Kudos Science\n- Repository: kudosscience/cognitive-ability-benchmark\n\n### Problem Statement\n\nCurrent LLM evaluations often over-reward memorized pattern completion and under-measure executive control. 
This creates a major gap for AGI-oriented assessment: a model may look strong on broad reasoning benchmarks while still failing to inhibit habitual responses, adapt under rule changes, or plan under conflicting constraints.\n\nThis project targets the Executive Functions track with one central question:\n\nWhat does model behavior look like when the highest-probability next token is usually wrong?\n\nThe benchmark isolates three executive sub-capabilities:\n\n- Inhibitory control: suppressing familiar continuations when explicit local overrides appear\n- Cognitive flexibility: adapting to mid-task semantic rule shifts\n- Complex planning: avoiding tempting but irreversible trap actions to reach a constrained goal\n\n### Task \u0026 benchmark construction\n\nCogniFlex is implemented as three deterministic task generators with exact-match scoring:\n\n- habit_override\n  - Environment: alphabet ring traversal with explicit per-node inverse rules\n  - Failure mode tested: perseverative forward stepping despite override instructions\n  - Scoring: normalized exact path match\n- rule_shift\n  - Environment: arithmetic/bitwise operation chains with a defined semantic shift point\n  - Failure mode tested: continuing pre-shift semantics after context update\n  - Scoring: integer extraction plus exact numeric match\n- conflict_planning\n  - Environment: resource conversion graph with a unique shortest valid plan and decoy traps\n  - Failure mode tested: greedy local choice that consumes critical resources and dead-ends\n  - Scoring: normalized exact action sequence match\n\nBenchmark composition approach:\n\n- Programmatic generation only (no manual labels)\n- Difficulty levels 1..5 with controlled scaling\n- Seeded reproducibility for every sample\n- Verifiable ground truth per sample\n- Explicit output formats in prompts to reduce grader ambiguity\n- Public data and private scoring keys are separated to support held-out evaluation\n- Public datasets are answer-free and 
metadata-sanitized\n- Official scoring uses signed private bundles and public-data integrity checks\n\nKaggle integration artifacts were implemented and generated:\n\n- Adapter code: benchmark/adapters/kaggle_benchmarks_adapter.py\n- Task notebook scaffold: kaggle_assets/notebooks/cogniflex_task.py\n- Task datasets: kaggle_assets/data/*.jsonl\n- Benchmark metadata: kaggle_assets/benchmark_metadata.json\n- Export script: scripts/export_kaggle_assets.py\n- Private bundle export script (organizer-only): scripts/export_private_answer_bundle.py\n- Official scoring script (organizer-only): scripts/score_submission.py\n- Adapter run config: configs/kaggle_adapter.json\n\n### Dataset\n\nData is fully synthetic and generated on demand from benchmark/utils/generator.py through CogniFlexSuite.\n\nCurrent benchmark export configuration:\n\n- samples per task: 250\n- tasks: 3\n- total samples: 750\n- seed: 20260415\n\nEach public JSONL row contains:\n\n- sample_id (int)\n- prompt (str)\n- difficulty (int in [1, 5])\n- metadata (dict, sanitized task context only)\n\nPublic task-specific metadata examples:\n\n- habit_override: start_letter, override_letters, override_delta, steps\n- rule_shift: initial_state, shift_after_index, operations\n- conflict_planning: start_inventory, actions (without trap/canonical labels)\n\nPrivate held-out answers are exported separately into a signed private bundle and are not shipped with the public benchmark assets.\n\nStatistical significance and defensibility notes:\n\n- Difficulty-balanced generation is cyclical by index\n- Every item has deterministic reconstruction through seed + generator logic\n- Conflict-planning generator includes tests verifying unique shortest solutions\n\n### Technical details\n\nCore modules:\n\n- benchmark/cogniflex_suite.py\n- benchmark/tasks/base_task.py\n- benchmark/tasks/habit_override.py\n- benchmark/tasks/rule_shift.py\n- benchmark/tasks/conflict_planning.py\n- benchmark/utils/generator.py\n- 
benchmark/adapters/kaggle_benchmarks_adapter.py\n- benchmark/evaluation/pilot_sweep.py\n- benchmark/evaluation/secure_evaluator.py\n\nPilot sweep implementation:\n\n- Script: scripts/run_pilot_sweep.py\n- Config: configs/pilot_sweep.json\n- Outputs:\n  - outputs/pilot_sweep_predictions.jsonl\n  - outputs/pilot_sweep_task_summary.csv\n  - outputs/pilot_sweep_difficulty_summary.csv\n  - outputs/pilot_sweep_overall.csv\n  - outputs/pilot_sweep_report.md\n\nThe sweep runner simulates a model ladder (small -\u003e medium -\u003e large -\u003e frontier -\u003e oracle upper bound) with task-wise skill and difficulty sensitivity controls. A tuned `reasoning-frontier-xl` profile was added to better mirror expected Kaggle model-family separation. This gives a tunable, reproducible early signal for discriminatory power before expensive model runs.\n\nReproducibility commands:\n\n```bash\npython scripts/export_kaggle_assets.py\nexport COGNIFLEX_SIGNING_KEY=your_signing_key_here\npython scripts/export_private_answer_bundle.py\npython scripts/run_pilot_sweep.py\npython -m pytest tests/\n```\n\n### Security architecture for official scoring\n\nCogniFlex now uses a split public/private evaluation design:\n\n- Public release bundle (participant-visible): prompts, difficulty, and sanitized metadata only.\n- Private organizer bundle (not shipped): held-out canonical answers, dataset hashes, and HMAC signature.\n- Official scorer: `scripts/score_submission.py` verifies signed private bundle and dataset hashes before scoring.\n\nScoring hardening details:\n\n- Strict JSONL submission schema (`task_name`, `sample_id`, `prediction`) with duplicate-key rejection.\n- Unicode normalization and zero-width/control-character stripping before parsing.\n- Task-specific canonical parsers with strict grammar:\n  - Habit Override: exact `A \u003e B \u003e ...` tokenized path format.\n  - Rule Shift: strict integer-only output (no free-form text extraction).\n  - Conflict Planning: strict action-ID CSV 
with duplicate-action rejection.\n- Unknown sample IDs or task IDs are rejected.\n- Missing predictions score `0.0`.\n\n### Results, insights, and conclusions\n\nPilot sweep overall scores (750 samples per model):\n\n| Model | Overall Score |\n| --- | ---: |\n| pattern-matcher-small | 0.241 |\n| rule-aware-medium | 0.480 |\n| planner-large | 0.669 |\n| reasoning-frontier-xl | 0.844 |\n| oracle-upper-bound | 0.993 |\n\nTask-level means:\n\n| Model | Habit Override | Rule Shift | Conflict Planning |\n| --- | ---: | ---: | ---: |\n| pattern-matcher-small | 0.348 | 0.232 | 0.144 |\n| rule-aware-medium | 0.572 | 0.488 | 0.380 |\n| planner-large | 0.684 | 0.696 | 0.628 |\n| reasoning-frontier-xl | 0.848 | 0.892 | 0.792 |\n| oracle-upper-bound | 0.984 | 0.996 | 1.000 |\n\nObserved signal quality:\n\n- Strong five-tier rank ordering across capability tiers indicates useful discriminatory power\n- Difficulty degradation is visible, especially for smaller profiles on rule_shift and conflict_planning\n- No ceiling effect for realistic profiles, and no floor collapse for all models at all levels\n\nWhat this benchmark reveals beyond generic reasoning tests:\n\n- Models that appear competent on easy symbolic tasks can still fail sharply under late rule reversals\n- Planning performance collapses faster than arithmetic flexibility when trap actions are introduced\n- Executive control errors are systematic and classifiable, not merely random hallucinations\n\nConclusion:\n\nCogniFlex produces interpretable gradients and explicit failure modes tied to executive-function constructs, making it a robust candidate for AGI progress profiling on this track.\n\n### Organizational affiliations\n\n- Submitted by Kudos Science\n- No additional institutional affiliation declared in this repository\n\n### Final submission checklist mapped to Kaggle form fields\n\n| Kaggle form field | Submission value | Source in repository |\n| --- | --- | --- |\n| Project Name | CogniFlex: Habit Override, 
Rule Shift, and Conflict Planning | This README, Project Name section |\n| Team / Author | Kudos Science | This README, Your Team section |\n| Track | Executive Functions | This README, Problem Statement section |\n| Problem Statement | Executive-control gap in current LLM evaluation | This README, Problem Statement section |\n| Task / Benchmark Construction | Deterministic generators + exact-match scorers | This README, Task \u0026 benchmark construction section |\n| Dataset Description | 3 synthetic tasks, 250 samples each, seeded reproducibility | This README, Dataset section |\n| Technical Details | Core modules + scripts + reproducibility commands | This README, Technical details section |\n| Results / Insights / Conclusions | Five-model pilot ladder + task-level breakdown | This README, Results section and outputs/pilot_sweep_report.md |\n| Organizational Affiliations | Submitted by Kudos Science | This README, Organizational affiliations section |\n| References \u0026 Citations | DeepMind + Kaggle docs/SDK sources | This README, References \u0026 citations section |\n\nPre-submit completion checklist:\n\n- [ ] Paste Project Name exactly as shown above.\n- [ ] Confirm Team name is `Kudos Science`.\n- [ ] Select Executive Functions track.\n- [ ] Paste Problem Statement and Task Construction sections.\n- [ ] Confirm dataset counts match: 750 total samples (250 per task).\n- [ ] Attach notebook and dataset files from the release manifest below.\n- [ ] Paste updated Results table after final Kaggle model runs.\n- [ ] Verify references are present and links resolve.\n- [ ] Confirm public datasets do not contain `expected_output`.\n- [ ] Confirm private answer bundle is generated outside the public release package.\n- [ ] Confirm signed-bundle scoring passes integrity checks before final submission.\n\n### Release package manifest (exact files to attach/upload)\n\nRequired public benchmark assets:\n\n- kaggle_assets/benchmark_metadata.json\n- 
kaggle_assets/notebooks/cogniflex_task.py\n- kaggle_assets/data/habit_override.jsonl\n- kaggle_assets/data/rule_shift.jsonl\n- kaggle_assets/data/conflict_planning.jsonl\n\nOrganizer-only assets (do not ship publicly):\n\n- outputs/private/private_answer_key.json (generated by scripts/export_private_answer_bundle.py)\n\nRequired writeup evidence artifacts:\n\n- outputs/pilot_sweep_overall.csv\n- outputs/pilot_sweep_task_summary.csv\n- outputs/pilot_sweep_difficulty_summary.csv\n- outputs/pilot_sweep_report.md\n- README.md\n\nOptional reproducibility bundle:\n\n- configs/kaggle_adapter.json\n- configs/pilot_sweep.json\n- scripts/export_kaggle_assets.py\n- scripts/export_private_answer_bundle.py\n- scripts/score_submission.py\n- scripts/run_pilot_sweep.py\n\n### Security principle compliance\n\n| Principle | Implementation status | Where implemented |\n| --- | --- | --- |\n| Strict sandbox isolation between agent and evaluator | Implemented in architecture (separate organizer scorer and private bundle) | benchmark/evaluation/secure_evaluator.py, scripts/score_submission.py |\n| No reference answers in task configs | Implemented | benchmark/adapters/kaggle_benchmarks_adapter.py, kaggle_assets/data/*.jsonl |\n| Robust, non-injectable input parsing | Implemented | benchmark/evaluation/secure_evaluator.py |\n| Sanitized inputs to LLM judges | Implemented via sanitization pipeline and deterministic scoring path | benchmark/evaluation/secure_evaluator.py, kaggle_assets/notebooks/cogniflex_task.py |\n| Adversarial pre-publication testing | Implemented | tests/test_secure_evaluator.py |\n| Tamper-proof evaluation data | Implemented via dataset hashes + signed private bundle | benchmark/evaluation/secure_evaluator.py |\n| Scoring mechanisms resilient to output manipulation | Implemented via strict canonical parsers and schema checks | benchmark/evaluation/secure_evaluator.py |\n| Secret held-out answers (not shipped with benchmark) | Implemented | 
scripts/export_private_answer_bundle.py, .gitignore |\n\n### References \u0026 citations\n\n1. DeepMind. Measuring progress toward AGI: A cognitive framework.\n2. Kaggle Benchmarks documentation: [https://www.kaggle.com/docs/benchmarks](https://www.kaggle.com/docs/benchmarks)\n3. Kaggle Benchmarks SDK repository: [https://github.com/Kaggle/kaggle-benchmarks](https://github.com/Kaggle/kaggle-benchmarks)\n4. Kaggle Benchmarks Cookbook: [https://github.com/Kaggle/kaggle-benchmarks/blob/ci/cookbook.md](https://github.com/Kaggle/kaggle-benchmarks/blob/ci/cookbook.md)\n5. Kaggle Benchmarks Quick Start: [https://github.com/Kaggle/kaggle-benchmarks/blob/ci/quick_start.md](https://github.com/Kaggle/kaggle-benchmarks/blob/ci/quick_start.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkudosscience%2Fcognitive-ability-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkudosscience%2Fcognitive-ability-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkudosscience%2Fcognitive-ability-benchmark/lists"}