{"id":23933488,"url":"https://github.com/arcprize/arc-agi-benchmarking","last_synced_at":"2026-03-11T00:11:28.200Z","repository":{"id":259354675,"uuid":"877410068","full_name":"arcprize/arc-agi-benchmarking","owner":"arcprize","description":"Testing baseline LLMs performance across various models","archived":false,"fork":false,"pushed_at":"2026-01-30T17:35:28.000Z","size":1993,"stargazers_count":336,"open_issues_count":12,"forks_count":60,"subscribers_count":9,"default_branch":"main","last_synced_at":"2026-01-31T08:51:33.254Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arcprize.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2024-10-23T15:47:15.000Z","updated_at":"2026-01-23T16:33:58.000Z","dependencies_parsed_at":"2025-12-03T00:04:10.394Z","dependency_job_id":null,"html_url":"https://github.com/arcprize/arc-agi-benchmarking","commit_stats":null,"previous_names":["arcprizeorg/model_baseline","arcprize/model_baseline","arcprize/arc-agi-benchmarking"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/arcprize/arc-agi-benchmarking","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arcprize%2Farc-agi-benchmarking","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arcprize%2Farc-agi-benchmarking/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/arcprize%2Farc-agi-benchmarking/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arcprize%2Farc-agi-benchmarking/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arcprize","download_url":"https://codeload.github.com/arcprize/arc-agi-benchmarking/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arcprize%2Farc-agi-benchmarking/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30362966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-10T21:41:54.280Z","status":"ssl_error","status_checked_at":"2026-03-10T21:40:59.357Z","response_time":106,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-06T00:29:43.422Z","updated_at":"2026-03-11T00:11:28.187Z","avatar_url":"https://github.com/arcprize.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Building"],"sub_categories":["大语言对话模型及数据","LLM Models"],"readme":"# Testing systems with ARC-AGI\n\nRun ARC-AGI tasks against multiple model adapters (OpenAI, Anthropic, Gemini, Fireworks, Grok, OpenRouter, X.AI, custom etc.) 
with built-in rate limiting, retries, and scoring.\n\n## Quickstart\n0) Clone this repo:\n```bash\ngit clone https://github.com/arcprize/arc-agi-benchmarking.git\ncd arc-agi-benchmarking\n```\n\n1) Install (installs all adapters + SDKs):\n```bash\npip install .\n```\n\n2) Single-task dry run (no API keys) with the local `random-baseline` adapter:\n```bash\npython main.py \\\n  --data_dir data/sample/tasks \\\n  --config random-baseline \\\n  --task_id 66e6c45b \\\n  --save_submission_dir submissions/random-single \\\n  --log-level INFO\n```\n\n3) Run all bundled sample tasks with the random solver:\n```bash\npython cli/run_all.py \\\n  --config random-baseline \\\n  --data_dir data/sample/tasks \\\n  --save_submission_dir submissions/random-baseline-sample \\\n  --log-level INFO\n```\n\n4) Score the outputs you just generated:\n```bash\npython src/arc_agi_benchmarking/scoring/scoring.py \\\n  --task_dir data/sample/tasks \\\n  --submission_dir submissions/random-baseline-sample \\\n  --results_dir results/random-baseline-sample\n```\n\nIf using the random solver, expect all the attempts to be incorrect.\n\nIf you want to run real models, change the `config` and add the corresponding API keys (see Data and Config sections below).\n\n## Data\n\nRather than using the sample data in `data/sample/tasks/`, you can use the real ARC-AGI tasks from the following repositories:\n\n* ARC-AGI-1 (2019): `git clone https://github.com/fchollet/ARC-AGI.git data/arc-agi`\n* ARC-AGI-2 (2025): `git clone https://github.com/arcprize/ARC-AGI-2.git data/arc-agi`\n\n## CLI parameters\n- `--data_dir`: Folder containing ARC task `.json` files (e.g., `data/sample/tasks`).\n- `--config`: Model config name from `models.yml`. Used by both single-task and batch.\n- `--save_submission_dir`: Where to write outputs. Use the same flag for single-task and batch (alias: `--submissions-root` remains for backward compatibility). 
Recommended structure: `\u003csave_submission_dir\u003e/\u003cconfig\u003e/\u003cversion\u003e/\u003ceval_type\u003e/`, e.g., `submissions/gpt-4o-2024-11-20/v1/public_eval/`.\n- `--num_attempts`: How many attempts per test pair (per task).\n- `--retry_attempts`: Internal retries within an attempt if the provider call fails.\n- `--log-level`: `DEBUG|INFO|WARNING|ERROR|CRITICAL|NONE`.\n- `--enable-metrics`: Toggle metrics collection (saved in `metrics_output/`).\n- Scoring-specific:\n  - `--submission_dir`: Where your run wrote outputs.\n  - `--results_dir`: Where to write aggregated metrics/results.\n\n## Running models\nFor runs beyond the Quickstart:\n- Batch (recommended): `python cli/run_all.py` with your task list, model config, data dir, submission dir, attempts/retries, and log level. Uses asyncio, provider rate limiting, and tenacity retries; outputs land in `--save_submission_dir` (e.g., `submissions/\u003cconfig\u003e/\u003cversion\u003e/\u003ceval_type\u003e`). `run_all` handles one model config per invocation; run multiple configs by invoking it multiple times (see `run_all_configs_local.sh` for a pattern).\n- Single task (debug): `python main.py` with a single `--config`, `--task_id`, and your data dir/save directory and log level.\nSee the CLI parameters section for flag details.\n\n## Configuring models and providers\nTests are run based on model configs. Model configs hold the configuration (max output tokens, temperature, pricing etc.) for each test.\n\nModel configs live in `src/arc_agi_benchmarking/models.yml`. 
Example:\n  ```yaml\n  - name: \"gpt-4o-2024-11-20\"   # config name you reference on the CLI; typically includes the reasoning level for clarity (e.g., \"-basic\", \"-advanced\")\n    model_name: \"gpt-4o-2024-11-20\"  # provider’s actual model id\n    provider: \"openai\"         # must match an adapter\n    max_output_tokens: 4096    # optional; provider-specific\n    temperature: 0.0           # optional; provider-specific\n    pricing:\n      date: \"2024-11-20\"\n      input: 5.00              # USD per 1M input tokens\n      output: 15.00            # USD per 1M output tokens\n  ```\n  - Standard fields: `name`, `model_name`, `provider`, `pricing` (`input`/`output` per 1M tokens, `date` for traceability).\n  - Provider kwargs: any extra keys become `kwargs` and are passed directly to the SDK (e.g., `temperature`, `max_output_tokens`, `stream`, etc.).\n- Rate limits live in `provider_config.yml` (`rate`, `period` per provider).\n- Environment: set provider keys (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `HUGGING_FACE_API_KEY`). Copy `.env.example` to `.env` and fill in.\n\n### Testing a new model\n\n1. Add a new model config: add an entry to `models.yml` with an existing provider; then use `--config \u003cname\u003e` on the CLI\n\n2. If you're adding a new adapter:\n    1. Create `src/arc_agi_benchmarking/adapters/\u003cprovider\u003e.py` implementing `ProviderAdapter`\n    2. Export it from `src/arc_agi_benchmarking/adapters/__init__.py`\n    3. Add a branch in `main.py` (and any factories) so the provider name is recognized\n    4. Add a config entry in `models.yml` pointing to `provider: \"\u003cprovider\u003e\"`\n    5. 
[Optional] Add tests (adapters and parsing) to cover basic flows\n\n## Scoring\n\nTo score a run you'll need 1) your test's submission directory and 2) the source taskset (which contains the solutions).\n\nScore a run:  \n\n```bash\npython src/arc_agi_benchmarking/scoring/scoring.py \\\n  --task_dir \u003cdata_dir\u003e/data/evaluation \\\n  --submission_dir submissions/\u003cconfig\u003e \\\n  --results_dir results/\u003cconfig\u003e\n```\n\n## Contributing and testing\n- Add new providers/models in `src/arc_agi_benchmarking/adapters` and `models.yml`.\n- Run tests: `pytest`.\n- Use the bundled sample task + submission for quick scoring checks.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farcprize%2Farc-agi-benchmarking","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farcprize%2Farc-agi-benchmarking","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farcprize%2Farc-agi-benchmarking/lists"}