{"id":46992816,"url":"https://github.com/approximatelabs/pencil-puzzle-bench","last_synced_at":"2026-03-11T14:01:12.861Z","repository":{"id":343356896,"uuid":"1177203876","full_name":"approximatelabs/pencil-puzzle-bench","owner":"approximatelabs","description":null,"archived":false,"fork":false,"pushed_at":"2026-03-10T00:54:35.000Z","size":2648,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-10T08:22:03.149Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/approximatelabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-09T19:53:59.000Z","updated_at":"2026-03-10T00:53:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/approximatelabs/pencil-puzzle-bench","commit_stats":null,"previous_names":["approximatelabs/pencil-puzzle-bench"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/approximatelabs/pencil-puzzle-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/approximatelabs%2Fpencil-puzzle-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/approximatelabs%2Fpencil-puzzle-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/approximatelabs%2Fpencil-puzzle-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/approximatelabs%2Fpencil-puzzle-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/approximatelabs","download_url":"https://codeload.github.com/approximatelabs/pencil-puzzle-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/approximatelabs%2Fpencil-puzzle-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30383109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-11T13:58:07.161Z","status":"ssl_error","status_checked_at":"2026-03-11T13:58:06.476Z","response_time":84,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-03-11T14:00:43.881Z","updated_at":"2026-03-11T14:01:12.853Z","avatar_url":"https://github.com/approximatelabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pencil Puzzle Bench\n\nA benchmark for evaluating LLM reasoning through pencil puzzles — constraint-satisfaction problems closely related to NP-complete problems — with deterministic, step-level verification.\n\n**Paper:** [Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning](https://arxiv.org/abs/2603.02119)\n\n![Model success rates by strategy (left) with puzzle solve gallery (right)](assets/figure1.png)\n\n## Features\n\n- **62,000+ puzzles** across **94 puzzle types** sourced from [puzz.link](https://puzz.link), each with a unique solution verified by [cspuz-solver2](https://github.com/semiexp/cspuz-solver2) (SAT-based constraint solver)\n- **Step-level verification** via [pzpr.js](https://github.com/robx/pzprjs) — every intermediate board state is checked against variety-specific constraints, localizing errors to the exact rule violated (e.g., \"Two shaded cells are adjacent\" in Nurikabe, \"Loop crosses itself\" in Slitherlink)\n- **Dense reward signals** — per-move constraint checking enables process supervision and reinforcement learning\n- **Gymnasium environment** for RL training\n- **Verifiers environment** for GRPO training with [verifiers](https://github.com/PrimeIntellect-ai/verifiers)\n- **Benchmark harness** with pluggable strategies, built on [pydantic-ai](https://ai.pydantic.dev/)\n- **Local model support** — works with LM Studio, ollama, vLLM, or any OpenAI-compatible endpoint\n\n## Install\n\n### Prerequisites\n\n**Node.js** is required — the puzzle engine (pzpr.js) runs in a Node.js subprocess via [JSPyBridge](https://github.com/nicedouble/JSPyBridge).\n\n```bash\n# macOS\nbrew install node\n\n# Ubuntu/Debian\napt install nodejs npm\n\n# Or use nvm\nnvm install 20\n```\n\n### Install ppbench\n\n```bash\npip install ppbench          # Core: puzzles, gym env, pydantic-ai framework\npip install ppbench[all]     # + OpenAI and Anthropic API clients\n```\n\nInstall only the providers you need:\n\n```bash\npip install ppbench[openai]      # + OpenAI client\npip install ppbench[anthropic]   # + Anthropic client\n```\n\nLocal models (LM Studio, ollama, vLLM) work with just the base install — no provider extras needed.\n\n### Docker\n\nMinimal Dockerfile for a clean environment:\n\n```dockerfile\nFROM python:3.12-slim\nRUN apt-get update \u0026\u0026 apt-get install -y --no-install-recommends nodejs npm curl \\\n    \u0026\u0026 rm -rf /var/lib/apt/lists/*\nRUN curl -LsSf https://astral.sh/uv/install.sh | sh\nENV PATH=\"/root/.local/bin:$PATH\"\nWORKDIR /app\nCOPY . .\nRUN uv sync --all-extras\n```\n\n```bash\ndocker build -t ppbench .\ndocker run --rm ppbench uv run python -c \\\n  \"from ppbench import Puzzle, load_dataset; print(len(load_dataset('golden_30')), 'puzzles')\"\n```\n\nSee [`Dockerfile.test`](Dockerfile.test) for a full smoke test.\n\n## Quick Start\n\n```python\nfrom ppbench import Puzzle, load_dataset\n\n# Load a puzzle from the benchmark\nrecords = load_dataset(\"golden\")  # 300 curated puzzles\nrecord = records[0]\n\n# Create and interact with a puzzle\npuzzle = Puzzle.from_url(record[\"puzzlink_url\"])\nprint(puzzle.pid)           # e.g., \"sudoku\"\nprint(puzzle.get_state())   # board state as text\n\n# Apply moves and check\npuzzle.send_move(\"mouse,left,3,5\")\nviolations = puzzle.check()      # [] if valid\nsolved = puzzle.is_complete()    # True when solved\n\n# Render as SVG\nsvg = puzzle.svg()\n```\n\n## Gymnasium Environment\n\n```python\nfrom ppbench import PuzzleEnv, load_dataset\n\nrecords = load_dataset(\"golden\")\nenv = PuzzleEnv(puzzle_url=records[0][\"puzzlink_url\"])\nobs, info = env.reset()\n\n# Standard Gymnasium loop\nobs, reward, terminated, truncated, info = env.step(\"mouse,left,3,5\")\n# reward = 1.0 when puzzle is solved\n```\n\n## Verifiers Environment (GRPO Training)\n\n```python\nfrom ppbench.verifiers_env import load_environment\n\nenv = load_environment(\"golden\")\n# Use with verifiers GRPO training pipeline\n```\n\n## Datasets\n\n| Name | Size | Description |\n|------|------|-------------|\n| `golden` / `golden_300` | 300 puzzles | Curated benchmark (20 types × 15 each), bundled |\n| `golden_30` | 30 puzzles | Small subset for expensive agentic strategies, bundled |\n| `full` | 62,231 puzzles | All 94 puzzle types ([HuggingFace](https://huggingface.co/datasets/bluecoconut/pencil-puzzle-bench)) |\n\n```python\nfrom ppbench import load_dataset\n\n# Bundled datasets (no download needed)\nrecords = load_dataset(\"golden\")      # 300 puzzles\nrecords = load_dataset(\"golden_30\")   # 30 puzzles\n```\n\n### Full dataset\n\nDownload from HuggingFace (one JSONL file):\n\n```bash\n# Using the huggingface-cli\npip install huggingface-hub\nhuggingface-cli download bluecoconut/pencil-puzzle-bench \\\n    full_dataset.jsonl \\\n    --repo-type dataset \\\n    --local-dir ppbench/data\n```\n\nThen load it:\n\n```python\nrecords = load_dataset(\"full\")  # 62,231 puzzles\n```\n\nEach record contains:\n- `puzzlink_url` — canonical puzzle URL (encodes the puzzle state)\n- `pid` — puzzle type (e.g., `\"sudoku\"`, `\"slither\"`, `\"tapa\"`)\n- `number_required_moves` — minimum moves to solve\n- `solution` — decoded solution with `moves_full`, `moves_required`, `moves_hint`\n\n## Running the Benchmark\n\n```bash\n# Set API keys for the providers you want to use\nexport OPENAI_API_KEY=...\nexport ANTHROPIC_API_KEY=...\n\n# Quick test: 1 puzzle, both strategies\nuv run python -u examples/quick_test.py\n\n# Multi-model comparison\nuv run python -u examples/multi_model.py\n\n# Sweep an entire dataset\nuv run python -u examples/dataset_sweep.py\n\n# Analyze results\nuv run python -u examples/analyze_results.py\n```\n\nResults are cached per (model, strategy, puzzle) — re-runs skip completed work.\n\n### Using a local model\n\nPoint the benchmark at any OpenAI-compatible endpoint (LM Studio, ollama, vLLM, etc.):\n\n```bash\n# Default: http://127.0.0.1:1234/v1 (LM Studio default)\nexport LOCAL_API_BASE=http://127.0.0.1:1234/v1\n\n# Or for ollama:\nexport LOCAL_API_BASE=http://127.0.0.1:11434/v1\n```\n\n```python\nimport asyncio\nfrom ppbench.benchmarks import run, DirectAskStrategy\n\nasyncio.run(run(\n    models=[\"local/qwen3.5-35b-a3b\"],\n    strategies=[DirectAskStrategy],\n    dataset=\"golden_30\",\n))\n```\n\nThe model name after `local/` is passed directly to the server — use whatever model name your server expects.\n\n## Architecture Guide\n\n### Core primitive\n\n`ppbench.Puzzle` wraps a headless [pzpr.js](https://github.com/robx/pzprjs) puzzle instance running in Node.js. pzpr.js is the engine behind the [puzz.link](https://puzz.link) puzzle community — it implements 100+ puzzle varieties with full rule checking, error localization, and completion detection. You send moves, check the board against variety-specific constraints, and verify completeness — all deterministically, no browser needed.\n\nThe benchmark harness uses [pydantic-ai](https://ai.pydantic.dev/) to build LLM agents that interact with puzzles.\n\n### Models\n\nModels use `provider/model-name@variant` syntax, parsed by `ppbench/benchmarks/model_list.py`:\n\n| Provider | Example | Notes |\n|----------|---------|-------|\n| `openai` | `openai/gpt-4o` | Direct OpenAI API |\n| `openai` | `openai/gpt-5.2@medium` | Responses API with reasoning effort |\n| `anthropic` | `anthropic/claude-sonnet-4-6` | Direct Anthropic API |\n| `anthropic` | `anthropic/claude-opus-4-6@thinking` | Extended thinking |\n| `google` | `google/gemini-3-pro` | Gemini API |\n| `xai` | `xai/grok-4-1-fast` | xAI API (OpenAI-compatible) |\n| `openrouter` | `openrouter/deepseek/deepseek-v3.2` | OpenRouter (OpenAI-compatible) |\n| `local` | `local/my-model` | Any local OpenAI-compatible server |\n\nEach provider maps to a [pydantic-ai model class](https://ai.pydantic.dev/models/). To add a new provider, add a `_build_*` function in `ppbench/benchmarks/model_list.py`.\n\n### Strategies\n\nA strategy defines **what** the agent does. The harness handles execution, retries, usage tracking, and caching.\n\nSubclass `ppbench.benchmarks.Strategy` and implement two methods:\n\n```python\nfrom ppbench.benchmarks import Strategy, AgentConfig, StrategyResult\nfrom pydantic_ai import Agent\nfrom ppbench import Puzzle\n\nclass MyStrategy(Strategy):\n    requires_tools = False  # True if your agent uses tool calling\n\n    def build_agent(self, puzzle, model_obj, model_name):\n        \"\"\"Create the agent and prompt. No execution happens here.\"\"\"\n        agent = Agent(model_obj, system_prompt=\"Solve this puzzle...\")\n        prompt = f\"Puzzle: {puzzle.get_string_repr()}\"\n        return AgentConfig(agent=agent, prompt=prompt)\n\n    def extract_result(self, puzzle, deps, output):\n        \"\"\"Interpret the agent's output. Replay moves, check success.\"\"\"\n        moves = parse_moves_somehow(output)\n        fresh = Puzzle.from_url(puzzle.url)\n        for m in moves:\n            fresh.send_move(m)\n        return StrategyResult(\n            is_success=fresh.isComplete(),\n            parsed_moves=moves,\n            raw_output=output,\n        )\n```\n\nKey concepts:\n- `build_agent()` returns an `AgentConfig` with a [pydantic-ai Agent](https://ai.pydantic.dev/agents/), a prompt, and optional deps\n- `extract_result()` replays moves on a fresh puzzle to verify the solution\n- `on_node()` is an optional per-step hook (compactification, progress tracking, etc.)\n- `strategy_id` is a hash of your strategy's source — harness changes don't invalidate the cache\n\nBuilt-in strategies for reference:\n- [`direct_ask.py`](ppbench/benchmarks/strategies/direct_ask.py) — single-shot, no tools (simplest)\n- [`basic_agentic.py`](ppbench/benchmarks/strategies/basic_agentic.py) — tool-calling agent with make_move, check_board, reset\n\n### The `run()` API\n\n```python\nimport asyncio\nfrom ppbench.benchmarks import run, DirectAskStrategy, BasicAgenticSolve\n\nresults = asyncio.run(run(\n    models=[\"openai/gpt-4o\", \"local/qwen3.5-35b\"],\n    strategies=[DirectAskStrategy, BasicAgenticSolve],\n    dataset=\"golden_30\",       # or \"golden\", \"golden_300\"\n    puzzle_types=[\"tapa\"],     # optional: filter by puzzle type\n    n_puzzles=5,               # optional: limit count\n    concurrency=10,            # max concurrent tasks\n    seed=42,                   # reproducible puzzle sampling\n))\n```\n\nResults are saved as JSONL index + JSON artifacts in `output/runs/`. See [`examples/analyze_results.py`](examples/analyze_results.py) for how to load and inspect them.\n\n## Move Format\n\nPuzzles use pzpr.js input commands:\n\n| Move | Description |\n|------|-------------|\n| `mouse,left,x,y` | Left click at (x,y) |\n| `mouse,right,x,y` | Right click at (x,y) |\n| `mouse,left,x1,y1,x2,y2` | Drag from (x1,y1) to (x2,y2) |\n| `mouse,leftx2,x,y` | Double left-click at (x,y) |\n| `mouse,rightx2,x,y` | Double right-click at (x,y) |\n| `key,1` | Press key '1' |\n\n## License\n\nMIT\n\n## Citation\n\n```bibtex\n@article{waugh2026ppbench,\n    title={Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning},\n    author={Justin Waugh},\n    year={2026},\n    eprint={2603.02119},\n    archivePrefix={arXiv},\n    primaryClass={cs.AI},\n    url={https://arxiv.org/abs/2603.02119}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapproximatelabs%2Fpencil-puzzle-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapproximatelabs%2Fpencil-puzzle-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapproximatelabs%2Fpencil-puzzle-bench/lists"}