https://github.com/approximatelabs/pencil-puzzle-bench

Last synced: 3 months ago
JSON representation
Host: GitHub
URL: https://github.com/approximatelabs/pencil-puzzle-bench
Owner: approximatelabs
License: mit
Created: 2026-03-09T19:53:59.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-03-10T00:54:35.000Z (3 months ago)
Last Synced: 2026-03-10T08:22:03.149Z (3 months ago)
Language: Python
Size: 2.53 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # Pencil Puzzle Bench

A benchmark for evaluating LLM reasoning through pencil puzzles — constraint-satisfaction problems closely related to NP-complete problems — with deterministic, step-level verification.

**Paper:** [Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning](https://arxiv.org/abs/2603.02119)

![Model success rates by strategy (left) with puzzle solve gallery (right)](assets/figure1.png)

## Features

- **62,000+ puzzles** across **94 puzzle types** sourced from [puzz.link](https://puzz.link), each with a unique solution verified by [cspuz-solver2](https://github.com/semiexp/cspuz-solver2) (SAT-based constraint solver)

- **Step-level verification** via [pzpr.js](https://github.com/robx/pzprjs) — every intermediate board state is checked against variety-specific constraints, localizing errors to the exact rule violated (e.g., "Two shaded cells are adjacent" in Nurikabe, "Loop crosses itself" in Slitherlink)

- **Dense reward signals** — per-move constraint checking enables process supervision and reinforcement learning

- **Gymnasium environment** for RL training

- **Verifiers environment** for GRPO training with [verifiers](https://github.com/PrimeIntellect-ai/verifiers)

- **Benchmark harness** with pluggable strategies, built on [pydantic-ai](https://ai.pydantic.dev/)

- **Local model support** — works with LM Studio, ollama, vLLM, or any OpenAI-compatible endpoint

## Install

### Prerequisites

**Node.js** is required — the puzzle engine (pzpr.js) runs in a Node.js subprocess via [JSPyBridge](https://github.com/nicedouble/JSPyBridge).

```bash

# macOS

brew install node

# Ubuntu/Debian

apt install nodejs npm

# Or use nvm

nvm install 20

```

### Install ppbench

```bash

pip install ppbench          # Core: puzzles, gym env, pydantic-ai framework

pip install ppbench[all]     # + OpenAI and Anthropic API clients

```

Install only the providers you need:

```bash

pip install ppbench[openai]      # + OpenAI client

pip install ppbench[anthropic]   # + Anthropic client

```

Local models (LM Studio, ollama, vLLM) work with just the base install — no provider extras needed.

### Docker

Minimal Dockerfile for a clean environment:

```dockerfile

FROM python:3.12-slim

RUN apt-get update && apt-get install -y --no-install-recommends nodejs npm curl \

    && rm -rf /var/lib/apt/lists/*

RUN curl -LsSf https://astral.sh/uv/install.sh | sh

ENV PATH="/root/.local/bin:$PATH"

WORKDIR /app

COPY . .

RUN uv sync --all-extras

```

```bash

docker build -t ppbench .

docker run --rm ppbench uv run python -c \

  "from ppbench import Puzzle, load_dataset; print(len(load_dataset('golden_30')), 'puzzles')"

```

See [`Dockerfile.test`](Dockerfile.test) for a full smoke test.

## Quick Start

```python

from ppbench import Puzzle, load_dataset

# Load a puzzle from the benchmark

records = load_dataset("golden")  # 300 curated puzzles

record = records[0]

# Create and interact with a puzzle

puzzle = Puzzle.from_url(record["puzzlink_url"])

print(puzzle.pid)           # e.g., "sudoku"

print(puzzle.get_state())   # board state as text

# Apply moves and check

puzzle.send_move("mouse,left,3,5")

violations = puzzle.check()      # [] if valid

solved = puzzle.is_complete()    # True when solved

# Render as SVG

svg = puzzle.svg()

```

## Gymnasium Environment

```python

from ppbench import PuzzleEnv, load_dataset

records = load_dataset("golden")

env = PuzzleEnv(puzzle_url=records[0]["puzzlink_url"])

obs, info = env.reset()

# Standard Gymnasium loop

obs, reward, terminated, truncated, info = env.step("mouse,left,3,5")

# reward = 1.0 when puzzle is solved

```

## Verifiers Environment (GRPO Training)

```python

from ppbench.verifiers_env import load_environment

env = load_environment("golden")

# Use with verifiers GRPO training pipeline

```

## Datasets

| Name | Size | Description |

|------|------|-------------|

| `golden` / `golden_300` | 300 puzzles | Curated benchmark (20 types × 15 each), bundled |

| `golden_30` | 30 puzzles | Small subset for expensive agentic strategies, bundled |

| `full` | 62,231 puzzles | All 94 puzzle types ([HuggingFace](https://huggingface.co/datasets/bluecoconut/pencil-puzzle-bench)) |

```python

from ppbench import load_dataset

# Bundled datasets (no download needed)

records = load_dataset("golden")      # 300 puzzles

records = load_dataset("golden_30")   # 30 puzzles

```

### Full dataset

Download from HuggingFace (one JSONL file):

```bash

# Using the huggingface-cli

pip install huggingface-hub

huggingface-cli download bluecoconut/pencil-puzzle-bench \

    full_dataset.jsonl \

    --repo-type dataset \

    --local-dir ppbench/data

```

Then load it:

```python

records = load_dataset("full")  # 62,231 puzzles

```

Each record contains:

- `puzzlink_url` — canonical puzzle URL (encodes the puzzle state)

- `pid` — puzzle type (e.g., `"sudoku"`, `"slither"`, `"tapa"`)

- `number_required_moves` — minimum moves to solve

- `solution` — decoded solution with `moves_full`, `moves_required`, `moves_hint`

## Running the Benchmark

```bash

# Set API keys for the providers you want to use

export OPENAI_API_KEY=...

export ANTHROPIC_API_KEY=...

# Quick test: 1 puzzle, both strategies

uv run python -u examples/quick_test.py

# Multi-model comparison

uv run python -u examples/multi_model.py

# Sweep an entire dataset

uv run python -u examples/dataset_sweep.py

# Analyze results

uv run python -u examples/analyze_results.py

```

Results are cached per (model, strategy, puzzle) — re-runs skip completed work.

### Using a local model

Point the benchmark at any OpenAI-compatible endpoint (LM Studio, ollama, vLLM, etc.):

```bash

# Default: http://127.0.0.1:1234/v1 (LM Studio default)

export LOCAL_API_BASE=http://127.0.0.1:1234/v1

# Or for ollama:

export LOCAL_API_BASE=http://127.0.0.1:11434/v1

```

```python

import asyncio

from ppbench.benchmarks import run, DirectAskStrategy

asyncio.run(run(

    models=["local/qwen3.5-35b-a3b"],

    strategies=[DirectAskStrategy],

    dataset="golden_30",

))

```

The model name after `local/` is passed directly to the server — use whatever model name your server expects.

## Architecture Guide

### Core primitive

`ppbench.Puzzle` wraps a headless [pzpr.js](https://github.com/robx/pzprjs) puzzle instance running in Node.js. pzpr.js is the engine behind the [puzz.link](https://puzz.link) puzzle community — it implements 100+ puzzle varieties with full rule checking, error localization, and completion detection. You send moves, check the board against variety-specific constraints, and verify completeness — all deterministically, no browser needed.

The benchmark harness uses [pydantic-ai](https://ai.pydantic.dev/) to build LLM agents that interact with puzzles.

### Models

Models use `provider/model-name@variant` syntax, parsed by `ppbench/benchmarks/model_list.py`:

| Provider | Example | Notes |

|----------|---------|-------|

| `openai` | `openai/gpt-4o` | Direct OpenAI API |

| `openai` | `openai/gpt-5.2@medium` | Responses API with reasoning effort |

| `anthropic` | `anthropic/claude-sonnet-4-6` | Direct Anthropic API |

| `anthropic` | `anthropic/claude-opus-4-6@thinking` | Extended thinking |

| `google` | `google/gemini-3-pro` | Gemini API |

| `xai` | `xai/grok-4-1-fast` | xAI API (OpenAI-compatible) |

| `openrouter` | `openrouter/deepseek/deepseek-v3.2` | OpenRouter (OpenAI-compatible) |

| `local` | `local/my-model` | Any local OpenAI-compatible server |

Each provider maps to a [pydantic-ai model class](https://ai.pydantic.dev/models/). To add a new provider, add a `_build_*` function in `ppbench/benchmarks/model_list.py`.

### Strategies

A strategy defines **what** the agent does. The harness handles execution, retries, usage tracking, and caching.

Subclass `ppbench.benchmarks.Strategy` and implement two methods:

```python

from ppbench.benchmarks import Strategy, AgentConfig, StrategyResult

from pydantic_ai import Agent

from ppbench import Puzzle

class MyStrategy(Strategy):

    requires_tools = False  # True if your agent uses tool calling

    def build_agent(self, puzzle, model_obj, model_name):

        """Create the agent and prompt. No execution happens here."""

        agent = Agent(model_obj, system_prompt="Solve this puzzle...")

        prompt = f"Puzzle: {puzzle.get_string_repr()}"

        return AgentConfig(agent=agent, prompt=prompt)

    def extract_result(self, puzzle, deps, output):

        """Interpret the agent's output. Replay moves, check success."""

        moves = parse_moves_somehow(output)

        fresh = Puzzle.from_url(puzzle.url)

        for m in moves:

            fresh.send_move(m)

        return StrategyResult(

            is_success=fresh.isComplete(),

            parsed_moves=moves,

            raw_output=output,

        )

```

Key concepts:

- `build_agent()` returns an `AgentConfig` with a [pydantic-ai Agent](https://ai.pydantic.dev/agents/), a prompt, and optional deps

- `extract_result()` replays moves on a fresh puzzle to verify the solution

- `on_node()` is an optional per-step hook (compactification, progress tracking, etc.)

- `strategy_id` is a hash of your strategy's source — harness changes don't invalidate the cache

Built-in strategies for reference:

- [`direct_ask.py`](ppbench/benchmarks/strategies/direct_ask.py) — single-shot, no tools (simplest)

- [`basic_agentic.py`](ppbench/benchmarks/strategies/basic_agentic.py) — tool-calling agent with make_move, check_board, reset

### The `run()` API

```python

import asyncio

from ppbench.benchmarks import run, DirectAskStrategy, BasicAgenticSolve

results = asyncio.run(run(

    models=["openai/gpt-4o", "local/qwen3.5-35b"],

    strategies=[DirectAskStrategy, BasicAgenticSolve],

    dataset="golden_30",       # or "golden", "golden_300"

    puzzle_types=["tapa"],     # optional: filter by puzzle type

    n_puzzles=5,               # optional: limit count

    concurrency=10,            # max concurrent tasks

    seed=42,                   # reproducible puzzle sampling

))

```

Results are saved as JSONL index + JSON artifacts in `output/runs/`. See [`examples/analyze_results.py`](examples/analyze_results.py) for how to load and inspect them.

## Move Format

Puzzles use pzpr.js input commands:

| Move | Description |

|------|-------------|

| `mouse,left,x,y` | Left click at (x,y) |

| `mouse,right,x,y` | Right click at (x,y) |

| `mouse,left,x1,y1,x2,y2` | Drag from (x1,y1) to (x2,y2) |

| `mouse,leftx2,x,y` | Double left-click at (x,y) |

| `mouse,rightx2,x,y` | Double right-click at (x,y) |

| `key,1` | Press key '1' |

## License

MIT

## Citation

```bibtex

@article{waugh2026ppbench,

    title={Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning},

    author={Justin Waugh},

    year={2026},

    eprint={2603.02119},

    archivePrefix={arXiv},

    primaryClass={cs.AI},

    url={https://arxiv.org/abs/2603.02119}

}

```
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/approximatelabs/pencil-puzzle-bench

Awesome Lists containing this project

README