https://github.com/hinanohart/poolcheck

Combinatorial group testing for verifier-cost-limited reasoning: localize defective reasoning steps from ~k*log(n/k) pooled verifier queries under an explicit asymmetric noise model (simulated headline; pre-alpha).



Host: GitHub
URL: https://github.com/hinanohart/poolcheck
Owner: hinanohart
License: apache-2.0
Created: 2026-05-29T03:58:34.000Z (29 days ago)
Default Branch: main
Last Pushed: 2026-05-29T04:08:30.000Z (29 days ago)
Last Synced: 2026-05-29T06:07:11.854Z (29 days ago)
Topics: chain-of-thought, group-testing, llm, process-reward, reasoning, test-time-compute, verifier
Language: Python
Size: 221 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE


# poolcheck

**Combinatorial group testing for verifier-cost-limited reasoning.**

When you verify a chain of reasoning, the expensive part is the *verifier* (an
LLM judge, a process reward model, a test suite). poolcheck spends fewer verifier
calls by **pooling** items into a single query ("does *any* step in this pool
contain a mistake?") and **decoding** the answers to localize the defective
items, instead of checking each of the `n` items one at a time.

For `k` defectives among `n` items, the number of pooled queries can be
`~k·log(n/k)` instead of `n`, *under an explicit noise model* that this library
makes you state.
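To make the scaling concrete, the short sketch below evaluates `k·log(n/k)` for two problem sizes. It assumes a base-2 logarithm and ignores constant factors, neither of which this README specifies, and `pooled_query_scale` is an illustrative helper, not part of poolcheck.

```python
import math

def pooled_query_scale(n: int, k: int) -> float:
    """Order-of-magnitude query count k * log2(n / k); constants are ignored."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return k * math.log2(n / k)

for n, k in [(32, 1), (1024, 4)]:
    scale = pooled_query_scale(n, k)
    print(f"n={n}, k={k}: ~{scale:.0f} pooled queries vs {n} per-item queries")
```

This is a scaling law, not a sufficient budget: in the simulated headline for `n=32`, `k=1`, five pooled queries are far from enough, and accuracy only approaches the per-item baseline at several times that number.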

> **Status: `v0.1.0a1` (pre-alpha).** The decoders are correct and tested; the
> headline numbers come from a **simulated** noise model, not a deployed judge.
> See [Claims](#claims-and-non-claims) and [`s0_report.json`](s0_report.json)
> before relying on anything here.

---

## Install

```bash
pip install poolcheck            # core (numpy + scipy only)
pip install 'poolcheck[hf]'      # + Hugging Face Inference verifier
```

## Quickstart

```python
import numpy as np
from poolcheck import ItemSet, NoiseChannel, SimulatedJudge, localize

# 8 chain-of-thought steps; the faulty one is step 5 (unknown to the decoder).
items = ItemSet.from_cot([f"step {i}" for i in range(8)])

# A judge that misses real errors 40% of the time and false-alarms 10%
# (a lenient operating point grounded in the LLM-as-judge literature).
noise = NoiseChannel(alpha_fa=0.10, beta_md=0.40)
judge = SimulatedJudge(truth={5}, noise=noise, n=8, rng=np.random.default_rng(0))

result = localize(items, judge, budget=12, noise=noise, k=1, rng=np.random.default_rng(0))
print(result.defectives)   # -> localized faulty step(s)
```

Measure the simulated budget → accuracy frontier from the CLI:

```bash
poolcheck frontier --n 32 --k 1 --alpha 0.1 --beta 0.4
```

## How it works

1. **Design** a pooling (test) matrix: which items go in which pooled query.
   Default is a near-constant-column-weight design (outperforms i.i.d. Bernoulli,
   arXiv:1612.07122); a deterministic Reed-Solomon (Kautz-Singleton) `d`-disjunct
   design is also provided.
2. **Query** the verifier once per pool.
3. **Decode** the defective set. Noiseless: COMP / DD / SCOMP. Noisy: a per-item
   separate-decoding **log-likelihood-ratio** decoder tuned to the channel's
   asymmetric false-alarm (`alpha_fa`) and missed-detect (`beta_md`) rates. A
   threshold trades precision for recall.
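The three steps can be sketched end to end for the noiseless case with plain NumPy. This is a minimal illustration, not poolcheck's implementation: `near_constant_weight_design` and `comp_decode` are hypothetical names, and the design is assumed to draw each item's pools with replacement, so that duplicates collapse and column weights are only near-constant.

```python
import numpy as np

def near_constant_weight_design(n_items, n_tests, weight, rng):
    """Boolean (n_tests, n_items) matrix; each item joins up to `weight` pools.

    Pools are drawn with replacement, so an item may land in fewer than
    `weight` distinct pools (assumed reading of "near-constant column weight").
    """
    design = np.zeros((n_tests, n_items), dtype=bool)
    for item in range(n_items):
        rows = rng.integers(0, n_tests, size=weight)
        design[rows, item] = True
    return design

def comp_decode(design, outcomes):
    """COMP: clear every item that appears in a negative pool; flag the rest."""
    in_negative_pool = design[~outcomes].any(axis=0)
    return set(np.flatnonzero(~in_negative_pool).tolist())

rng = np.random.default_rng(0)
truth = {5}  # hidden defective item

# Step 1: design the pools.
design = near_constant_weight_design(n_items=8, n_tests=6, weight=3, rng=rng)

# Step 2: a noiseless verifier says "positive" iff the pool holds a defective.
outcomes = design[:, sorted(truth)].any(axis=1)

# Step 3: decode.
estimate = comp_decode(design, outcomes)
print(estimate)
```

Without noise, COMP never clears a true defective, so `truth` is always a subset of `estimate`. Any extra items it flags are false positives, which the DD and SCOMP decoders mentioned above are designed to reduce.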

The core (`design`, `decode`, `adaptive`, `noise`, `frontier`) never imports a
verifier and never touches the network; it is fully deterministic given a seed.
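For the noisy branch of step 3, the idea behind a per-item log-likelihood-ratio decoder can be sketched as follows. This is a simplified model written for illustration, not poolcheck's decoder: it scores each item only on the pools that contain it and assumes every other item is independently defective with prior `k/n`. The names `llr_scores` and `llr_decode` are hypothetical.

```python
import numpy as np

def llr_scores(design, outcomes, alpha_fa, beta_md, k):
    """Per-item log-likelihood ratio of "defective" vs "clean".

    Assumes 0 < alpha_fa < 1 and 0 < beta_md < 1 so every log is finite.
    """
    design = np.asarray(design, dtype=bool)
    outcomes = np.asarray(outcomes, dtype=bool)
    n_tests, n_items = design.shape
    pool_size = design.sum(axis=1)

    # Chance a pool is truly positive because of some *other* member.
    others_hit = 1.0 - (1.0 - k / n_items) ** np.maximum(pool_size - 1, 0)

    # Probability of a positive answer under each hypothesis about the item.
    p_pos_if_defective = 1.0 - beta_md
    p_pos_if_clean = others_hit * (1.0 - beta_md) + (1.0 - others_hit) * alpha_fa

    llr_pos = np.log(p_pos_if_defective / p_pos_if_clean)
    llr_neg = np.log((1.0 - p_pos_if_defective) / (1.0 - p_pos_if_clean))
    per_test = np.where(outcomes, llr_pos, llr_neg)

    # Sum each item's evidence over the pools it belongs to.
    return design.T.astype(float) @ per_test

def llr_decode(scores, threshold=0.0):
    """Flag items whose score exceeds the threshold."""
    return set(np.flatnonzero(scores > threshold).tolist())

# Hand-made 4-pool, 4-item example: item 0 sits in both positive pools.
design = np.array([[1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])
outcomes = np.array([1, 1, 0, 0])

scores = llr_scores(design, outcomes, alpha_fa=0.10, beta_md=0.40, k=1)
print(llr_decode(scores, threshold=0.0))
print(llr_decode(scores, threshold=1.0))
```

In this toy case the looser threshold flags items 0, 2 and 3, while the stricter one keeps only item 0, which is the precision-for-recall trade the threshold controls.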

## Headline (simulated)

_Generated by `scripts/measure.py` (seed=0, n=32, k=1, 300 trials). Simulated noise model only — not a deployed-judge benchmark._

**noiseless** (alpha_fa=0.0, beta_md=0.0)  

Per-item baseline: **32 queries**, F1=1.000 (95% CI [1.000, 1.000])

| pooled queries | F1 | 95% CI | vs per-item baseline |
|---:|---:|:---:|:---:|
| 5 | 0.027 | [0.010, 0.047] | worse |
| 8 | 0.323 | [0.270, 0.377] | worse |
| 12 | 0.923 | [0.893, 0.950] | worse |
| 16 | 0.990 | [0.977, 1.000] | matches (fewer queries) |
| 24 | 1.000 | [1.000, 1.000] | matches (fewer queries) |

**lenient_judge** (alpha_fa=0.1, beta_md=0.4)  

Per-item baseline: **32 queries**, F1=0.264 (95% CI [0.236, 0.292])

| pooled queries | F1 | 95% CI | vs per-item baseline |
|---:|---:|:---:|:---:|
| 5 | 0.100 | [0.067, 0.133] | worse |
| 8 | 0.130 | [0.093, 0.170] | worse |
| 12 | 0.270 | [0.220, 0.323] | matches (fewer queries) |
| 16 | 0.387 | [0.330, 0.443] | **better** (fewer queries) |
| 24 | 0.580 | [0.523, 0.637] | **better** (fewer queries) |

_"vs per-item baseline" compares the pooled F1 bootstrap CI to the baseline CI (which uses n per-item queries): non-overlapping above = **better**, overlapping = statistically indistinguishable (**matches**), non-overlapping below = worse. Pooled always uses fewer queries than the baseline._
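The comparison rule in the note above is simple enough to state as code. The sketch below is an illustration, with `compare_to_baseline` as a hypothetical name; it assumes that intervals which merely touch at an endpoint count as overlapping, which is consistent with the noiseless 16-query row.

```python
def compare_to_baseline(pooled_ci, baseline_ci):
    """Classify a pooled F1 confidence interval against the baseline interval."""
    pooled_lo, pooled_hi = pooled_ci
    base_lo, base_hi = baseline_ci
    if pooled_lo > base_hi:
        return "better"   # entirely above the baseline interval
    if pooled_hi < base_lo:
        return "worse"    # entirely below the baseline interval
    return "matches"      # intervals overlap: statistically indistinguishable

# Rows taken from the lenient_judge table (baseline CI [0.236, 0.292]).
baseline = (0.236, 0.292)
for budget, ci in [(8, (0.093, 0.170)), (12, (0.220, 0.323)), (16, (0.330, 0.443))]:
    print(budget, compare_to_baseline(ci, baseline))
```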

## Claims and non-claims

**CLAIM.** poolcheck decodes a defective set (e.g. the first faulty CoT step)
from `~k·log(n/k)` pooled verifier queries instead of `n` per-item queries,
**under an explicit asymmetric false-alarm / missed-detect noise model**. The
headline budget → accuracy frontier is computed with a *simulated* judge oracle
(deterministic, no API key, reproducible via `scripts/measure.py`); its noise
parameters are grounded in published single-item LLM-judge error rates
(see [`s0_report.json`](s0_report.json)).

**NON-CLAIMS.**

- **No speedup is claimed for any specific deployed model.** Every headline
  number is from the simulated noise model above, not a live-judge benchmark.
- **The "1-good-among-N" regime is explicitly out of scope.** Picking the single
  good answer out of `N` candidates is the `k = N-1` regime, where group testing
  provably **cannot** beat individual testing (`Θ(n)` tests when `k = ω(n/log n)`,
  arXiv:2006.01325, arXiv:2106.06878). `ItemSet.from_candidates` is a
  `k << N` shortlist-narrowing convenience only; it is **not** the headline.
- **poolcheck is not a process reward model.** It does not *score* steps; it
  *localizes* defectives. `ItemSet.priors` is an experimental seam that is off by
  default; supplying priors would let a downstream PRM bias the decoder, but that
  path is unbenchmarked here.
- **The pooled-query premise is unverified in this release.** The simulated noise
  uses *single-item* literature rates and assumes they also hold for *pooled*
  queries. Whether pooling degrades a real judge's FA/MD (residual risk #1) is
  **OPEN**; see below.

## Did pooling break my judge? (`s0-gate`)

The one empirical question poolcheck cannot answer for you: does asking your judge
"are *any* of these N steps wrong?" make it noticeably worse than asking about one
step at a time? Measure it on your own judge:

```bash
poolcheck s0-gate --cases your_cases.json --verifier hf:Qwen/Qwen2.5-7B-Instruct \
    --pool-sizes 4 8
```

`PASS` (pooled FA/MD ≤ 1.5× single, with bootstrap CIs) means the simulated
advantage should transfer. `FAIL` means it may not. This build ships the tool but
**did not run it against a live judge** (no inference token was available in the
build environment); see [`s0_report.json`](s0_report.json).
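The pass criterion can be read as a ratio check on the two error rates. The sketch below is one possible reading, applied to point estimates only: it assumes that both the false-alarm and the missed-detect rate must stay within 1.5× of their single-item values, and it leaves out the bootstrap confidence intervals, whose exact role in the tool's decision this README does not describe. The function name and the example rates are hypothetical.

```python
def s0_gate(single, pooled, max_ratio=1.5):
    """Return "PASS" if every pooled error rate is within max_ratio of single.

    `single` and `pooled` are dicts with "fa" (false-alarm) and "md"
    (missed-detect) rates in [0, 1]. Point estimates only; no CIs.
    """
    within = all(pooled[key] <= max_ratio * single[key] for key in ("fa", "md"))
    return "PASS" if within else "FAIL"

# Hypothetical measurements, not results from any real judge.
single_item = {"fa": 0.10, "md": 0.40}
print(s0_gate(single_item, {"fa": 0.12, "md": 0.50}))  # both within 1.5x
print(s0_gate(single_item, {"fa": 0.20, "md": 0.45}))  # false alarms doubled
```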

## Public API

`Verifier` · `ItemSet` · `localize` · `Strategy` · `frontier`
(plus supporting types `NoiseChannel`, `SimulatedJudge`, `DeterministicJudge`,
`LocalizeResult`).

## License

[Apache-2.0](LICENSE). © 2026 the poolcheck authors.
