https://github.com/hinanohart/poolcheck
Combinatorial group testing for verifier-cost-limited reasoning: localize defective reasoning steps from ~k*log(n/k) pooled verifier queries under an explicit asymmetric noise model (simulated headline; pre-alpha).
chain-of-thought group-testing llm process-reward reasoning test-time-compute verifier
- Host: GitHub
- URL: https://github.com/hinanohart/poolcheck
- Owner: hinanohart
- License: apache-2.0
- Created: 2026-05-29T03:58:34.000Z (5 days ago)
- Default Branch: main
- Last Pushed: 2026-05-29T04:08:30.000Z (5 days ago)
- Last Synced: 2026-05-29T06:07:11.854Z (5 days ago)
- Topics: chain-of-thought, group-testing, llm, process-reward, reasoning, test-time-compute, verifier
- Language: Python
- Size: 221 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
# poolcheck
**Combinatorial group testing for verifier-cost-limited reasoning.**
When you verify a chain of reasoning, the expensive part is the *verifier* (an
LLM judge, a process reward model, a test suite). poolcheck spends fewer verifier
calls by **pooling** items into a single query — "does *any* step in this pool
contain a mistake?" — and **decoding** the answers to localize the defective
items, instead of checking each of the `n` items one at a time.
For `k` defectives among `n` items, the number of pooled queries can be
`~k·log(n/k)` instead of `n` — *under an explicit noise model* that this library
makes you state.
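To get a feel for the scale, the sketch below evaluates `k·log(n/k)` for a few problem sizes. It assumes a base-2 logarithm (the base is not stated here) and ignores the constant factor hidden by `~`. The function names `pooled_query_scale` and `counting_bound` are illustrative and are not part of poolcheck.

```python
import math


def pooled_query_scale(n: int, k: int) -> float:
    """k * log2(n / k): the scaling term only, with no constant factor.

    Assumption: base-2 logarithm, 1 <= k <= n.
    """
    return k * math.log2(n / k)


def counting_bound(n: int, k: int) -> int:
    """ceil(log2(C(n, k))): with binary answers, at least this many
    noiseless queries are needed to tell all C(n, k) defective sets apart."""
    return math.ceil(math.log2(math.comb(n, k)))


if __name__ == "__main__":
    for n, k in [(32, 1), (32, 2), (128, 4), (1024, 8)]:
        print(
            f"n={n:5d} k={k:2d}  per-item={n:5d}  "
            f"k*log2(n/k)={pooled_query_scale(n, k):6.1f}  "
            f"counting bound={counting_bound(n, k):3d}"
        )
```

For `n=32, k=1` both quantities equal 5, which is the smallest budget in the simulated headline tables. The low F1 reported there suggests that the constant factor and the noise level matter a great deal in practice.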
> **Status: `v0.1.0a1` (pre-alpha).** The decoders are correct and tested; the
> headline numbers come from a **simulated** noise model, not a deployed judge.
> See [Claims](#claims-and-non-claims) and [`s0_report.json`](s0_report.json)
> before relying on anything here.
---
## Install
```bash
pip install poolcheck # core (numpy + scipy only)
pip install 'poolcheck[hf]' # + Hugging Face Inference verifier
```
## Quickstart
```python
import numpy as np
from poolcheck import ItemSet, NoiseChannel, SimulatedJudge, localize
# 8 chain-of-thought steps; the faulty one is step 5 (unknown to the decoder).
items = ItemSet.from_cot([f"step {i}" for i in range(8)])
# A judge that misses real errors 40% of the time and false-alarms 10%
# (a lenient operating point grounded in the LLM-as-judge literature).
noise = NoiseChannel(alpha_fa=0.10, beta_md=0.40)
judge = SimulatedJudge(truth={5}, noise=noise, n=8, rng=np.random.default_rng(0))
result = localize(items, judge, budget=12, noise=noise, k=1, rng=np.random.default_rng(0))
print(result.defectives) # -> localized faulty step(s)
```
Measure the simulated budget → accuracy frontier from the CLI:
```bash
poolcheck frontier --n 32 --k 1 --alpha 0.1 --beta 0.4
```
## How it works
1. **Design** a pooling (test) matrix: which items go in which pooled query.
The default is a near-constant-column-weight design, which outperforms i.i.d.
Bernoulli designs (arXiv:1612.07122); a deterministic Reed-Solomon
(Kautz-Singleton) `d`-disjunct design is also provided.
2. **Query** the verifier once per pool.
3. **Decode** the defective set. Noiseless: COMP / DD / SCOMP. Noisy: a per-item
separate-decoding **log-likelihood-ratio** decoder tuned to the channel's
asymmetric false-alarm (`alpha_fa`) and missed-detect (`beta_md`) rates. A
threshold trades precision for recall.
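The three steps above can be sketched end to end in the noiseless case. This is a minimal illustration using textbook COMP decoding and a design in which each item joins `L` pools drawn with replacement. It is not poolcheck's implementation, and the names `near_constant_design`, `query_noiseless`, and `comp_decode`, as well as the choice `L=4`, are assumptions made for this example.

```python
import numpy as np


def near_constant_design(n: int, T: int, L: int, rng: np.random.Generator) -> np.ndarray:
    """T x n boolean matrix; A[t, i] is True if item i is in pool t.

    Each item is placed in L pools chosen with replacement, so column
    weights are at most L ("near-constant").
    """
    A = np.zeros((T, n), dtype=bool)
    for i in range(n):
        A[rng.integers(0, T, size=L), i] = True
    return A


def query_noiseless(A: np.ndarray, truth: set) -> np.ndarray:
    """A pool answers True iff it contains at least one defective item."""
    x = np.zeros(A.shape[1], dtype=bool)
    x[list(truth)] = True
    return (A & x).any(axis=1)


def comp_decode(A: np.ndarray, y: np.ndarray) -> set:
    """COMP: any item seen in a negative pool is cleared; the rest are flagged."""
    cleared = A[~y].any(axis=0)
    return set(np.flatnonzero(~cleared).tolist())


rng = np.random.default_rng(0)
truth = {5}
A = near_constant_design(n=8, T=12, L=4, rng=rng)
y = query_noiseless(A, truth)
decoded = comp_decode(A, y)
print(sorted(decoded))
```

With noiseless answers COMP never misses a true defective, but it can flag extra items when the budget is too small to clear them.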
The core (`design`, `decode`, `adaptive`, `noise`, `frontier`) never imports a
verifier and never touches the network — it is fully deterministic given a seed.
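For intuition about how noisy decoding uses `alpha_fa` and `beta_md`, here is a minimal sketch of maximum-likelihood decoding for the single-defective case (`k = 1`). It is a simpler stand-in for the per-item LLR decoder described above, not poolcheck's code. It assumes each pooled answer is flipped independently, with `0 < alpha_fa < 1` and `0 < beta_md < 1`, and the function name `ml_decode_single` is hypothetical.

```python
import numpy as np


def ml_decode_single(A, y, alpha_fa: float, beta_md: float):
    """Return (best_item, log_likelihoods) assuming exactly one defective.

    A: T x n boolean pooling matrix; y: length-T boolean judge answers.
    If item i is the defective, pool t is truly positive iff A[t, i].
    """
    A = np.asarray(A, dtype=bool)
    y = np.asarray(y, dtype=bool)
    # log P(y_t | pool truly positive): says "error" w.p. 1 - beta_md.
    log_p_pos = np.where(y, np.log1p(-beta_md), np.log(beta_md))
    # log P(y_t | pool truly negative): says "error" w.p. alpha_fa.
    log_p_neg = np.where(y, np.log(alpha_fa), np.log1p(-alpha_fa))
    # Sum per-test log-likelihoods for each candidate item.
    ll = np.where(A, log_p_pos[:, None], log_p_neg[:, None]).sum(axis=0)
    return int(np.argmax(ll)), ll


# Toy design: 3 pools that read the bits of the item index, repeated 3 times.
codes = (np.arange(8)[None, :] >> np.arange(3)[:, None]) & 1
A = np.tile(codes, (3, 1)).astype(bool)

# Answers for defective item 5, with one missed detection injected on pool 0.
y = A[:, 5].copy()
y[0] = False

best, ll = ml_decode_single(A, y, alpha_fa=0.10, beta_md=0.40)
print(best, np.round(ll, 2))
```

Because a missed detection is far more likely than a false alarm under this channel, a single unexpected "no error" answer costs the true item little likelihood, while each unexpected "error" answer costs a wrong candidate a lot.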
## Headline (simulated)
_Generated by `scripts/measure.py` (seed=0, n=32, k=1, 300 trials). Simulated noise model only — not a deployed-judge benchmark._
**noiseless** (alpha_fa=0.0, beta_md=0.0)
Per-item baseline: **32 queries**, F1=1.000 (95% CI [1.000, 1.000])
| pooled queries | F1 | 95% CI | vs per-item baseline |
|---:|---:|:---:|:---:|
| 5 | 0.027 | [0.010, 0.047] | worse |
| 8 | 0.323 | [0.270, 0.377] | worse |
| 12 | 0.923 | [0.893, 0.950] | worse |
| 16 | 0.990 | [0.977, 1.000] | matches (fewer queries) |
| 24 | 1.000 | [1.000, 1.000] | matches (fewer queries) |
**lenient_judge** (alpha_fa=0.1, beta_md=0.4)
Per-item baseline: **32 queries**, F1=0.264 (95% CI [0.236, 0.292])
| pooled queries | F1 | 95% CI | vs per-item baseline |
|---:|---:|:---:|:---:|
| 5 | 0.100 | [0.067, 0.133] | worse |
| 8 | 0.130 | [0.093, 0.170] | worse |
| 12 | 0.270 | [0.220, 0.323] | matches (fewer queries) |
| 16 | 0.387 | [0.330, 0.443] | **better** (fewer queries) |
| 24 | 0.580 | [0.523, 0.637] | **better** (fewer queries) |
_"vs per-item baseline" compares the pooled F1 bootstrap CI to the baseline CI (which uses n per-item queries): non-overlapping above = **better**, overlapping = statistically indistinguishable (**matches**), non-overlapping below = worse. Pooled always uses fewer queries than the baseline._
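The comparison rule in the note above can be written as a small function. This is a sketch of the stated rule only, and it treats intervals that touch at an endpoint as overlapping, which matches the table rows. The name `compare_to_baseline` is illustrative and is not taken from `scripts/measure.py`.

```python
def compare_to_baseline(pooled_ci, baseline_ci) -> str:
    """Classify a pooled F1 CI against the per-item baseline CI.

    Each CI is a (low, high) pair. Intervals that share an endpoint are
    treated as overlapping.
    """
    pooled_lo, pooled_hi = pooled_ci
    base_lo, base_hi = baseline_ci
    if pooled_lo > base_hi:
        return "better"
    if pooled_hi < base_lo:
        return "worse"
    return "matches"


# Rows from the lenient_judge table (baseline CI [0.236, 0.292]).
baseline = (0.236, 0.292)
for budget, ci in [(8, (0.093, 0.170)), (12, (0.220, 0.323)), (16, (0.330, 0.443))]:
    print(budget, compare_to_baseline(ci, baseline))
```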
## Claims and non-claims
**CLAIM.** poolcheck decodes a defective set (e.g. the first faulty CoT step)
from `~k·log(n/k)` pooled verifier queries instead of `n` per-item queries,
**under an explicit asymmetric false-alarm / missed-detect noise model**. The
headline budget → accuracy frontier is computed with a *simulated* judge oracle
(deterministic, no API key, reproducible via `scripts/measure.py`); its noise
parameters are grounded in published single-item LLM-judge error rates
(see [`s0_report.json`](s0_report.json)).
**NON-CLAIMS.**
- **No speedup is claimed for any specific deployed model.** Every headline
number is from the simulated noise model above, not a live-judge benchmark.
- **The "1-good-among-N" regime is explicitly out of scope.** Picking the single
good answer out of `N` candidates is the `k = N-1` regime, where group testing
provably **cannot** beat individual testing (`Θ(n)` tests when `k = ω(n/log n)`,
arXiv:2006.01325, arXiv:2106.06878). `ItemSet.from_candidates` is a
`k << N` shortlist-narrowing convenience only — it is **not** the headline.
- **poolcheck is not a process reward model.** It does not *score* steps; it
*localizes* defectives. `ItemSet.priors` is an unused experimental seam (off by
default); supplying priors lets a downstream PRM bias the decoder, but that path
is unbenchmarked here.
- **The pooled-query premise is unverified in this release.** The simulated noise
uses *single-item* literature rates and assumes they also hold for *pooled*
queries. Whether pooling degrades a real judge's false-alarm and missed-detect
rates (residual risk #1) is **OPEN**; the `s0-gate` section below describes how
to measure it.
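The "1-good-among-N" non-claim can be illustrated by brute force. With `N - 1` defectives and a noiseless judge, every pool of two or more items contains a defective, so only single-item queries can ever answer "no error". The sketch below enumerates all pools for a small `N`. The values `N = 6` and `good = 2` are arbitrary, and the helper `pool_is_positive` is hypothetical.

```python
from itertools import combinations


def pool_is_positive(pool, defectives) -> bool:
    """Noiseless pooled answer: True iff any item in the pool is defective."""
    return any(i in defectives for i in pool)


N = 6
good = 2  # the single non-defective item
defectives = set(range(N)) - {good}

# Collect every non-empty pool that would come back negative.
negative_pools = [
    pool
    for size in range(1, N + 1)
    for pool in combinations(range(N), size)
    if not pool_is_positive(pool, defectives)
]
print(negative_pools)
```

The only negative pool is the singleton containing the good item, so pooling buys nothing in this regime and the search reduces to per-item testing.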
## Did pooling break my judge? (`s0-gate`)
The one empirical question poolcheck cannot answer for you: does asking your judge
"are *any* of these N steps wrong?" make it noticeably worse than asking about one
step at a time? Measure it on your own judge:
```bash
poolcheck s0-gate --cases your_cases.json --verifier hf:Qwen/Qwen2.5-7B-Instruct \
--pool-sizes 4 8
```
`PASS` means the pooled false-alarm and missed-detect rates are each at most
1.5× the single-item rates (with bootstrap CIs), so the simulated advantage
should transfer. `FAIL` means it may not. This build ships the tool but
**did not run it against a live judge** (no inference token was available in the
build environment); see [`s0_report.json`](s0_report.json).
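As a rough illustration of the 1.5× criterion, the sketch below applies it to point estimates only. It is not the `s0-gate` implementation: it ignores the bootstrap CIs, it assumes the threshold applies to the false-alarm and missed-detect rates separately, and the function name and dictionary keys are hypothetical.

```python
def gate_point_estimate(single: dict, pooled: dict, factor: float = 1.5) -> str:
    """Return "PASS" if each pooled rate is within factor x the single-item rate.

    single / pooled: {"fa": false-alarm rate, "md": missed-detect rate}.
    Assumption: both rates must pass; confidence intervals are ignored.
    """
    ok = all(pooled[key] <= factor * single[key] for key in ("fa", "md"))
    return "PASS" if ok else "FAIL"


# Hypothetical measured rates, chosen only to exercise both outcomes.
single = {"fa": 0.10, "md": 0.40}
print(gate_point_estimate(single, {"fa": 0.12, "md": 0.50}))
print(gate_point_estimate(single, {"fa": 0.20, "md": 0.50}))
```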
## Public API
`Verifier` · `ItemSet` · `localize` · `Strategy` · `frontier`
(plus supporting types `NoiseChannel`, `SimulatedJudge`, `DeterministicJudge`,
`LocalizeResult`).
## License
[Apache-2.0](LICENSE). © 2026 the poolcheck authors.