https://github.com/hinanohart/poolcheck

Combinatorial group testing for verifier-cost-limited reasoning: localize defective reasoning steps from ~k*log(n/k) pooled verifier queries under an explicit asymmetric noise model (simulated headline; pre-alpha).

chain-of-thought group-testing llm process-reward reasoning test-time-compute verifier


# poolcheck

**Combinatorial group testing for verifier-cost-limited reasoning.**

When you verify a chain of reasoning, the expensive part is the *verifier* (an
LLM judge, a process reward model, a test suite). poolcheck spends fewer verifier
calls by **pooling** items into a single query — "does *any* step in this pool
contain a mistake?" — and **decoding** the answers to localize the defective
items, instead of checking each of the `n` items one at a time.

For `k` defectives among `n` items, the number of pooled queries can be
`~k·log(n/k)` instead of `n` — *under an explicit noise model* that this library
makes you state.
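
To get a feel for the scale of `~k·log(n/k)`, the arithmetic can be sketched as below. This is only an illustration: the base-2 logarithm and the helper names `pooled_query_estimate` and `counting_bound` are assumptions, not part of poolcheck, and the result is an order-of-magnitude figure rather than a guaranteed budget.

```python
import math

def pooled_query_estimate(n: int, k: int) -> int:
    """Rough k*log2(n/k) query count, rounded up (base-2 log is an assumption)."""
    return math.ceil(k * math.log2(n / k))

def counting_bound(n: int, k: int) -> int:
    """Fewest yes/no answers that can distinguish all C(n, k) defective sets."""
    return math.ceil(math.log2(math.comb(n, k)))

# Compare both figures with the n queries that per-item checking would spend.
for n, k in [(32, 1), (1024, 4)]:
    print(f"n={n} k={k} estimate={pooled_query_estimate(n, k)} "
          f"counting_bound={counting_bound(n, k)} per_item={n}")
```

For `n=32, k=1` both figures come out to 5 queries, against 32 for per-item checking. Noise and imperfect designs push the practical budget above this floor.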

> **Status: `v0.1.0a1` (pre-alpha).** The decoders are correct and tested; the
> headline numbers come from a **simulated** noise model, not a deployed judge.
> See [Claims](#claims-and-non-claims) and [`s0_report.json`](s0_report.json)
> before relying on anything here.

---

## Install

```bash
pip install poolcheck # core (numpy + scipy only)
pip install 'poolcheck[hf]' # + Hugging Face Inference verifier
```

## Quickstart

```python
import numpy as np
from poolcheck import ItemSet, NoiseChannel, SimulatedJudge, localize

# 8 chain-of-thought steps; the faulty one is step 5 (unknown to the decoder).
items = ItemSet.from_cot([f"step {i}" for i in range(8)])

# A judge that misses real errors 40% of the time and false-alarms 10%
# (a lenient operating point grounded in the LLM-as-judge literature).
noise = NoiseChannel(alpha_fa=0.10, beta_md=0.40)
judge = SimulatedJudge(truth={5}, noise=noise, n=8, rng=np.random.default_rng(0))

result = localize(items, judge, budget=12, noise=noise, k=1, rng=np.random.default_rng(0))
print(result.defectives) # -> localized faulty step(s)
```

Measure the simulated budget → accuracy frontier from the CLI:

```bash
poolcheck frontier --n 32 --k 1 --alpha 0.1 --beta 0.4
```
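
The `alpha_fa` / `beta_md` pair used above describes an asymmetric channel on each pooled answer. A minimal sketch of such a channel, assuming the noise is applied once per pool and independently across queries (the function `noisy_pool_answer` is hypothetical and is not poolcheck's `SimulatedJudge`):

```python
import numpy as np

def noisy_pool_answer(pool, truth, alpha_fa, beta_md, rng):
    """Return True if the simulated judge reports an error somewhere in `pool`."""
    has_defect = any(item in truth for item in pool)
    if has_defect:
        # A real error is reported unless it is missed (probability beta_md).
        return rng.random() >= beta_md
    # A clean pool is flagged only by a false alarm (probability alpha_fa).
    return rng.random() < alpha_fa

rng = np.random.default_rng(0)
# One pooled query over steps 4..6 when step 5 is the faulty one.
print(noisy_pool_answer([4, 5, 6], truth={5}, alpha_fa=0.10, beta_md=0.40, rng=rng))
```

With `alpha_fa=0` and `beta_md=0` this reduces to the noiseless case: a pool answers positive exactly when it contains a defective item.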

## How it works

1. **Design** a pooling (test) matrix: which items go in which pooled query.
   The default is a near-constant-column-weight design (outperforms i.i.d. Bernoulli,
   arXiv:1612.07122); a deterministic Reed-Solomon (Kautz-Singleton) `d`-disjunct
   design is also provided.
2. **Query** the verifier once per pool.
3. **Decode** the defective set. Noiseless: COMP / DD / SCOMP. Noisy: a per-item
   separate-decoding **log-likelihood-ratio** decoder tuned to the channel's
   asymmetric false-alarm (`alpha_fa`) and missed-detect (`beta_md`) rates. A
   threshold on the per-item score trades precision for recall.

The core (`design`, `decode`, `adaptive`, `noise`, `frontier`) never imports a
verifier and never touches the network — it is fully deterministic given a seed.
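
The design, query, and decode steps above can be sketched in plain NumPy. This is an illustration under stated assumptions, not poolcheck's internals: the design below uses an exact constant column weight (the library's default is described as near-constant), the LLR decoder treats every other pool member as independently defective with probability `prevalence`, noise is applied per pool, and all function names are hypothetical.

```python
import numpy as np

def constant_weight_design(n_items, n_tests, weight, rng):
    """Boolean (n_tests, n_items) matrix; each item joins exactly `weight` pools."""
    design = np.zeros((n_tests, n_items), dtype=bool)
    for item in range(n_items):
        pools = rng.choice(n_tests, size=weight, replace=False)
        design[pools, item] = True
    return design

def comp_decode(design, outcomes):
    """Noiseless COMP: clear every item seen in a negative pool, flag the rest."""
    cleared = design[~outcomes].any(axis=0)
    return set(np.flatnonzero(~cleared).tolist())

def llr_scores(design, outcomes, alpha_fa, beta_md, prevalence):
    """Per-item log-likelihood ratio (defective vs. clean), summed over its pools.

    Needs 0 < alpha_fa < 1 and 0 < beta_md < 1 so every logarithm is finite.
    """
    others = np.maximum(design.sum(axis=1) - 1, 0)   # pool members besides the item
    others_clean = (1.0 - prevalence) ** others
    p_pos_if_defective = 1.0 - beta_md
    # If the item is clean, the pool reads positive through a false alarm (no other
    # defective present) or through a detection (some other member is defective).
    p_pos_if_clean = others_clean * alpha_fa + (1.0 - others_clean) * (1.0 - beta_md)
    llr_pos = np.log(p_pos_if_defective / p_pos_if_clean)
    llr_neg = np.log((1.0 - p_pos_if_defective) / (1.0 - p_pos_if_clean))
    per_pool = np.where(outcomes, llr_pos, llr_neg)
    return design.T.astype(float) @ per_pool

rng = np.random.default_rng(0)
design = constant_weight_design(n_items=8, n_tests=6, weight=3, rng=rng)
outcomes = design[:, 5].copy()   # noiseless answers when item 5 is the only defective
print(comp_decode(design, outcomes))   # a superset of {5}; extras mean too few pools
scores = llr_scores(design, outcomes, alpha_fa=0.10, beta_md=0.40, prevalence=1 / 8)
print(np.flatnonzero(scores > 0.0))    # raise the threshold for precision, lower it for recall
```

COMP never misses a defective when answers are noiseless, but it can over-report. The LLR scores stay usable when answers are flipped, at the cost of needing the channel rates up front.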

## Headline (simulated)

_Generated by `scripts/measure.py` (seed=0, n=32, k=1, 300 trials). Simulated noise model only — not a deployed-judge benchmark._

**noiseless** (alpha_fa=0.0, beta_md=0.0)

Per-item baseline: **32 queries**, F1=1.000 (95% CI [1.000, 1.000])

| pooled queries | F1 | 95% CI | vs per-item baseline |
|---:|---:|:---:|:---:|
| 5 | 0.027 | [0.010, 0.047] | worse |
| 8 | 0.323 | [0.270, 0.377] | worse |
| 12 | 0.923 | [0.893, 0.950] | worse |
| 16 | 0.990 | [0.977, 1.000] | matches (fewer queries) |
| 24 | 1.000 | [1.000, 1.000] | matches (fewer queries) |

**lenient_judge** (alpha_fa=0.1, beta_md=0.4)

Per-item baseline: **32 queries**, F1=0.264 (95% CI [0.236, 0.292])

| pooled queries | F1 | 95% CI | vs per-item baseline |
|---:|---:|:---:|:---:|
| 5 | 0.100 | [0.067, 0.133] | worse |
| 8 | 0.130 | [0.093, 0.170] | worse |
| 12 | 0.270 | [0.220, 0.323] | matches (fewer queries) |
| 16 | 0.387 | [0.330, 0.443] | **better** (fewer queries) |
| 24 | 0.580 | [0.523, 0.637] | **better** (fewer queries) |

_"vs per-item baseline" compares the pooled F1 bootstrap CI to the baseline CI (which uses n per-item queries): non-overlapping above = **better**, overlapping = statistically indistinguishable (**matches**), non-overlapping below = worse. Pooled always uses fewer queries than the baseline._
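
The comparison rule in the note above can be written out as a small function. This is a sketch with a hypothetical name (`compare_to_baseline`); it assumes intervals that merely touch at an endpoint count as overlapping, which is consistent with the noiseless 16-query row being labeled "matches".

```python
def compare_to_baseline(pooled_ci, baseline_ci):
    """Label a pooled F1 interval against the per-item baseline interval."""
    pooled_lo, pooled_hi = pooled_ci
    base_lo, base_hi = baseline_ci
    if pooled_lo > base_hi:      # pooled interval lies entirely above the baseline
        return "better"
    if pooled_hi < base_lo:      # pooled interval lies entirely below the baseline
        return "worse"
    return "matches"             # intervals overlap: statistically indistinguishable

# lenient_judge rows from the table above, baseline CI [0.236, 0.292]
baseline = (0.236, 0.292)
for budget, ci in [(8, (0.093, 0.170)), (12, (0.220, 0.323)), (16, (0.330, 0.443))]:
    print(budget, compare_to_baseline(ci, baseline))
```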

## Claims and non-claims

**CLAIM.** poolcheck decodes a defective set (e.g. the first faulty CoT step)
from `~k·log(n/k)` pooled verifier queries instead of `n` per-item queries,
**under an explicit asymmetric false-alarm / missed-detect noise model**. The
headline budget → accuracy frontier is computed with a *simulated* judge oracle
(deterministic, no API key, reproducible via `scripts/measure.py`); its noise
parameters are grounded in published single-item LLM-judge error rates
(see [`s0_report.json`](s0_report.json)).

**NON-CLAIMS.**

- **No speedup is claimed for any specific deployed model.** Every headline
number is from the simulated noise model above, not a live-judge benchmark.
- **The "1-good-among-N" regime is explicitly out of scope.** Picking the single
good answer out of `N` candidates is the `k = N-1` regime, where group testing
provably **cannot** beat individual testing (`Θ(n)` tests when `k = ω(n/log n)`,
arXiv:2006.01325, arXiv:2106.06878). `ItemSet.from_candidates` is a
`k << N` shortlist-narrowing convenience only — it is **not** the headline.
- **poolcheck is not a process reward model.** It does not *score* steps; it
*localizes* defectives. `ItemSet.priors` is an unused experimental seam (off by
default); supplying priors lets a downstream PRM bias the decoder, but that path
is unbenchmarked here.
- **The pooled-query premise is unverified in this release.** The simulated noise
uses *single-item* literature rates and assumes they also hold for *pooled*
queries. Whether pooling degrades a real judge's FA/MD (residual risk #1) is
**OPEN** — see below.

## Did pooling break my judge? (`s0-gate`)

The one empirical question poolcheck cannot answer for you: does asking your judge
"are *any* of these N steps wrong?" make it noticeably worse than asking about one
step at a time? Measure it on your own judge:

```bash
poolcheck s0-gate --cases your_cases.json --verifier hf:Qwen/Qwen2.5-7B-Instruct \
  --pool-sizes 4 8
```

`PASS` (pooled FA/MD ≤ 1.5× single, with bootstrap CIs) means the simulated
advantage should transfer. `FAIL` means it may not. This build ships the tool but
**did not run it against a live judge** (no inference token was available in the
build environment); see [`s0_report.json`](s0_report.json).
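
The ratio check behind `PASS` / `FAIL` can be sketched as follows. This is a simplification under loud assumptions: it compares point estimates only and ignores the bootstrap CIs the real gate reports, the `(has_error, judge_flagged)` case layout is invented for illustration and is not the `your_cases.json` schema, and the function names are hypothetical.

```python
def fa_md_rates(cases):
    """False-alarm and missed-detect rates from a list of (has_error, judge_flagged) pairs.

    Assumes the list contains at least one clean and one faulty case.
    """
    clean = [flagged for has_error, flagged in cases if not has_error]
    faulty = [flagged for has_error, flagged in cases if has_error]
    fa = sum(clean) / len(clean)                       # clean cases wrongly flagged
    md = sum(not flagged for flagged in faulty) / len(faulty)   # real errors not flagged
    return fa, md

def gate(single, pooled, max_ratio=1.5):
    """True (PASS) if both pooled rates stay within max_ratio x the single-item rates."""
    (fa_single, md_single), (fa_pooled, md_pooled) = single, pooled
    return fa_pooled <= max_ratio * fa_single and md_pooled <= max_ratio * md_single

# Toy single-item measurements: 10 clean steps (1 flagged), 10 faulty steps (4 missed).
single_cases = ([(False, False)] * 9 + [(False, True)]
                + [(True, True)] * 6 + [(True, False)] * 4)
single = fa_md_rates(single_cases)
print(single, gate(single, pooled=(0.12, 0.50)), gate(single, pooled=(0.20, 0.40)))
```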

## Public API

`Verifier` · `ItemSet` · `localize` · `Strategy` · `frontier`
(plus supporting types `NoiseChannel`, `SimulatedJudge`, `DeterministicJudge`,
`LocalizeResult`).

## License

[Apache-2.0](LICENSE). © 2026 the poolcheck authors.