https://github.com/denial-web/hard-needle
Semantically hard multi-needle long-context data generator. Stop testing LLMs with random-password needles.
benchmark llm llm-evaluation long-context needle-in-a-haystack python rag synthetic-data
- Host: GitHub
- URL: https://github.com/denial-web/hard-needle
- Owner: denial-web
- License: mit
- Created: 2026-04-29T04:28:06.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-04-29T04:28:29.000Z (about 1 month ago)
- Last Synced: 2026-04-30T00:13:40.944Z (about 1 month ago)
- Topics: benchmark, llm, llm-evaluation, long-context, needle-in-a-haystack, python, rag, synthetic-data
- Language: Python
- Homepage: https://pypi.org/project/hard-needle/
- Size: 28.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# hard-needle
**Stop testing long-context LLMs with random passwords.** `hard-needle` generates haystacks where multiple confusable facts share the same template, so the model has to actually disambiguate by entity instead of pattern-matching a unique token.
```bash
pip install hard-needle
hard-needle-generate --num-examples 200 --num-needles 6 --ctx-chars 12000 --output eval.jsonl
```
## What you get
Each example places **multiple semantically similar facts** in a haystack and asks about one of them:
```
Document:
... [haystack of distractor sentences] ...
The special project code for Aurora is ATL-7704.
... [more distractors] ...
The special project code for Aegis is ANV-5503.
... [more distractors] ...
The special project code for Apollo is ATL-7701.
... [more distractors] ...
Question: What is the special project code for Apollo?
Answer: ATL-7701
```
A model that "just remembers there was a project code in the document" gets it wrong. It has to **bind the right code to the right project name**.
## Why it matters
Most public Needle-in-a-Haystack benchmarks insert one obvious sentence (`The magic password is 7XQ32B`) into Paul Graham essays. Modern LLMs ace this with shallow attention because the needle has a unique surface form. Real long-context tasks (reading meeting notes, parsing legal documents, multi-hop QA) almost never look like that.
`hard-needle` gives you:
| | Standard needle-in-a-haystack | `hard-needle` |
|---|---|---|
| Distractors | Generic prose | Semantically similar facts (multiple project codes, multiple deadlines, etc.) |
| Disambiguation | None — needle is unique | Required — model must bind value to entity |
| Eval pool isolation | N/A | Disjoint `default` / `unseen` entity pools to detect memorization |
| Output | Plain text | Structured `needle_records` with type, entity, value, char position, depth fraction |
| Negatives | None | Optional paired `corrupt_example` for contrastive eval |
Designed for: **honest long-context evaluation, contrastive training data, lost-in-the-middle studies with realistic confusion.**
## Quickstart (Python)
```python
from hard_needle import generate_hard_example, generate_dataset

# Generate a single example with 3 confusable needles.
ex = generate_hard_example(num_needles=3, ctx_chars=8000, seed=42)
print(ex["prompt"])  # full input prompt
print(ex["target"])  # gold answer
for r in ex["needle_records"]:
    print(r["type"], r["entity"], "->", r["value"], f"(depth={r['depth_frac']:.2f})")

# Generate a full dataset with a share of corrupted examples.
ds = generate_dataset(
    num_examples=500,
    num_needles=6,
    ctx_chars=12000,
    pool_set="default",  # or "unseen" for held-out generalization eval
    include_corrupted=True,
    corruption_ratio=0.2,
    seed=0,
)
```
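Because every example records all of its needles, a scorer can separate "answered with a different needle's value" from "answered with nothing relevant". The following is a minimal sketch, not part of the package: `classify_prediction` is a hypothetical helper, it relies only on the `target` and `needle_records` fields shown in this README, and it uses case-insensitive substring matching as a simplifying assumption.

```python
def classify_prediction(example: dict, prediction: str) -> str:
    """Label one model output as 'correct', 'confused', or 'miss'.

    Hypothetical helper: substring matching is a simplification and may
    need tightening for values that are prefixes of one another.
    """
    pred = prediction.lower()
    target = example["target"].lower()
    # Values of the other needles in the same haystack (the confusable facts).
    distractors = {r["value"].lower() for r in example["needle_records"]} - {target}
    has_target = target in pred
    has_distractor = any(value in pred for value in distractors)
    if has_target and not has_distractor:
        return "correct"
    if has_distractor:
        return "confused"  # the model bound the wrong value to the entity
    return "miss"


# Hand-written record mirroring the example in "What you get".
sample = {
    "target": "ATL-7701",
    "needle_records": [
        {"entity": "Aurora", "value": "ATL-7704"},
        {"entity": "Aegis", "value": "ANV-5503"},
        {"entity": "Apollo", "value": "ATL-7701"},
    ],
}
print(classify_prediction(sample, "The code is ATL-7701."))
print(classify_prediction(sample, "ATL-7704"))
```

A high "confused" rate relative to "miss" suggests the model found the template but failed to disambiguate by entity, which is the failure mode this generator is meant to expose.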
## CLI
```bash
hard-needle-generate \
  --num-examples 1000 \
  --num-needles 6 \
  --ctx-chars 12000 \
  --pool-set default \
  --include-corrupted \
  --corruption-ratio 0.2 \
  --seed 42 \
  --output train.jsonl

# Disjoint eval pool: no entity/value overlap with --pool-set default
hard-needle-generate \
  --num-examples 200 \
  --num-needles 6 \
  --ctx-chars 12000 \
  --pool-set unseen \
  --seed 100 \
  --output eval.jsonl
```
Each output line is a JSON object:
```json
{
  "prompt": "You are an internal assistant for the ...",
  "target": "ATL-7701",
  "text": " ",
  "question": "What is the special project code for Apollo?",
  "target_needle_type": "project_code",
  "target_entity": "Apollo",
  "target_value": "ATL-7701",
  "needle_records": [
    {
      "type": "project_code",
      "entity": "Aurora",
      "value": "ATL-7704",
      "sentence": "The special project code for Aurora is ATL-7704.",
      "char_pos": 1842,
      "depth_frac": 0.42
    }
  ],
  "num_needles": 3,
  "ctx_chars": 8000,
  "pool_set": "default",
  "is_corrupted": false
}
```
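The `depth_frac` field makes it straightforward to bucket accuracy by where the queried needle sits in the context, as in a lost-in-the-middle study. The sketch below is illustrative only: `load_jsonl`, `target_depth`, and `accuracy_by_depth` are hypothetical helpers, and it assumes that the full `needle_records` list contains the target needle and that `depth_frac` lies in the range 0 to 1.

```python
import json
from collections import defaultdict
from typing import Optional


def load_jsonl(path: str) -> list:
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def target_depth(example: dict) -> Optional[float]:
    """Find the depth of the needle the question asks about."""
    for r in example["needle_records"]:
        if (r["type"] == example["target_needle_type"]
                and r["entity"] == example["target_entity"]):
            return r["depth_frac"]
    return None  # assumption violated: target not listed in needle_records


def accuracy_by_depth(examples, predictions, num_buckets=5) -> dict:
    """Map bucket index (0 = start of context) to substring-match accuracy."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        depth = target_depth(ex)
        if depth is None:
            continue
        bucket = min(int(depth * num_buckets), num_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(ex["target"] in pred)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


def _make(entity, value, depth):
    # Tiny hand-written record with only the fields this sketch reads.
    return {
        "target": value, "target_needle_type": "project_code", "target_entity": entity,
        "needle_records": [{"type": "project_code", "entity": entity,
                            "value": value, "depth_frac": depth}],
    }


demo = [_make("Apollo", "ATL-7701", 0.1), _make("Aurora", "ATL-7704", 0.9)]
print(accuracy_by_depth(demo, ["ATL-7701", "no idea"]))
```

With real data, `examples` would come from `load_jsonl("eval.jsonl")` and `predictions` from whatever model is being evaluated.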
## Needle types
Each example uses one of four needle types — all entities are projects, but the value type varies:
| Type | Entity example | Value example |
|---|---|---|
| `project_code` | `Aurora` | `AUR-4521` |
| `deadline` | `Apollo` | `April 03` |
| `budget` | `Atlas` | `$1.4M` |
| `lead` | `Andromeda` | `Dr. Sarah Chen` |
The disjoint `unseen` pool uses different surface forms (e.g. `Brontis`, `BRX-9001`, `Dr. Aiko Tanaka`) for held-out generalization eval.
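When mixing generated files, it can be worth confirming that a train split and an eval split really share no entities or values. The following is a minimal sketch of such a check, assuming only the `needle_records` structure shown above; `pool_overlap` is a hypothetical helper, not a function exported by the package.

```python
def pool_overlap(train_examples, eval_examples) -> dict:
    """Return the entities and values that appear in both splits."""

    def collect(examples):
        entities, values = set(), set()
        for ex in examples:
            for r in ex["needle_records"]:
                entities.add(r["entity"])
                values.add(r["value"])
        return entities, values

    train_entities, train_values = collect(train_examples)
    eval_entities, eval_values = collect(eval_examples)
    return {
        "entities": train_entities & eval_entities,
        "values": train_values & eval_values,
    }


# Hand-written records using the surface forms quoted in this README.
train = [{"needle_records": [{"entity": "Aurora", "value": "AUR-4521"}]}]
held_out = [{"needle_records": [{"entity": "Brontis", "value": "BRX-9001"}]}]
overlap = pool_overlap(train, held_out)
print(overlap)
```

Both sets should come back empty when one file was generated with `--pool-set default` and the other with `--pool-set unseen`.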
## Optional extras
```bash
pip install "hard-needle[datasets]" # PG-19 streaming distractors (vs builtin pool)
pip install "hard-needle[tokenizer]" # Token-aware truncation via transformers
pip install "hard-needle[dev]" # pytest
```
## Limitations
- Context length is controlled in **characters** by default. Token-aware truncation requires the `[tokenizer]` extra and is best-effort across needle insertions.
- The builtin distractor pool is small. Use `--distractor-source pg19` (with the `[datasets]` extra) for production-scale data.
- Templates are deliberately simple ("The X for Y is Z"). For paraphrase-robustness studies, augment downstream.
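One way to approach the downstream paraphrase augmentation mentioned above is to swap each recorded needle sentence for an alternative phrasing. This is a rough sketch under several assumptions: the `sentence` field appears verbatim in `prompt`, the paraphrase templates are invented here for illustration (only `project_code` is covered), and `paraphrase_needles` is a hypothetical helper. Note that `char_pos` and `depth_frac` go stale once sentence lengths change.

```python
import random

# Illustrative templates only; these are not shipped with the package.
PARAPHRASES = {
    "project_code": [
        "{entity} has been assigned the special project code {value}.",
        "Special project code {value} belongs to {entity}.",
    ],
}


def paraphrase_needles(example: dict, rng: random.Random) -> dict:
    """Return a copy of the example with needle sentences rephrased."""
    prompt = example["prompt"]
    new_records = []
    for r in example["needle_records"]:
        templates = PARAPHRASES.get(r["type"])
        if not templates or r["sentence"] not in prompt:
            new_records.append(r)  # leave unknown types untouched
            continue
        sentence = rng.choice(templates).format(entity=r["entity"], value=r["value"])
        prompt = prompt.replace(r["sentence"], sentence, 1)
        new_records.append({**r, "sentence": sentence})
    return {**example, "prompt": prompt, "needle_records": new_records}


original = {
    "prompt": "Intro. The special project code for Aurora is ATL-7704. Outro.",
    "needle_records": [{
        "type": "project_code", "entity": "Aurora", "value": "ATL-7704",
        "sentence": "The special project code for Aurora is ATL-7704.",
    }],
}
augmented = paraphrase_needles(original, random.Random(0))
print(augmented["prompt"])
```

Seeding the `random.Random` instance keeps the augmentation reproducible, in the same spirit as the generator's `--seed` flag.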
## Citing / links
If `hard-needle` helped your research or evaluation, a star is appreciated. If you publish using it, drop a link to your work in the issues — happy to maintain a "used by" list.
## License
MIT — see [LICENSE](LICENSE).