https://github.com/denial-web/hard-needle

Semantically hard multi-needle long-context data generator. Stop testing LLMs with random-password needles.
benchmark llm llm-evaluation long-context needle-in-a-haystack python rag synthetic-data

# hard-needle

[![CI](https://github.com/denial-web/hard-needle/actions/workflows/ci.yml/badge.svg)](https://github.com/denial-web/hard-needle/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/hard-needle.svg)](https://pypi.org/project/hard-needle/)
[![Python](https://img.shields.io/pypi/pyversions/hard-needle.svg)](https://pypi.org/project/hard-needle/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

**Stop testing long-context LLMs with random passwords.** `hard-needle` generates haystacks where multiple confusable facts share the same template, so the model has to actually disambiguate by entity instead of pattern-matching a unique token.

```bash
pip install hard-needle
hard-needle-generate --num-examples 200 --num-needles 6 --ctx-chars 12000 --output eval.jsonl
```

## What you get

Each example places **multiple semantically similar facts** in a haystack and asks about one of them:

```
Document:
... [haystack of distractor sentences] ...
The special project code for Aurora is ATL-7704.
... [more distractors] ...
The special project code for Aegis is ANV-5503.
... [more distractors] ...
The special project code for Apollo is ATL-7701.
... [more distractors] ...

Question: What is the special project code for Apollo?
Answer: ATL-7701
```

A model that "just remembers there was a project code in the document" gets it wrong. It has to **bind the right code to the right project name**.
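When scoring, it can help to separate binding errors (the model returned a value that belongs to a different entity) from plain misses. The sketch below is one possible way to do that; `classify_prediction` is a hypothetical helper, not part of `hard-needle`, and it assumes only the `target` and `needle_records[*].value` fields of the record format. Substring matching is a simplification.

```python
def classify_prediction(prediction: str, example: dict) -> str:
    """Label a model answer as 'correct', 'confused', or 'miss'.

    Hypothetical helper, not part of hard-needle. Assumes the record
    fields `target` and `needle_records[*].value`.
    """
    text = prediction.strip()
    # Lenient: an answer that contains the gold value counts as correct,
    # even if it also mentions other values.
    if example["target"] in text:
        return "correct"
    # A distractor needle's value in the answer means the model found a
    # fact of the right shape but bound it to the wrong entity.
    wrong_values = {
        r["value"]
        for r in example["needle_records"]
        if r["value"] != example["target"]
    }
    if any(v in text for v in wrong_values):
        return "confused"
    return "miss"


example = {
    "target": "ATL-7701",
    "needle_records": [
        {"entity": "Aurora", "value": "ATL-7704"},
        {"entity": "Aegis", "value": "ANV-5503"},
        {"entity": "Apollo", "value": "ATL-7701"},
    ],
}
print(classify_prediction("The code is ATL-7701.", example))
print(classify_prediction("ATL-7704", example))
print(classify_prediction("I don't know.", example))
```

A high `confused` rate relative to `miss` suggests the model locates needles but fails to disambiguate by entity.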

## Why it matters

Most public Needle-in-a-Haystack (NIH) benchmarks insert one obvious sentence (`The magic password is 7XQ32B`) into Paul Graham essays. Modern LLMs ace this with shallow attention because the needle has a unique surface form. Real long-context tasks (reading meeting notes, parsing legal documents, multi-hop QA) almost never look like that.

`hard-needle` gives you:

| | Standard NIH | `hard-needle` |
|---|---|---|
| Distractors | Generic prose | Semantically similar facts (multiple project codes, multiple deadlines, etc.) |
| Disambiguation | None — needle is unique | Required — model must bind value to entity |
| Eval pool isolation | N/A | Disjoint `default` / `unseen` entity pools to detect memorization |
| Output | Plain text | Structured `needle_records` with type, entity, value, char position, depth fraction |
| Negatives | None | Optional paired `corrupt_example` for contrastive eval |

Designed for: **honest long-context evaluation, contrastive training data, lost-in-the-middle studies with realistic confusion.**

## Quickstart (Python)

```python
from hard_needle import generate_hard_example, generate_dataset

ex = generate_hard_example(num_needles=3, ctx_chars=8000, seed=42)
print(ex["prompt"]) # full input prompt
print(ex["target"]) # gold answer
for r in ex["needle_records"]:
    print(r["type"], r["entity"], "->", r["value"], f"(depth={r['depth_frac']:.2f})")

ds = generate_dataset(
    num_examples=500,
    num_needles=6,
    ctx_chars=12000,
    pool_set="default",  # or "unseen" for held-out generalization eval
    include_corrupted=True,
    corruption_ratio=0.2,
    seed=0,
)
```

## CLI

```bash
hard-needle-generate \
  --num-examples 1000 \
  --num-needles 6 \
  --ctx-chars 12000 \
  --pool-set default \
  --include-corrupted \
  --corruption-ratio 0.2 \
  --seed 42 \
  --output train.jsonl

# Disjoint eval pool — no entity/value overlap with --pool-set default
hard-needle-generate \
  --num-examples 200 \
  --num-needles 6 \
  --ctx-chars 12000 \
  --pool-set unseen \
  --seed 100 \
  --output eval.jsonl
```

Each output line is a JSON object:

```json
{
  "prompt": "You are an internal assistant for the ...",
  "target": "ATL-7701",
  "text": " ",
  "question": "What is the special project code for Apollo?",
  "target_needle_type": "project_code",
  "target_entity": "Apollo",
  "target_value": "ATL-7701",
  "needle_records": [
    {
      "type": "project_code",
      "entity": "Aurora",
      "value": "ATL-7704",
      "sentence": "The special project code for Aurora is ATL-7704.",
      "char_pos": 1842,
      "depth_frac": 0.42
    }
  ],
  "num_needles": 3,
  "ctx_chars": 8000,
  "pool_set": "default",
  "is_corrupted": false
}
```
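For lost-in-the-middle studies, the `depth_frac` field can be used to bucket accuracy by where the queried needle sits. The following is a minimal sketch, not part of the package: `target_depth` and `accuracy_by_depth` are hypothetical helpers, and the code assumes that `depth_frac` lies in [0, 1] and that the queried needle appears in `needle_records` with matching `entity` and `value`.

```python
import json
from collections import defaultdict


def target_depth(example: dict) -> float:
    """Return depth_frac of the needle the question asks about."""
    for r in example["needle_records"]:
        if (r["entity"] == example["target_entity"]
                and r["value"] == example["target_value"]):
            return r["depth_frac"]
    raise ValueError("target needle not found in needle_records")


def accuracy_by_depth(examples, predictions, num_bins=5):
    """Exact-match accuracy per depth bin; returns {bin_index: accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex, pred in zip(examples, predictions):
        # Clamp so depth_frac == 1.0 falls into the last bin.
        b = min(int(target_depth(ex) * num_bins), num_bins - 1)
        totals[b] += 1
        hits[b] += int(pred.strip() == ex["target"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Two records in the documented shape, parsed the way JSONL lines would be.
lines = [
    '{"target": "ATL-7701", "target_entity": "Apollo", "target_value": "ATL-7701", '
    '"needle_records": [{"entity": "Apollo", "value": "ATL-7701", "depth_frac": 0.10}]}',
    '{"target": "ANV-5503", "target_entity": "Aegis", "target_value": "ANV-5503", '
    '"needle_records": [{"entity": "Aegis", "value": "ANV-5503", "depth_frac": 0.55}]}',
]
examples = [json.loads(line) for line in lines]
result = accuracy_by_depth(examples, ["ATL-7701", "ATL-7704"])
print(result)
```

For a real run, replace `lines` with the lines of a generated `.jsonl` file and `predictions` with model outputs in the same order.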

## Needle types

Each example uses one of four needle types — all entities are projects, but the value type varies:

| Type | Entity example | Value example |
|---|---|---|
| `project_code` | `Aurora` | `AUR-4521` |
| `deadline` | `Apollo` | `April 03` |
| `budget` | `Atlas` | `$1.4M` |
| `lead` | `Andromeda` | `Dr. Sarah Chen` |

The disjoint `unseen` pool uses different surface forms (e.g. `Brontis`, `BRX-9001`, `Dr. Aiko Tanaka`) for held-out generalization eval.
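Before trusting a held-out number, it may be worth confirming that two generated splits really share no entities or values. This is a rough sanity check under the assumption that every record carries `needle_records` with `entity` and `value` fields; `pool_overlap` is a hypothetical helper, not a `hard-needle` function.

```python
def pool_overlap(train_examples, eval_examples):
    """Return entities and values that appear in both splits.

    Hypothetical sanity check; assumes each example has
    `needle_records` with `entity` and `value` keys.
    """
    def collect(examples, key):
        return {r[key] for ex in examples for r in ex["needle_records"]}

    return {
        "entity": collect(train_examples, "entity") & collect(eval_examples, "entity"),
        "value": collect(train_examples, "value") & collect(eval_examples, "value"),
    }


train = [{"needle_records": [{"entity": "Aurora", "value": "ATL-7704"}]}]
held_out = [{"needle_records": [{"entity": "Brontis", "value": "BRX-9001"}]}]
overlap = pool_overlap(train, held_out)
print(overlap)  # both sets should be empty for disjoint pools
```

A non-empty set in either key would mean the eval split can be partly solved by memorizing training entities or values.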

## Optional extras

```bash
pip install "hard-needle[datasets]" # PG-19 streaming distractors (vs builtin pool)
pip install "hard-needle[tokenizer]" # Token-aware truncation via transformers
pip install "hard-needle[dev]" # pytest
```

## Limitations

- Context length is controlled in **characters** by default. Token-aware truncation requires the `[tokenizer]` extra and is best-effort across needle insertions.
- The builtin distractor pool is small. Use `--distractor-source pg19` (PG-19 streaming, from the `[datasets]` extra) for production-scale data.
- Templates are deliberately simple ("The X for Y is Z"). For paraphrase-robustness studies, augment downstream.
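Downstream paraphrase augmentation could look roughly like the sketch below. Everything here is illustrative: `PARAPHRASES` and `paraphrase_needles` are hypothetical and not shipped with `hard-needle`, only the `project_code` type is handled, and the code assumes each record's `sentence` occurs verbatim in `prompt`. Because the replacement changes sentence lengths, `char_pos` and `depth_frac` are not recomputed and become approximate.

```python
import random

# Hypothetical paraphrase templates for the `project_code` type only;
# illustrative, not part of hard-needle.
PARAPHRASES = [
    "{entity} has been assigned the special project code {value}.",
    "For {entity}, the special project code is {value}.",
]


def paraphrase_needles(example: dict, seed: int = 0) -> dict:
    """Return a copy of `example` with project_code needle sentences reworded.

    Assumes each record's `sentence` occurs verbatim in `prompt`.
    `char_pos` and `depth_frac` are left untouched and go stale.
    """
    rng = random.Random(seed)
    prompt = example["prompt"]
    records = []
    for r in example["needle_records"]:
        r = dict(r)  # copy so the input example is not mutated
        if r["type"] == "project_code" and r["sentence"] in prompt:
            new = rng.choice(PARAPHRASES).format(entity=r["entity"], value=r["value"])
            prompt = prompt.replace(r["sentence"], new, 1)
            r["sentence"] = new
        records.append(r)
    return {**example, "prompt": prompt, "needle_records": records}


example = {
    "prompt": "Filler. The special project code for Aurora is ATL-7704. More filler.",
    "needle_records": [{
        "type": "project_code", "entity": "Aurora", "value": "ATL-7704",
        "sentence": "The special project code for Aurora is ATL-7704.",
    }],
}
augmented = paraphrase_needles(example, seed=1)
print(augmented["prompt"])
```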

## Citing / links

If `hard-needle` helped your research or evaluation, a star is appreciated. If you publish using it, drop a link to your work in the issues — happy to maintain a "used by" list.

## License

MIT — see [LICENSE](LICENSE).