https://github.com/denial-web/hard-needle

Semantically hard multi-needle long-context data generator. Stop testing LLMs with random-password needles.

benchmark llm llm-evaluation long-context needle-in-a-haystack python rag synthetic-data

Host: GitHub
URL: https://github.com/denial-web/hard-needle
Owner: denial-web
License: mit
Created: 2026-04-29T04:28:06.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-04-29T04:28:29.000Z (about 1 month ago)
Last Synced: 2026-04-30T00:13:40.944Z (about 1 month ago)
Topics: benchmark, llm, llm-evaluation, long-context, needle-in-a-haystack, python, rag, synthetic-data
Language: Python
Homepage: https://pypi.org/project/hard-needle/
Size: 28.3 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# hard-needle

[![CI](https://github.com/denial-web/hard-needle/actions/workflows/ci.yml/badge.svg)](https://github.com/denial-web/hard-needle/actions/workflows/ci.yml)

[![PyPI](https://img.shields.io/pypi/v/hard-needle.svg)](https://pypi.org/project/hard-needle/)

[![Python](https://img.shields.io/pypi/pyversions/hard-needle.svg)](https://pypi.org/project/hard-needle/)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

**Stop testing long-context LLMs with random passwords.** `hard-needle` generates haystacks where multiple confusable facts share the same template, so the model has to actually disambiguate by entity instead of pattern-matching a unique token.

```bash
pip install hard-needle
hard-needle-generate --num-examples 200 --num-needles 6 --ctx-chars 12000 --output eval.jsonl
```

## What you get

Each example places **multiple semantically similar facts** in a haystack and asks about one of them:

```
Document:
... [haystack of distractor sentences] ...
The special project code for Aurora is ATL-7704.
... [more distractors] ...
The special project code for Aegis is ANV-5503.
... [more distractors] ...
The special project code for Apollo is ATL-7701.
... [more distractors] ...

Question: What is the special project code for Apollo?
Answer: ATL-7701
```

A model that "just remembers there was a project code in the document" gets it wrong. It has to **bind the right code to the right project name**.
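One way to make that failure mode visible is to separate "answered with another needle's value" from every other kind of miss. The sketch below is not part of `hard-needle`; `classify_answer` is a hypothetical helper, and lenient substring matching is an assumption you may want to tighten. It relies only on the `target` and `needle_records` fields of a generated example.

```python
def classify_answer(example: dict, model_answer: str) -> str:
    """Return 'correct', 'binding_error', or 'other_error' for one example.

    Hypothetical helper: uses case-insensitive substring matching,
    which is an assumption, not the library's scoring rule.
    """
    answer = model_answer.strip().lower()
    target = example["target"].strip().lower()
    if target in answer:
        return "correct"
    # Values that belong to the other (distractor) needles in the same document.
    distractor_values = [
        r["value"].strip().lower()
        for r in example["needle_records"]
        if r["value"].strip().lower() != target
    ]
    if any(v in answer for v in distractor_values):
        # The model retrieved a real needle value but bound it to the wrong entity.
        return "binding_error"
    return "other_error"


example = {
    "target": "ATL-7701",
    "needle_records": [
        {"type": "project_code", "entity": "Aurora", "value": "ATL-7704"},
        {"type": "project_code", "entity": "Aegis", "value": "ANV-5503"},
        {"type": "project_code", "entity": "Apollo", "value": "ATL-7701"},
    ],
}

print(classify_answer(example, "The code is ATL-7701."))
print(classify_answer(example, "ATL-7704"))
print(classify_answer(example, "I don't know"))
```

An answer that lists both the target and a distractor value counts as correct under this rule; switch to exact matching if that is too generous for your evaluation.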

## Why it matters

Most public Needle-in-a-Haystack benchmarks insert one obvious sentence (`The magic password is 7XQ32B`) into Paul Graham essays. Modern LLMs ace this with shallow attention because the needle has a unique surface form. Real long-context tasks (reading meeting notes, parsing legal documents, multi-hop QA) almost never look like that.

`hard-needle` gives you:

| | Standard NIH | `hard-needle` |
|---|---|---|
| Distractors | Generic prose | Semantically similar facts (multiple project codes, multiple deadlines, etc.) |
| Disambiguation | None: needle is unique | Required: model must bind value to entity |
| Eval pool isolation | N/A | Disjoint `default` / `unseen` entity pools to detect memorization |
| Output | Plain text | Structured `needle_records` with type, entity, value, char position, depth fraction |
| Negatives | None | Optional paired `corrupt_example` for contrastive eval |

Designed for: **honest long-context evaluation, contrastive training data, lost-in-the-middle studies with realistic confusion.**

## Quickstart (Python)

```python
from hard_needle import generate_hard_example, generate_dataset

ex = generate_hard_example(num_needles=3, ctx_chars=8000, seed=42)
print(ex["prompt"])           # full input prompt
print(ex["target"])           # gold answer
for r in ex["needle_records"]:
    print(r["type"], r["entity"], "->", r["value"], f"(depth={r['depth_frac']:.2f})")

ds = generate_dataset(
    num_examples=500,
    num_needles=6,
    ctx_chars=12000,
    pool_set="default",       # or "unseen" for held-out generalization eval
    include_corrupted=True,
    corruption_ratio=0.2,
    seed=0,
)
```

## CLI

```bash
hard-needle-generate \
    --num-examples 1000 \
    --num-needles 6 \
    --ctx-chars 12000 \
    --pool-set default \
    --include-corrupted \
    --corruption-ratio 0.2 \
    --seed 42 \
    --output train.jsonl

# Disjoint eval pool: no entity/value overlap with --pool-set default
hard-needle-generate \
    --num-examples 200 \
    --num-needles 6 \
    --ctx-chars 12000 \
    --pool-set unseen \
    --seed 100 \
    --output eval.jsonl
```

Each output line is a JSON object:

```json
{
  "prompt": "You are an internal assistant for the ...",
  "target": "ATL-7701",
  "text": " ",
  "question": "What is the special project code for Apollo?",
  "target_needle_type": "project_code",
  "target_entity": "Apollo",
  "target_value": "ATL-7701",
  "needle_records": [
    {
      "type": "project_code",
      "entity": "Aurora",
      "value": "ATL-7704",
      "sentence": "The special project code for Aurora is ATL-7704.",
      "char_pos": 1842,
      "depth_frac": 0.42
    }
  ],
  "num_needles": 3,
  "ctx_chars": 8000,
  "pool_set": "default",
  "is_corrupted": false
}
```
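Because every record carries a `depth_frac` per needle, a lost-in-the-middle breakdown takes only a few lines. The following is a minimal sketch rather than a shipped utility: `target_depth` and `accuracy_by_depth` are hypothetical names, exact-match scoring is an assumption, and it assumes the target's own record appears in `needle_records` (the abbreviated sample above shows only one record).

```python
import json
from collections import defaultdict


def target_depth(example):
    """Return depth_frac of the record matching the target entity and value."""
    for r in example["needle_records"]:
        if r["entity"] == example["target_entity"] and r["value"] == example["target_value"]:
            return r["depth_frac"]
    return None  # target record not found (assumption violated)


def accuracy_by_depth(examples, predictions, num_bins=5):
    """Exact-match accuracy grouped into equal-width depth bins (0 = start of document)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        depth = target_depth(ex)
        if depth is None:
            continue
        b = min(int(depth * num_bins), num_bins - 1)  # depth 1.0 falls in the last bin
        totals[b] += 1
        hits[b] += int(ex["target"].strip() == pred.strip())
    return {b: hits[b] / totals[b] for b in sorted(totals)}


def make_line(depth):
    """Build one JSONL line shaped like the record above (illustrative data only)."""
    return json.dumps({
        "target": "ATL-7701",
        "target_entity": "Apollo",
        "target_value": "ATL-7701",
        "needle_records": [
            {"entity": "Aurora", "value": "ATL-7704", "depth_frac": 0.9},
            {"entity": "Apollo", "value": "ATL-7701", "depth_frac": depth},
        ],
    })


# In practice, read these lines from the generated .jsonl file.
examples = [json.loads(make_line(d)) for d in (0.1, 0.5, 0.55, 1.0)]
predictions = ["ATL-7701", "ATL-7704", "ATL-7701", "ATL-7701"]
print(accuracy_by_depth(examples, predictions))
```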

## Needle types

Each example uses one of four needle types — all entities are projects, but the value type varies:

| Type | Entity example | Value example |
|---|---|---|
| `project_code` | `Aurora` | `AUR-4521` |
| `deadline` | `Apollo` | `April 03` |
| `budget` | `Atlas` | `$1.4M` |
| `lead` | `Andromeda` | `Dr. Sarah Chen` |

The disjoint `unseen` pool uses different surface forms (e.g. `Brontis`, `BRX-9001`, `Dr. Aiko Tanaka`) for held-out generalization eval.
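If you want to confirm the isolation on your own generated files before trusting an eval number, a quick set intersection over `needle_records` is enough. This is a sketch under stated assumptions: `pool_overlap` is a hypothetical helper, not a `hard-needle` function, and the sample records reuse the illustrative entities and values from this README.

```python
def pool_overlap(train_examples, eval_examples):
    """Return entities and values that appear in both splits (ideally both sets are empty)."""
    def collect(examples, key):
        return {r[key] for ex in examples for r in ex["needle_records"]}

    return {
        "entity": collect(train_examples, "entity") & collect(eval_examples, "entity"),
        "value": collect(train_examples, "value") & collect(eval_examples, "value"),
    }


# Illustrative records; in practice load train.jsonl and eval.jsonl with json.loads.
train = [{"needle_records": [{"entity": "Aurora", "value": "AUR-4521"}]}]
held_out = [{"needle_records": [{"entity": "Brontis", "value": "BRX-9001"}]}]
leaky = [{"needle_records": [{"entity": "Aurora", "value": "BRX-9001"}]}]

print(pool_overlap(train, held_out))  # both sets empty: the splits are disjoint
print(pool_overlap(train, leaky))     # reports the shared entity
```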

## Optional extras

```bash
pip install "hard-needle[datasets]"      # PG-19 streaming distractors (vs builtin pool)
pip install "hard-needle[tokenizer]"     # Token-aware truncation via transformers
pip install "hard-needle[dev]"           # pytest
```

## Limitations

- Context length is controlled in **characters** by default. Token-aware truncation requires the `[tokenizer]` extra and is best-effort across needle insertions.

- The built-in distractor pool is small. Use `--distractor-source pg19` for production-scale data.

- Templates are deliberately simple ("The X for Y is Z"). For paraphrase-robustness studies, augment downstream.
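
For the last point, downstream augmentation can be as simple as swapping each needle sentence for an alternative phrasing. The sketch below is one possible approach, not something the package provides: the `PARAPHRASES` templates and the `paraphrase_needle` helper are hypothetical, and it assumes each record's `sentence` appears verbatim in the text you pass in.

```python
import random

# Hypothetical alternative phrasings per needle type; extend for the other types.
PARAPHRASES = {
    "project_code": [
        "{entity} has been assigned the special project code {value}.",
        "For {entity}, the special project code is {value}.",
    ],
}


def paraphrase_needle(text, record, rng):
    """Replace one needle sentence in text with a randomly chosen paraphrase.

    Returns text unchanged if the type has no templates or the sentence is absent.
    """
    templates = PARAPHRASES.get(record["type"])
    if not templates or record["sentence"] not in text:
        return text
    new_sentence = rng.choice(templates).format(
        entity=record["entity"], value=record["value"]
    )
    return text.replace(record["sentence"], new_sentence, 1)


record = {
    "type": "project_code",
    "entity": "Aurora",
    "value": "ATL-7704",
    "sentence": "The special project code for Aurora is ATL-7704.",
}
text = "Some distractor. The special project code for Aurora is ATL-7704. More distractors."
augmented = paraphrase_needle(text, record, random.Random(0))
print(augmented)
```

Rewriting a sentence changes its length, so any stored `char_pos` and `depth_frac` values for later needles become approximate and should be recomputed if you depend on them.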

## Citing / links

If `hard-needle` helped your research or evaluation, a star is appreciated. If you publish using it, drop a link to your work in the issues — happy to maintain a "used by" list.

## License

MIT — see [LICENSE](LICENSE).
