https://github.com/leagames0221-sys/longctx-bench-honest

Honest measurement of 1M-token long-context benchmarks (RULER + LongBench v2 + NIAH) on Qwen2.5-7B-1M local vs GitHub Models cloud. All zero credit card, drift-checked, reproducible.
benchmark bitsandbytes consumer-laptop github-models llm long-context niah portfolio qwen qwen2-5 transformers vllm
Host: GitHub
URL: https://github.com/leagames0221-sys/longctx-bench-honest
Owner: leagames0221-sys
License: mit
Created: 2026-05-11T12:09:29.000Z (24 days ago)
Default Branch: main
Last Pushed: 2026-05-25T13:56:44.000Z (10 days ago)
Last Synced: 2026-05-25T14:31:35.626Z (10 days ago)
Topics: benchmark, bitsandbytes, consumer-laptop, github-models, llm, long-context, niah, portfolio, qwen, qwen2-5, transformers, vllm
Language: Jupyter Notebook
Size: 1.11 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE.md

# longctx-bench-honest

> **Honest measurement of 1M-token long-context benchmarks** on a consumer laptop.

> Local [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) vs cloud frontier models ([GPT-5](https://github.com/marketplace/models) / [Claude Sonnet](https://github.com/marketplace/models) / [Llama 3.3](https://github.com/marketplace/models) via GitHub Models) — measured side-by-side on [RULER](https://github.com/NVIDIA/RULER) + [LongBench v2](https://github.com/THUDM/LongBench) + [NIAH](https://github.com/gkamradt/LLMTest_NeedleInAHaystack).

> Zero credit card. Zero API cost (electricity only for local; free-tier for cloud). Drift-checked.

[![drift-check](https://github.com/leagames0221-sys/longctx-bench-honest/actions/workflows/drift-check.yml/badge.svg)](https://github.com/leagames0221-sys/longctx-bench-honest/actions/workflows/drift-check.yml)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

[![Constraint: zero credit card](https://img.shields.io/badge/Constraint-zero%20credit%20card-blue)](#selected-under)

[![Constraint: local LLM (default)](https://img.shields.io/badge/Constraint-local%20LLM%20%28default%29-blue)](#selected-under)

[![Constraint: free / OSS only](https://img.shields.io/badge/Constraint-free%20%2F%20OSS%20only-blue)](#selected-under)

[![Constraint: security defense-in-depth](https://img.shields.io/badge/Constraint-security%20defense--in--depth-blue)](#selected-under)

[![Constraint: consumer laptop](https://img.shields.io/badge/Constraint-consumer%20laptop-blue)](#selected-under)

[![Constraint: drift-CI enforced](https://img.shields.io/badge/Constraint-drift--CI%20enforced-blue)](#selected-under)

## Selected under

> **The 4-constraint set** (applied across the full portfolio — verified consistent across all 11 portfolio repos):

>

> 1. **Zero credit card** — no paid API / cloud service required for the default path. A reviewer can clone, install, and run with $0 spend and no payment method on file.

> 2. **Local LLM (default)** — when an LLM is involved, the default path is local (Ollama / similar) or deterministic mock. Paid cloud LLM is opt-in via env var, never default.

> 3. **Free / OSS only** — every runtime dependency is permissively-licensed open source (MIT / Apache-2.0 / BSD-3); no proprietary SDK at build time.

> 4. **Security defense-in-depth** — secrets-scan CI + `.gitignore` hardening, encrypted-at-rest where PII is involved, append-only audit logging where applicable, dep-vuln gating (`pip-audit` / `pnpm audit`), paid-API constructor gate where applicable.

>

> **Additional repo-specific constraints** (this repo applies 2 more on top of the 4 portfolio baseline):

>

> - **Consumer laptop only** — single workstation, no 8-GPU tensor parallel, no datacenter (local model = Qwen2.5-7B-Instruct-1M on consumer GPU; cloud comparison via GitHub Models free tier with `gh auth token`)

> - **Drift-CI enforced** — every README claim verified by [drift-check CI](.github/workflows/drift-check.yml); mismatch fails the build (cost-tier table numerics literal-matched against `artifacts/*.json` JSON evidence)

>

> **The thesis**: under these 6 constraints simultaneously, what's the literal best 1M-token long-context measurement buildable in 2026-05? This repo is the answer — every selection (LLM, benchmarks, comparison cloud models, eval methodology) has a sourced rationale in [decisionLog](memory_bank/decisionLog.md) explaining why alternatives were rejected.

>

> Portfolio category: **constraint-optimized AI engineering**.

## Why this is the literal best under the constraint set

Given (1) no CC, (2) consumer laptop, (3) literal 1M context, (4) 2026-05 industry state:

| Choice | Selected | Rejected alternatives + sourced reason |
|---|---|---|
| Local LLM | Qwen2.5-7B-Instruct-1M | Qwen3.6-27B (8 GPU tensor parallel required, [model card](https://huggingface.co/Qwen/Qwen3.6-27B)) / DeepSeek V4 (284B params, consumer infeasible) / Gemma 4 26B (Apache-2.0, but 1M extension not literal default) |
| Cloud comparison API | GitHub Models free tier | Anthropic API (CC required) / OpenAI API (CC required) / Gemini paid (CC required) |
| Benchmark main | RULER + LongBench v2 | NIAH alone ([saturated per industry consensus](https://nrehiew.github.io/blog/long_context/)) / InfiniteBench (less reasoning depth) |
| Benchmark supplement | NIAH (heatmap visual only) | drop entirely (loses recruiter visual recognition) |
| Inference engine | vllm | llama.cpp (slower at long context) / TGI (heavier setup) |
| Drift discipline | `.github/workflows/drift-check.yml` (13 verify steps) | none (= silent drift, the structural failure mode) |

Each rejected option has a sourced reason in [decisionLog](memory_bank/decisionLog.md). The 2-row ADR self-correction history (Qwen2.5-repo hallucination → Qwen3.6-27B 8-GPU discovery → Qwen2.5-7B-1M literal confirmed) is preserved as evidence of constraint-driven option-space audit.

## What this is

A reproducible benchmark repo that does one thing: **measure 4 long-context LLMs across 3 industry-current benchmarks, honestly publish all numbers (good or bad), and prove drift-free via CI**.

The portfolio thesis: in 2026-05, anyone can claim "I ran a 1M-context model." Few can show *which benchmarks*, *which numbers*, *which model lost where*, *and the exact reproducible cost* — all without spending a yen. That's the gap this repo closes.

## Status

**Phase 0 closed** — Scaffolds installed (drift CI / memory_bank / Tier 2 CLAUDE.md / spec.md). Overhaul commit reflects 2026-05 industry state (Qwen3.6/DeepSeek V4 frontier require 8 GPU; Qwen2.5-7B-1M is the consumer-laptop sweet spot for real 1M inference).

**Phase 1 partial (2026-05-12)** — Install layer GREEN (CUDA torch 2.5.1+cu124 + bitsandbytes 0.49.2 int4 NF4 + transformers 5.8.0). Qwen 1M weights (14.22GB) downloaded to `D:\hf_cache`. **Single-needle NIAH baseline ran on consumer hardware (RTX 3050 Laptop, 6GB VRAM)**: 4k context PASS / 5k+ OOM. See [Honest results](#honest-results-phase-1-partial-evidence) and [decisionLog ADR-007](memory_bank/decisionLog.md) for the VRAM ceiling characterization.

**Phase 2a (2026-05-12)** — Cloud comparison via GitHub Models free tier (zero credit card, gh OAuth only). 4 model attempts at matched 4k context: **gpt-4.1-mini PASS (8.54s)**, **llama-3.3-70b-instruct PASS (5.17s)** — both ~30-50x faster than local Qwen 4k. **gpt-5 literal UNAVAILABLE on free tier** (catalog says "available" but inference returns `unavailable_model`). **gpt-5 + deepseek-v3 = hard 4000-token request cap** documented per literal API error. See [Cloud free-tier honest map](#cloud-free-tier-honest-map-phase-2a-evidence) and [decisionLog ADR-008](memory_bank/decisionLog.md). Anthropic Claude **not present in GitHub Models catalog** at all.

**Phase 2b (2026-05-12) — NEGATIVE RESULT, sourced**: WSL2 + vllm 0.7.3 + bitsandbytes int4 cannot fit Qwen2.5-7B-1M on 6GB VRAM. vllm's memory profile shows model weights (5.43GiB) + activation peak (1.42GiB) = 6.85GiB > 6.00GiB total, so the KV cache budget is negative (-0.94GiB as logged), 0 GPU cache blocks are allocated, and concurrency is 0x. **Linux/vllm has no equivalent of the Windows shared-memory PCIe spillover fallback**: the Phase 1 Windows transformers 4k cell was enabled by Windows OS-level memory overcommit, not by the inference engine. See [decisionLog ADR-009](memory_bank/decisionLog.md) for the vllm log evidence. Phase 3 (craftstack 2-repo unification) is the remaining work, documented as a next-session candidate.

## Verified state (drift-checked by CI)

| Item | Expected | Verified by |
|---|---|---|
| License | MIT | `.github/workflows/drift-check.yml` |
| Memory Bank (Cline pattern) | 5 files in `memory_bank/` | drift-check |
| Tier 2 PJ rules | `CLAUDE.md` at repo root | drift-check |
| Spec SSoT | `spec.md` at repo root | drift-check |
| Drift CI | `.github/workflows/drift-check.yml` exists | drift-check |
| Phase claim | Phase 0 (scaffolds + overhaul done) | manual update on phase transition |
| Benchmark scope | README references RULER + LongBench v2 + NIAH | drift-check |
| Model scope | README references Qwen2.5-7B-1M + GitHub Models | drift-check |
| Repo name canon | All internal references use `longctx-bench-honest` | drift-check |

## Cost-tier transparency table

Phase 1 partial result populates the local 4k cell with literal JSON evidence. Larger context cells for the local column carry `OOM @ 6GB VRAM` markers backed by literal failed-run JSON evidence in `artifacts/`. Cloud columns populate in Phase 2.

> **Note on cloud model selection** (literal honest finding): the original Phase 0 plan referenced "GPT-5 / Claude Sonnet / Llama 3.3". The 2026-05-12 Phase 2a literal probe of the GitHub Models catalog API found: **Anthropic Claude is NOT present in the catalog at all** (zero CC + GitHub Models = no Claude access), and **gpt-5 returns `unavailable_model` on this free-tier account** (catalog-listed but inference-unavailable). The table substitutes 4 actually-reachable models for honest comparison.

> **Cell status legend** — every cell either has JSON evidence or is honestly marked as not-yet-measured:

> - ✅ **MEASURED** — cell value backed by JSON evidence under `artifacts/`, drift-CI enforces match

> - ❌ **MEASURED FAILURE** (OOM / TOKEN_LIMIT / UNAVAILABLE) — failure mode literal observed, JSON evidence committed

> - ⏳ **NOT MEASURED** — cell deliberately empty; would be feasible per catalog limits but Phase 2a focused on the 4k cell (matches local 4k for 1:1 comparison). Phase 2b/3 candidate.

> - ⛔ **STRUCTURAL** — infeasible per literal hardware or free-tier constraints (sourced in ADR-007/008/009)

>

> Inference wall-time row footnote: cloud cells include only inference latency (model already loaded server-side); local cell includes 74s cold model load + 178s inference. Pure inference-only ratio is ~21x faster (cloud) vs cold-load-inclusive ratio of ~30x.

| Benchmark | Qwen2.5-7B-1M (local, int4 NF4) | gpt-4.1-mini (GitHub Models) | llama-3.3-70b-instruct (GitHub Models) | deepseek-v3-0324 (GitHub Models) | gpt-5 (GitHub Models) |
|---|---|---|---|---|---|
| NIAH single needle @ 2k | ⏳ (local 4k is primary cell) | ⏳ | ⏳ | ✅ **PASS 1.72s** — [evidence](artifacts/cloud_deepseek-deepseek-v3-0324_2000.json) | ❌ **UNAVAILABLE** (model not accessible on free tier) — [evidence](artifacts/cloud_openai-gpt-5_2000.json) |
| NIAH single needle @ 4k | ✅ **PASS 252s** (peak 10.8GB via Win shared-mem; cold load 74s + inference 178s) — [evidence](artifacts/baseline_4000.json) | ✅ **PASS 8.54s** (prompt=3723 tok) — [evidence](artifacts/cloud_openai-gpt-4-1-mini_4000.json) | ✅ **PASS 5.17s** (prompt=3856 tok) — [evidence](artifacts/cloud_meta-llama-3-3-70b-instruct_4000.json) | ❌ **TOKEN_LIMIT** (free-tier cap=4000) — [evidence](artifacts/cloud_deepseek-deepseek-v3-0324_4000.json) | ❌ **TOKEN_LIMIT** (free-tier cap=4000) — [evidence](artifacts/cloud_openai-gpt-5_4000.json) |
| NIAH single needle @ 5k | ❌ **OOM** (alloc 2.46GB on 11.18GB-used GPU) — [evidence](artifacts/baseline_5000.json) | ⏳ (catalog: 1M input — Phase 2b/3 candidate) | ⏳ (catalog: 128k input — Phase 2b/3 candidate) | ⛔ TOKEN_LIMIT predicted (4000 cap) | ⛔ free-tier unavailable |
| NIAH single needle @ 6k | ❌ **OOM** (alloc 3.57GB on 9.35GB-used GPU) — [evidence](artifacts/baseline_6000.json) | ⏳ | ⏳ | ⛔ TOKEN_LIMIT predicted | ⛔ free-tier unavailable |
| NIAH single needle @ 8k | ❌ **OOM** (alloc 6.43GB single block > 6GB GPU) — [evidence](artifacts/baseline_8000.json) | ⏳ | ⏳ | ⛔ TOKEN_LIMIT predicted | ⛔ free-tier unavailable |
| RULER (13-task avg) | ⛔ requires ≥16k context per task — infeasible on 6GB VRAM | ⏳ feasible (1M catalog) — Phase 2b/3 candidate | ⏳ feasible (128k catalog) — Phase 2b/3 candidate | ⛔ infeasible (4000 free-tier cap) | ⛔ free-tier unavailable |
| LongBench v2 (acc) | ⛔ typical task 32k-128k — infeasible on 6GB VRAM | ⏳ feasible — Phase 2b/3 candidate | ⏳ feasible at 128k cap | ⛔ infeasible (4000 free-tier cap) | ⛔ free-tier unavailable |
| NIAH 128k+ heatmap | ⛔ infeasible on 6GB VRAM (would need 24GB+ or multi-GPU) | ⏳ feasible (1M catalog) — would consume free-tier quota | ⏳ feasible at 128k | ⛔ infeasible (4000 free-tier cap) | ⛔ free-tier unavailable |
| Inference wall-time @ 4k | ✅ 252s incl. 74s cold load (~178s inference-only) | ✅ **8.54s** (~21x faster than local inference-only, ~30x faster than local cold-load-incl.) | ✅ **5.17s** (~34x faster vs local inference-only, ~49x faster vs cold-load-incl.) | ✅ 1.72s @ 2k cell (4k cell hit TOKEN_LIMIT) | ❌ n/a (model unavailable) |
| Cost per measurement run | ✅ electricity only (~¥1) | ✅ free-tier, no CC | ✅ free-tier, no CC | ✅ free-tier, no CC | ⛔ n/a (model unavailable) |
| Credit card required | no | no (GitHub OAuth token only) | no (GitHub OAuth token only) | no (GitHub OAuth token only) | no (but model inaccessible regardless) |
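
The speedup figures in the wall-time row can be recomputed from the cell timings. The snippet below is a minimal sketch that only redoes the arithmetic; the constant and function names are illustrative and do not come from the repo's scripts.

```python
# Wall-time figures copied from the cost-tier table (seconds).
LOCAL_COLD_LOAD_S = 74.0
LOCAL_INFERENCE_S = 178.0
CLOUD_WALL_S = {
    "gpt-4.1-mini": 8.54,
    "llama-3.3-70b-instruct": 5.17,
}


def speedups(cloud_s: float) -> tuple[float, float]:
    """Return (inference-only ratio, cold-load-inclusive ratio) vs the local run."""
    inference_only = LOCAL_INFERENCE_S / cloud_s
    cold_inclusive = (LOCAL_COLD_LOAD_S + LOCAL_INFERENCE_S) / cloud_s
    return inference_only, cold_inclusive


for name, seconds in CLOUD_WALL_S.items():
    inf, cold = speedups(seconds)
    print(f"{name}: ~{inf:.0f}x inference-only, ~{cold:.0f}x cold-load-inclusive")
```

Reporting both ratios keeps the comparison fair: the cloud endpoints are already warm, so the cold-load-inclusive number overstates the pure inference gap.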

**Sample-size disclosure (★★)**: each cell is `n=1` (single seed=42 × single depth=50% × single 7-digit magic-number needle). Industry NIAH benchmarks typically run multi-depth × multi-seed grids; this portfolio's cells are single-point measurements meant to characterize the hardware/cloud ceiling, not full statistical distributions. A multi-depth heatmap is a Phase 2b/3 candidate and is feasible within the 4k local ceiling: 5 depths × n=3 seeds = 15 runs, or roughly 45 minutes at the measured ~178s of inference per run.

**Hardware constraint literally hit**: at int4 NF4 quantization, model weights occupy ~4GB of the 6GB VRAM; inference activations + KV cache exceed available headroom beyond 4k input tokens. Cumulative VRAM demand at 4k = 10.8GB peak (rescued by Windows shared-memory spillover via PCIe, ~10x slower than pure VRAM). At 5k+, a single allocation in the attention forward pass requires more contiguous VRAM than physically available. This is the literal *constraint-optimized AI engineering* boundary on this hardware tier.

**WSL2 + vllm test (Phase 2b, NEGATIVE RESULT)** — Tried PagedAttention via vllm 0.7.3 + bitsandbytes int4 in WSL2 Ubuntu 24.04. vllm memory profile evidence ([wsl_vllm_4000.json](artifacts/wsl_vllm_4000.json)): model weights 5.43GiB + activation peak 1.42GiB = 6.85GiB > 6.00GiB physical → 0 GPU cache blocks, OOM before any inference. **Linux/vllm provides no shared-memory PCIe spillover**: the Windows transformers 4k PASS was structurally dependent on Windows OS-level memory overcommit. The conventional wisdom that Linux/vllm beats Windows/transformers on memory efficiency did not hold at this hardware tier. See [ADR-009](memory_bank/decisionLog.md).
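
The memory budget behind that negative result is plain subtraction. The sketch below redoes it with the figures quoted above; the function name is hypothetical, and the simplified result (-0.85GiB) is not the exact -0.94GiB that vllm logged, presumably because vllm reserves some additional memory (an assumption, not something the log excerpt confirms).

```python
def kv_cache_budget_gib(total_gib: float, weights_gib: float, activation_peak_gib: float) -> float:
    """Memory left for KV cache after weights and peak activations (simplified)."""
    return total_gib - weights_gib - activation_peak_gib


def can_allocate_kv_cache(total_gib: float, weights_gib: float, activation_peak_gib: float) -> bool:
    """True only if some memory remains for KV cache blocks."""
    return kv_cache_budget_gib(total_gib, weights_gib, activation_peak_gib) > 0


# Figures from the Phase 2b vllm memory profile quoted above.
budget = kv_cache_budget_gib(total_gib=6.00, weights_gib=5.43, activation_peak_gib=1.42)
print(f"KV cache budget: {budget:.2f} GiB")
print("KV cache blocks possible:", can_allocate_kv_cache(6.00, 5.43, 1.42))
```

A negative budget means no cache blocks can be allocated, so the engine fails before the first token regardless of how short the prompt is.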

### Visual summary

![NIAH Phase 1 heatmap](docs/heatmap/niah_phase1.png)

Auto-rendered from `artifacts/*.json` by [docs/heatmap/render.py](docs/heatmap/render.py) (matplotlib + numpy, no network egress). Cells follow the cost-tier table above: `PASS` = green, `OOM` = red (local hardware ceiling), `TOKEN_LIMIT` = yellow (cloud free-tier cap), `ERROR` = dark red (model unavailable), `N/A` = grey (not measured in Phase 1). Regenerate after adding artifacts:

```bash
python docs/heatmap/render.py
```
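
For readers who want to see how the status-to-colour legend could be turned into a grid, here is a minimal sketch. It is not the code in `docs/heatmap/render.py`: the record keys `model` and `context_tokens` and the function name are assumptions, and only the `status` values and colours come from the legend above.

```python
# Colour legend from the paragraph above, using matplotlib's named colours.
STATUS_COLORS = {
    "PASS": "green",
    "OOM": "red",
    "TOKEN_LIMIT": "yellow",
    "ERROR": "darkred",
    "N/A": "grey",
}


def build_color_grid(records, models, contexts):
    """Return a models x contexts grid of colour names; unmeasured cells are grey."""
    lookup = {(r["model"], r["context_tokens"]): r["status"] for r in records}
    grid = []
    for model in models:
        row = []
        for ctx in contexts:
            status = lookup.get((model, ctx), "N/A")
            row.append(STATUS_COLORS.get(status, STATUS_COLORS["N/A"]))
        grid.append(row)
    return grid


# Hypothetical records mirroring two local cells from the table.
example = [
    {"model": "local", "context_tokens": 4000, "status": "PASS"},
    {"model": "local", "context_tokens": 5000, "status": "OOM"},
]
print(build_color_grid(example, ["local"], [4000, 5000, 6000]))
```

Defaulting missing cells to grey keeps "not measured" visually distinct from a measured failure.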

## Phase plan

| Phase | Scope | End gate |
|---|---|---|
| **0 (closed)** | scaffold install + overhaul (Qwen2.5-7B-1M + RULER + LongBench v2 + NIAH direction set) | drift CI green on first push + overhaul commit |
| 1 | vllm install + Qwen2.5-7B-1M weight DL + 3 benchmark repo clone/audit + baseline 128k | `pytest` green + baseline RULER subset run |
| 2 | Full 4-model x 3-benchmark sweep + heatmap + honest results section populated | All cost-tier cells filled with JSON evidence + drift CI extended to verify numbers |
| 3 | craftstack integration + r/LocalLLaMA + HN post | craftstack parent-level fold link populated |

## Honest results (Phase 1 partial evidence)

### Where the local 7B model holds up

**NIAH single needle @ 4k context** ✅ — Qwen2.5-7B-Instruct-1M in int4 NF4 quantization on RTX 3050 6GB Laptop correctly extracts a 7-digit magic number planted at 50% depth in a Paul Graham essay haystack. Output: the literal number, nothing else. JSON: [artifacts/baseline_4000.json](artifacts/baseline_4000.json). Inference wall-time: 252 seconds. Cost: ~¥1 of electricity.
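
To make the test shape concrete, the sketch below plants a seeded 7-digit number at a chosen depth and scores a model output. It is an illustration under stated assumptions, not the logic of `examples/baseline_niah.py`: the needle sentence, the function names, and the use of whitespace-separated words as a stand-in for tokens are all hypothetical.

```python
import random


def build_haystack(filler_words, context_words, depth_pct, seed=42):
    """Build a haystack of `context_words` filler words with one needle inserted.

    Words stand in for tokens here; a real run would count tokens with the
    model's tokenizer.
    """
    rng = random.Random(seed)
    magic_number = rng.randint(1_000_000, 9_999_999)  # always 7 digits
    needle = f"The magic number is {magic_number}."
    # Cycle through the filler text until the target length is reached.
    words = [filler_words[i % len(filler_words)] for i in range(context_words)]
    insert_at = int(len(words) * depth_pct / 100)
    words[insert_at:insert_at] = needle.split()
    return " ".join(words), magic_number


def is_pass(model_output: str, magic_number: int) -> bool:
    """PASS if the exact number appears anywhere in the model output."""
    return str(magic_number) in model_output


haystack, number = build_haystack("some essay text to repeat".split(), 200, depth_pct=50)
print(len(haystack.split()), is_pass(f"{number}", number))
```

Fixing the seed makes the needle reproducible across runs, which is what lets a single `n=1` cell be re-checked later.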

### Where the constraint literally hits (hardware ceiling)

| context | result | root cause |
|---|---|---|
| 4k | PASS, 252s, peak 10.8GB | barely fits with Windows shared-mem PCIe spillover |
| 5k | OOM | single alloc 2.46GB on 11.18GB-used GPU — shared-mem fallback exhausted |
| 6k | OOM | single alloc 3.57GB on 9.35GB-used GPU |
| 8k | OOM | single attention forward pass needs 6.43GB contiguous — exceeds 6GB total VRAM |
| 128k / 1M (model design max) | not attempted, predicted infeasible | KV cache alone for 128k context (~7GB) exceeds 6GB VRAM, before model weights |

This is the **literal `constraint-optimized AI engineering` boundary on RTX 3050 6GB Laptop tier**. The model itself is 1M-context capable per its [config.json](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M/blob/main/config.json) (`max_position_embeddings: 1010000`, `dual_chunk_attention_config`). The bottleneck is not the model architecture: the int4 weights (~4GB) plus attention activations and KV cache (~57KB/token × N) exhaust the 6GB VRAM budget by N ≈ 4000 tokens.
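
The ~57KB/token and ~7GB figures can be checked with a short calculation. This sketch assumes the attention geometry of Qwen2.5-7B as 28 layers, 4 key/value heads, head dimension 128, and a 2-byte (fp16/bf16) cache; those values are assumptions to verify against the linked `config.json`, and the function names are illustrative.

```python
# Assumed Qwen2.5-7B attention geometry (verify against config.json).
NUM_LAYERS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16 / bf16 cache entries


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gib(num_tokens: int) -> float:
    """KV cache size in GiB for a given context length."""
    return kv_bytes_per_token() * num_tokens / 2**30


print(kv_bytes_per_token())                  # bytes per token
for n in (4_000, 128 * 1024, 1_010_000):
    print(n, f"{kv_cache_gib(n):.2f} GiB")
```

Under these assumptions the cache at 4k tokens is only about 0.2GiB, which suggests the 4k ceiling is driven mainly by attention activations (as in the 8k row's 6.43GB single allocation), while at 128k the cache alone reaches about 7GiB.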

### Where reasonable engineering fixes the gap (for future Phase 2/3 work)

1. **Chunked decoding + scratchpad re-injection** — split a long-context task into 4k-context windows; preserves consumer-hardware feasibility at the cost of 10-20x wall-time and ~5-15% accuracy degradation (industry observation from RAG benchmarks).

2. **vllm + WSL2 with PagedAttention** — Windows hosts can't run vllm natively, but WSL2 (free, no CC) can. PagedAttention manages the KV cache more efficiently than transformers + bitsandbytes, so it was expected to push the ceiling to ~8-16k on the same hardware. Phase 2b tested this and got a negative result (see [ADR-009](memory_bank/decisionLog.md)).

3. **Cloud frontier via GitHub Models free tier** — direct 128k+ inference where the local hardware caps out. Constraint: an 8000-token free-tier request cap was recorded in the [browser-agent-demo v5 logbook](https://github.com/leagames0221-sys/browser-agent-demo/blob/main/memory_bank/logbook.md#phase-2-v4--v5), and Phase 2a in this repo hit a 4000-token cap on deepseek-v3-0324 and gpt-5, so even the cloud frontier can hit a `zero CC` boundary at a few thousand input tokens.

### Where it doesn't (and a paid frontier is the literal honest answer)

Full 1M-context honest measurement requires either (a) a 24GB+ VRAM workstation GPU (not consumer-laptop tier) or (b) a paid frontier API (GPT-5 1M / Claude 4.7 1M / Gemini 2.0 2M) — both fall outside `consumer laptop` and `zero credit card` constraints respectively. This portfolio is the literal honest map of what's measurable in the intersection of both constraints; the 4k ceiling is the answer, not a failure.

## Cloud free-tier honest map (Phase 2a evidence)

Phase 2a literal probe of GitHub Models free tier (zero credit card, gh OAuth token only, 2026-05-12) produced the literal accessibility matrix:

| Model | Catalog claim | Free-tier reality |
|---|---|---|
| openai/gpt-4.1-mini | 1M input / 32k output / "low" tier | ✅ **PASS @ 4k** (8.54s, prompt=3723 tok) — no hard cap encountered at 4k |
| meta/llama-3.3-70b-instruct | 128k input / 4k output / "high" tier | ✅ **PASS @ 4k** (5.17s, prompt=3856 tok) — no hard cap encountered at 4k |
| deepseek/deepseek-v3-0324 | 128k input / 4k output / "high" tier | ✅ PASS @ 2k (1.72s) / ❌ **TOKEN_LIMIT @ 4k** (literal error: "Max size: 4000 tokens") |
| openai/gpt-5 | 200k input / 100k output / "custom" tier | ❌ **UNAVAILABLE @ 2k** (literal error: "Unavailable model: gpt-5") — catalog-listed but inference-inaccessible |
| anthropic/claude-* | — | ❌ **NOT IN CATALOG** — zero Anthropic models present in GitHub Models marketplace |

**Honest portfolio finding ★★★★** (the literal cloud-side counterpart to the local 4k VRAM ceiling):

> Under `zero credit card`, the literal reachable cloud frontier at 4k context is **gpt-4.1-mini + llama-3.3-70b-instruct**. Both are 30-50x faster than local Qwen 4k. **gpt-5, Claude, Gemini, and any 1M-context cloud test at scale all require either a paid API or a higher GitHub Models tier — outside the constraint set this portfolio commits to.**

Citation chain for the literal API responses: [decisionLog ADR-008](memory_bank/decisionLog.md). Reproducible via `examples/cloud_niah.py --model MODEL --context-tokens N` (for example `--model openai/gpt-4.1-mini --context-tokens 4000`, as in the Quickstart) with a `gh auth token` in the `GITHUB_TOKEN` environment variable.
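
The failure statuses in the matrix above can be derived from the API error text. The sketch below is one way to do that mapping; it is not taken from `examples/cloud_niah.py`, the function name is hypothetical, and the substrings it matches are the only error fragments quoted in this README.

```python
def classify_error(error_text: str) -> str:
    """Map a GitHub Models error message to a cell status.

    Matches only the fragments quoted in this README; anything else is
    reported as a generic ERROR rather than guessed at.
    """
    text = error_text.lower()
    if "max size" in text and "tokens" in text:
        return "TOKEN_LIMIT"
    if "unavailable model" in text or "unavailable_model" in text:
        return "UNAVAILABLE"
    return "ERROR"


for message in (
    "Max size: 4000 tokens",
    "Unavailable model: gpt-5",
    "unavailable_model",
    "something else went wrong",
):
    print(f"{message!r} -> {classify_error(message)}")
```

Keeping an explicit `ERROR` fallback avoids silently filing an unknown failure under one of the two documented modes.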

## Quickstart

Phase 1 install + Phase 2 run path is documented in [SETUP.md](SETUP.md). The 7 steps below are the literal sequence (full prerequisites, hash verification, and per-step `Verify` blocks live in SETUP.md):

```powershell
# 1. D: drive cache redirect (Windows host, 15GB Qwen weight off C:)
[Environment]::SetEnvironmentVariable("HF_HOME", "D:\hf_cache", "User")
# (restart PowerShell, then continue)

# 2. Download Qwen2.5-7B-Instruct-1M weight (~15GB, 30min-2h depending on bandwidth)
pip install --upgrade huggingface_hub
hf download Qwen/Qwen2.5-7B-Instruct-1M --cache-dir "D:\hf_cache\hub"

# 3. Python deps (uv + D: venv; WSL2 path for vllm, see SETUP.md Step 3)
$env:UV_PROJECT_ENVIRONMENT = "D:\venvs\longctx-bench-honest"
git clone https://github.com/leagames0221-sys/longctx-bench-honest.git
cd longctx-bench-honest
uv sync

# 4. Supply chain audit (rubric #15 gate)
uv run pip-audit --strict

# 5. GitHub Models token (free tier, no credit card; see SETUP.md Step 5)
"GITHUB_TOKEN=ghp_..." | Out-File .env -Encoding utf8 -NoNewline

# 6. Baseline run (single cell, ~5 minutes; reproduces artifacts/baseline_4000.json)
uv run python examples/baseline_niah.py --context-tokens 4000 --depth-pct 50

# 7. Cloud comparison (one model, ~10 seconds; reproduces artifacts/cloud_*.json)
uv run python examples/cloud_niah.py --model openai/gpt-4.1-mini --context-tokens 4000
```

The full 7-cell sweep (4k/5k/6k/8k local + 2k/4k cloud × N models) is documented in SETUP.md Step 6a-6c with expected `status` field per cell (PASS / OOM / TOKEN_LIMIT). The drift-check CI verifies that every artifact JSON matches the cost-tier table claims on every push.

## Disk layout (consumer laptop constraint, 15GB model weight)

The Qwen2.5-7B-Instruct-1M weight is ~15GB. To preserve C: drive capacity (Windows recommends 15-20% free), this repo redirects HuggingFace cache and the Python venv to D: drive:

```powershell
# Set once per user (persistent)
[Environment]::SetEnvironmentVariable("HF_HOME", "D:\hf_cache", "User")
[Environment]::SetEnvironmentVariable("HF_HUB_CACHE", "D:\hf_cache\hub", "User")

# venv on D: (uv supports custom env path)
$env:UV_PROJECT_ENVIRONMENT = "D:\venvs\longctx-bench-honest"
uv sync
```

**Lifecycle**: D: footprint (`hf_cache` ~15GB + `venvs` ~5GB) is needed only during Phase 1 install + Phase 2 measurement. Once Phase 2 JSON evidence + heatmap PNG is pushed to this repo, **D: cache is safe to delete**. The repo itself is self-contained (code + JSON + PNG = a few MB).

If a third party clones this repo and wants to re-run, the [Quickstart](#quickstart) section documents the same D: redirect pattern (any drive with ≥20GB free works).
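
Before starting the ~15GB download it can help to confirm the target drive has the ≥20GB mentioned above. This is a minimal sketch using only the standard library; the function name and the 20GB default are illustrative and not part of the repo.

```python
import shutil


def has_free_space(path: str, required_gb: float = 20.0) -> bool:
    """Return True if the drive containing `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


# Check the current directory's drive; pass the cache drive root instead
# (for example "D:\\" on the Windows setup described above).
print("enough space for weights + venv:", has_free_space("."))
```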

## Architecture

Phase 1 populates architecture diagram. Phase 0 scaffold structure:

```
.
├── CLAUDE.md               # Tier 2 PJ rules
├── spec.md                 # PJ spec SSoT
├── memory_bank/            # Cline pattern session handoff (5 files)
├── .claude/                # Tier 2 dir (skills/agents/commands/hooks)
├── .github/workflows/      # drift CI
└── LICENSE                 # MIT
```

## Memory Bank

`memory_bank/` follows the [Cline Memory Bank pattern](https://docs.cline.bot/prompting/cline-memory-bank): logbook (append-only events), activeContext (current focus), decisionLog (ADRs including the 2026-05 overhaul rationale), productContext (what/why), systemPatterns (how).

## Drift prevention

This repo treats doc/code drift as a structural failure mode. The `.github/workflows/drift-check.yml` CI runs on every push + PR and fails if claims in this README do not match repo reality. Phase 2 extends drift-check to verify that numeric claims in the cost-tier table match the JSON evidence under `artifacts/`.
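
The kind of check described here can be sketched in a few lines. This is an illustration, not the contents of `drift-check.yml`: the function name and the shape of the `expected` mapping are assumptions, and only the `status` field and the artifact file names are taken from this README.

```python
import json
from pathlib import Path


def check_artifacts(artifacts_dir, expected):
    """Compare each artifact's `status` field with the README claim.

    `expected` maps artifact file name -> claimed status. Returns a list of
    human-readable mismatches; an empty list means no drift was found.
    """
    problems = []
    for name, want in expected.items():
        path = Path(artifacts_dir) / name
        if not path.exists():
            problems.append(f"{name}: missing")
            continue
        got = json.loads(path.read_text(encoding="utf-8")).get("status")
        if got != want:
            problems.append(f"{name}: expected {want}, got {got}")
    return problems


# Claims taken from the cost-tier table.
claims = {
    "baseline_4000.json": "PASS",
    "baseline_5000.json": "OOM",
    "cloud_deepseek-deepseek-v3-0324_4000.json": "TOKEN_LIMIT",
}
for problem in check_artifacts("artifacts", claims):
    print(problem)
```

In CI, a non-empty result would be turned into a non-zero exit code so that a mismatch fails the build.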

## Why Qwen2.5-7B-1M (and not Qwen3.6 / DeepSeek V4)

Frontier 1M-context models in 2026-05 (Qwen3.6-27B, Qwen3.5-35B-A3B, DeepSeek V4, Gemma 4 26B) require multi-GPU tensor parallel for real 1M inference. See [Qwen3.6-27B model card](https://huggingface.co/Qwen/Qwen3.6-27B) — recommended `--tensor-parallel-size 8`.

The portfolio constraint is consumer laptop (single workstation, no datacenter). Qwen2.5-7B-Instruct-1M is the 2024-2025 1M-context model small enough to load on consumer hardware at all; on the 6GB laptop GPU used here the measured ceiling is 4k tokens (see [Honest results](#honest-results-phase-1-partial-evidence)). The portfolio value is *not* "I run the newest model"; it is "I make honest measurements under a real constraint, and I show where the constraint hurts."

See [decisionLog ADR-001r2](memory_bank/decisionLog.md) for the full reasoning, including the two earlier hallucinated recommendations that this overhaul corrects.

## License

MIT — see [LICENSE](LICENSE).

## Prior art

- [Qwen/Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) — Apache-2.0, 1M context LLM

- [vllm-project/vllm](https://github.com/vllm-project/vllm) — Apache-2.0, inference engine

- [NVIDIA/RULER](https://github.com/NVIDIA/RULER) — Apache-2.0, 13-task long-context benchmark (industry-current, NIAH successor)

- [THUDM/LongBench](https://github.com/THUDM/LongBench) — repo (LongBench v2, ACL 2025), 503 MCQ for reasoning depth

- [gkamradt/LLMTest_NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack) — MIT, NIAH visualizer (kept as supplementary heatmap)

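- [GitHub Models](https://github.com/marketplace/models) — free-tier OpenAI-compatible API (no CC); Phase 2a reached gpt-4.1-mini, llama-3.3-70b-instruct, and deepseek-v3-0324, while gpt-5 was unavailable and no Claude models were in the catalog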