{"id":50523862,"url":"https://github.com/metehan777/http-header-link-graph","last_synced_at":"2026-06-03T06:31:32.465Z","repository":{"id":355226225,"uuid":"1227282482","full_name":"metehan777/http-header-link-graph","owner":"metehan777","description":"Publish a site's link graph \u0026 heading map in HTTP response headers. Crawl 65k pages in 99 seconds without parsing one byte of HTML. Companion code for the SEO Week 2026 NYC experiment.","archived":false,"fork":false,"pushed_at":"2026-05-02T14:05:03.000Z","size":100,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-02T15:27:06.366Z","etag":null,"topics":["aeo","answer-engine-optimization","cloudflare-workers","crawler","generative-engine-optimization","geo","http-headers","link-graph","python","rust","seo","site-architecture","technical-seo"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/metehan777.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-02T13:17:47.000Z","updated_at":"2026-05-02T14:05:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/metehan777/http-header-link-graph","commit_stats":null,"previous_names":["metehan777/http-header-link-graph"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/metehan777/http-header-link-graph","repository_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metehan777%2Fhttp-header-link-graph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metehan777%2Fhttp-header-link-graph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metehan777%2Fhttp-header-link-graph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metehan777%2Fhttp-header-link-graph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/metehan777","download_url":"https://codeload.github.com/metehan777/http-header-link-graph/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metehan777%2Fhttp-header-link-graph/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33852289,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aeo","answer-engine-optimization","cloudflare-workers","crawler","generative-engine-optimization","geo","http-headers","link-graph","python","rust","seo","site-architecture","technical-seo"],"created_at":"2026-06-03T06:31:30.821Z","updated_at":"2026-06-03T06:31:32.454Z","avatar_url":"https://github.com/metehan777.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 
http-header — publishing a site's link graph and heading map in HTTP response headers\n\n\u003e Companion code for the post: [\"I crawled 65,000 pages of my own site without parsing a single line of HTML\"](https://metehan.ai/blog/http-headers-internal-links) (the idea was sketched at SEO Week 2026, NYC, organized by iPullRank).\n\nThis repo is a working experiment in publishing a page's structural metadata — its outbound internal links and its heading hierarchy — directly inside HTTP response headers, so crawlers, agents, and your own SEO tooling can read them without parsing any HTML.\n\nA demo site (`data.stateglobe.com`, ~65k pages) emits two families of custom headers on every page, `X-Internal-Links*` and `X-Headings*`:\n\n```http\nX-Internal-Links: \u003cbase64url(JSON array of relative paths)\u003e\nX-Internal-Links-Encoding: json+base64url\nX-Internal-Links-Count: 31\nX-Internal-Links-Bytes: 1455\n\nX-Headings: \u003cbase64url(JSON array of {l: 1-6, t: string})\u003e\nX-Headings-Encoding: json+base64url\nX-Headings-Count: 8\nX-Headings-Bytes: 534\nX-Headings-Schema: [{l:1-6,t:string}]\n\nAccess-Control-Expose-Headers: X-Internal-Links, X-Internal-Links-Encoding,\n  X-Internal-Links-Count, X-Internal-Links-Bytes, X-Headings,\n  X-Headings-Encoding, X-Headings-Count, X-Headings-Bytes, X-Headings-Schema\n```\n\nThen a Rust crawler walks the entire graph in under two minutes without parsing one byte of HTML.\n\n## What's in this repo\n\n```\nsrc/\n  headers.ts             ⭐ Drop-in TS module: attachStructuralHeaders()\n                            Enforces a combined byte budget (default 12 KB)\n                            and gracefully truncates so your origin never 500s.\n                            Pure, framework-agnostic, no runtime deps.\n  index.ts               Cloudflare Worker reference implementation that uses it\nrust-probe/              Rust crawler that reads only response headers (reqwest + tokio)\nscripts/\n  probe_100.py           100-URL targeted 
probe; captures BOTH X-Internal-Links + X-Headings\n  seo_header_probe.py    Python asyncio header-only crawler (raw sockets)\n  seo_header_probe_fast.py httpx + HTTP/2 + sitemap-seeded crawler\n  seo_insights.py        Builds SEO insights from a crawl summary (hubs, orphans,\n                         click depth, clusters, payload risk, equity Gini)\n  render_link_graph.py   Force-directed D3 graph visualization\n  test-headers-budget.mjs Stress test for the budget cap (5,000 links + headings)\nreports/probe-100/       Sample 100-URL fresh-cache probe output\nblog/                    Long-form post about the experiment\nwrangler.jsonc           Cloudflare Worker config\n```\n\n## The drop-in module\n\nIf you only want one thing from this repo, take this:\n\n```ts\nimport { attachStructuralHeaders } from \"./src/headers\";\n\nreturn attachStructuralHeaders(\n  new Response(html, { status: 200 }),\n  {\n    url: req.url,\n    links: getInternalLinks(page),  // can be huge, will be safely capped\n    headings: getHeadings(page),    // can be huge, will be safely capped\n  }\n  // Defaults: 6 KB per header, 12 KB combined.\n  // Truncated payloads emit X-Internal-Links-Truncated: 1 + X-Internal-Links-Original: N\n  // for monitoring.\n);\n```\n\nIt works in **Cloudflare Workers, Next.js middleware, Deno, Bun, Node 18+** — anywhere a `Response` and `TextEncoder` exist.\n\nVerify it never overflows:\n\n```bash\nnpm run test:budget\n# 5 passed, 0 failed\n```\n\n## Quick start\n\n```bash\n# 1. install Worker deps and run locally\nnpm install\nnpm run dev\n\n# 2. quick local sanity check — you should see X-Internal-Links populated\ncurl -sI http://127.0.0.1:8787/ | grep -i x-internal\n\n# 3. deploy to Cloudflare (uses your wrangler login)\nnpm run deploy\n\n# 4. 
run a 100-URL header probe against the deployed site\npython3 -m pip install 'httpx[http2]'\npython3 scripts/probe_100.py \\\n  --base-url https://your-domain.example.com \\\n  --count 100 \\\n  --concurrency 16 \\\n  --out-dir reports/probe-100\n```\n\n## Building and running the Rust crawler\n\n```bash\ncd rust-probe\ncargo build --release\n./target/release/header-probe \\\n  --base-url https://your-domain.example.com \\\n  --requests 70000 \\\n  --concurrency 800 \\\n  --timeout 30 \\\n  --out-dir reports/full-run\n```\n\nIt seeds the queue from `/sitemap.xml`, makes a single GET per URL, and reads only the `X-Internal-Links` header. On the demo site (`data.stateglobe.com`, 65k pages) the warm-cache run completes in **1m 39s at ~660 req/s** (peaks ~970 req/s).\n\n## Generating SEO insights from a crawl\n\nAfter a crawl writes `seo-header-summary.json`, run:\n\n```bash\npython3 scripts/seo_insights.py \\\n  --input reports/full-run/seo-header-summary.json \\\n  --out-dir reports/full-run/insights\n```\n\nYou get:\n\n- `seo-insights.md` (human-readable: hubs, orphans, dead-ends, click depth, clusters, equity Gini, payload risk, anomalies)\n- `seo-insights.json` (machine-readable)\n- `recrawl-list.txt` (URLs whose header was missing this run — re-crawl these to clean the dataset)\n\n## Production warning — read this before shipping it\n\n**This is an experiment.** If you do this wrong, you can break your own site. Specifically:\n\n1. **HTTP response header size limits are real and vary by server / CDN.** The combined size of all response headers must fit under your origin's limit (Cloudflare default ~16 KB, many origins enforce 8 KB or less). If you push too much JSON into too many custom headers, the origin will return a `5xx` to real users, not just to crawlers.\n2. **High-link or deep-heading hub pages are the danger zone.** A homepage with 200+ links and a long heading map can easily blow past 16 KB. Test every hub.\n3. 
**Always cap the payload defensively.** Implement a hard byte limit (e.g. 6 KB per header, 12 KB combined) and gracefully truncate or omit the header when over budget. Better to ship 50 of 200 links than to 500 the page.\n4. **Cache it at the edge.** The first crawl will hit your Worker for every URL (slow). Cache the response with `caches.default.put` and a sane `Cache-Control`, then purge once when the header shape changes.\n5. **Do not roll this out without your dev team.** Especially in enterprise. This touches your CDN config, your origin response-header budget, and your bot-handling rules. Coordinate with platform/SRE and SEO together. Run it on a small subset of pages first, monitor 5xx rates, and roll forward only after a clean staging run.\n6. **Scope this to sites you own.** It's a publishing technique for site owners, not a bypass tool for someone else's WAF.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmetehan777%2Fhttp-header-link-graph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmetehan777%2Fhttp-header-link-graph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmetehan777%2Fhttp-header-link-graph/lists"}