{"id":50138819,"url":"https://github.com/casoon/astro-crawler-policy","last_synced_at":"2026-05-24T00:03:03.690Z","repository":{"id":350206566,"uuid":"1205804155","full_name":"casoon/astro-crawler-policy","owner":"casoon","description":"Policy-first crawler control for Astro — generates robots.txt and llms.txt with presets, per-bot rules, AI crawler registry, and build-time audits.","archived":false,"fork":false,"pushed_at":"2026-04-09T10:31:39.000Z","size":36,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-09T11:25:36.639Z","etag":null,"topics":["ai-crawler","astro","astro-integration","crawler","llms-txt","robots-txt","seo","typescript"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/casoon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-09T09:42:55.000Z","updated_at":"2026-04-09T10:31:43.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/casoon/astro-crawler-policy","commit_stats":null,"previous_names":["casoon/astro-crawler-policy"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/casoon/astro-crawler-policy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casoon%2Fastro-crawler-policy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casoon%2Fastro-crawl
er-policy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casoon%2Fastro-crawler-policy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casoon%2Fastro-crawler-policy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/casoon","download_url":"https://codeload.github.com/casoon/astro-crawler-policy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/casoon%2Fastro-crawler-policy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33416316,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T22:14:44.296Z","status":"ssl_error","status_checked_at":"2026-05-23T22:14:43.778Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-crawler","astro","astro-integration","crawler","llms-txt","robots-txt","seo","typescript"],"created_at":"2026-05-24T00:02:58.148Z","updated_at":"2026-05-24T00:03:03.680Z","avatar_url":"https://github.com/casoon.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003e **This package is no longer actively maintained.**\n\u003e It has been superseded by [@casoon/astro-site-files](https://github.com/casoon/astro-site-files), which bundles robots.txt, llms.txt, sitemap.xml, security.txt, and humans.txt in a single integration. 
New features and fixes will only be made there.\n\n# @casoon/astro-crawler-policy\n\nPolicy-first crawler control for Astro. Generates `robots.txt` (and optionally `llms.txt`) from a typed configuration at build time.\n\n## What it does\n\n- Generates `robots.txt` from a typed configuration — no manual file editing required\n- Applies one of five built-in **presets** covering the most common use cases\n- Supports **content signals** (`search`, `ai-input`, `ai-train`) for newer crawler directives\n- Includes a **bot registry** with 13 known crawlers for per-bot and group-based rules\n- **Merges** the generated output with an existing `public/robots.txt` (replace / prepend / append)\n- Runs **build-time audits** that warn about common misconfigurations\n- Optionally generates **`llms.txt`** — a markdown summary of the AI content policy\n- Supports **environment-specific overrides** (e.g. lockdown on staging)\n\nThis plugin renders crawler policy. It does not enforce blocking at the network, WAF, or edge layer.\n\n## Installation\n\n```sh\nnpm install @casoon/astro-crawler-policy\n```\n\n## Quick start\n\n```ts\n// astro.config.ts\nimport { defineConfig } from 'astro/config';\nimport crawlerPolicy from '@casoon/astro-crawler-policy';\n\nexport default defineConfig({\n  site: 'https://example.com',\n  integrations: [\n    crawlerPolicy({\n      preset: 'citationFriendly',\n      sitemaps: ['/sitemap-index.xml']\n    })\n  ]\n});\n```\n\nThe plugin hooks into `astro:build:done` and writes `dist/robots.txt`. With just these two options you get sensible defaults: search engines allowed, verified AI bots allowed for citation, AI training bots blocked.\n\n## Presets\n\nPresets are the primary way to express intent. 
Each preset sets default content signals and group-level rules.\n\n| Preset | Search | AI citation | AI training | Unknown AI |\n|---|---|---|---|---|\n| `seoOnly` | allow | disallow | disallow | disallow |\n| `citationFriendly` *(default)* | allow | allow | disallow | disallow |\n| `openToAi` | allow | allow | allow | allow |\n| `blockTraining` | allow | allow | disallow | disallow |\n| `lockdown` | disallow | disallow | disallow | disallow |\n\n`citationFriendly` allows bots that do citation or summarization but blocks bots whose only purpose is training data collection (GPTBot, Google-Extended, CCBot, Bytespider, Applebot-Extended). Bots with mixed roles like ClaudeBot are allowed.\n\n`blockTraining` goes further and blocks every bot with any training category, including mixed bots like ClaudeBot and meta-externalagent.\n\n`lockdown` adds a global `User-agent: * / Disallow: /` rule, overriding everything.\n\n## Content signals\n\nContent signals are non-standard directives appended to the wildcard `User-agent: *` block:\n\n```\nUser-agent: *\nContent-Signal: search=yes, ai-input=yes, ai-train=no\nAllow: /\n```\n\nThey communicate intent to crawlers that support them. The configuration uses camelCase keys, while the rendered directive uses the hyphenated keys shown above:\n\n| Config key | Directive key | Meaning |\n|---|---|---|\n| `search` | `search` | Indexing for traditional search engines |\n| `aiInput` | `ai-input` | Using content as input for AI responses (citation, summarization) |\n| `aiTrain` | `ai-train` | Using content as AI training data |\n\nThe directive name and directive keys follow the [contentsignals.org](https://contentsignals.org) specification (proposed IETF aipref standard). Google Search Console may flag them as unrecognised directives; the audit system emits an `info` message when they are present.\n\nEach preset sets default values for all three signals. 
You can override them individually:\n\n```ts\ncrawlerPolicy({\n  preset: 'citationFriendly',\n  contentSignals: {\n    aiTrain: true  // override just this one; search and aiInput come from the preset\n  }\n})\n```\n\n## Groups and per-bot rules\n\nRules are resolved in layers, from least to most specific:\n\n1. **Preset** — sets group-level defaults\n2. **`groups`** — overrides for entire bot categories\n3. **`bots`** — overrides for individual bots by registry ID\n\nA bot's final action is the most specific rule that applies to it. An explicit entry in `bots` always wins over a `groups` setting.\n\n```ts\ncrawlerPolicy({\n  preset: 'citationFriendly',\n\n  // Override an entire group\n  groups: {\n    searchEngines: 'allow',  // default\n    verifiedAi: 'allow',     // default\n    unknownAi: 'disallow'    // default\n  },\n\n  // Override individual bots (takes precedence over groups)\n  bots: {\n    GPTBot: 'disallow',   // blocks this bot even if verifiedAi is 'allow'\n    ClaudeBot: 'allow'    // allows this bot even if verifiedAi were 'disallow'\n  }\n})\n```\n\nThe three groups are:\n- **`searchEngines`** — bots with category `search` (Googlebot, Bingbot)\n- **`verifiedAi`** — verified bots with AI categories (`ai-search`, `ai-input`, `ai-training`)\n- **`unknownAi`** — unverified bots or bots with category `unknown-ai`\n\nWhen a bot's action resolves to `'inherit'` (no group or preset covers it), the bot is omitted from the output.\n\n## Custom rules\n\nFor anything not covered by the preset or registry, use `rules` to add raw robots.txt directives:\n\n```ts\ncrawlerPolicy({\n  rules: [\n    {\n      userAgent: '*',\n      disallow: ['/admin/', '/internal/'],\n      crawlDelay: 2\n    },\n    {\n      userAgent: 'Slurp',\n      disallow: ['/']\n    }\n  ]\n})\n```\n\nA `userAgent: '*'` rule in `rules` is merged with the wildcard block that the preset generates — it does not create a second `User-agent: *` section.\n\nAvailable fields per rule:\n\n| Field | 
Type | Description |\n|---|---|---|\n| `userAgent` | `string \\| string[]` | One or more User-agent values |\n| `allow` | `string[]` | Paths to allow |\n| `disallow` | `string[]` | Paths to disallow |\n| `crawlDelay` | `number` | Crawl-delay in seconds |\n| `comment` | `string` | Inline comment above the rule |\n\n## Merge strategy\n\nWhen a `public/robots.txt` already exists, the merge strategy controls how it is combined with the generated output.\n\n| Strategy | Result |\n|---|---|\n| `prepend` *(default)* | Generated output first, then existing file |\n| `append` | Existing file first, then generated output |\n| `replace` | Generated output only, existing file ignored |\n\n```ts\ncrawlerPolicy({\n  mergeStrategy: 'prepend'\n})\n```\n\nUse `prepend` to let the generated policy take precedence. Use `append` to keep hand-written rules at the top. Use `replace` when you want full control from config and no manual overrides.\n\n## Environment overrides\n\nThe plugin detects the current environment from these variables, in order:\n\n1. `CONTEXT` (Netlify)\n2. `DEPLOYMENT_ENVIRONMENT`\n3. `NODE_ENV`\n4. Falls back to `'production'`\n\nUse `env` to apply different settings per environment:\n\n```ts\ncrawlerPolicy({\n  preset: 'citationFriendly',\n  env: {\n    staging: { preset: 'lockdown' },\n    preview: { preset: 'lockdown' }\n  }\n})\n```\n\nAny option can be overridden per environment. Nested objects (`contentSignals`, `bots`, `groups`) are merged — not replaced — with the base config.\n\n## Output files\n\n```ts\ncrawlerPolicy({\n  output: {\n    robotsTxt: true,  // default — writes dist/robots.txt\n    llmsTxt: true     // opt-in — writes dist/llms.txt\n  }\n})\n```\n\n### llms.txt\n\nWhen `output.llmsTxt: true` is set, the plugin generates `dist/llms.txt` alongside `robots.txt`. 
The file is a Markdown summary of the AI content policy — which crawlers are allowed or blocked, what signals are active, and where the sitemap is:\n\n```md\n# example.com\n\n\u003e AI content access policy for example.com.\n\u003e Generated by @casoon/astro-crawler-policy (preset: citationFriendly).\n\n## Content Policy\n\n- Search indexing: allowed\n- AI citation and summarization: allowed\n- AI training data collection: not allowed\n\n## AI Systems\n\n### Allowed\n- OAI-SearchBot (OpenAI)\n- ClaudeBot (Anthropic)\n- claude-web (Anthropic)\n- PerplexityBot (Perplexity)\n- meta-externalagent (Meta)\n- Amazonbot (Amazon)\n- Googlebot (Google)\n- Bingbot (Microsoft)\n\n### Blocked\n- GPTBot (OpenAI)\n- Google-Extended (Google)\n- CCBot (Common Crawl)\n- Bytespider (ByteDance)\n- Applebot-Extended (Apple)\n\n## Sitemaps\n\n- https://example.com/sitemap-index.xml\n```\n\n## Debug mode\n\nSet `debug: true` to print the resolved configuration to the build log:\n\n```ts\ncrawlerPolicy({ debug: true })\n```\n\nBuild output:\n\n```\n[@casoon/astro-crawler-policy] [debug] registry version: 2026-04-09\n[@casoon/astro-crawler-policy] [debug] environment: production\n[@casoon/astro-crawler-policy] [debug] preset: citationFriendly\n[@casoon/astro-crawler-policy] [debug] content signals: search=yes, aiInput=yes, aiTrain=no\n[@casoon/astro-crawler-policy] [debug] bot: GPTBot → disallow\n[@casoon/astro-crawler-policy] [debug] bot: OAI-SearchBot → allow\n...\n[@casoon/astro-crawler-policy] [debug] sitemap: https://example.com/sitemap-index.xml\n```\n\n## Bot registry\n\nThe following bots are known and can be referenced by ID in `bots: {}`:\n\n| ID | Provider | Categories | Group |\n|---|---|---|---|\n| `GPTBot` | OpenAI | ai-training | verifiedAi |\n| `OAI-SearchBot` | OpenAI | ai-search, ai-input | verifiedAi |\n| `ClaudeBot` | Anthropic | ai-input, ai-training | verifiedAi |\n| `claude-web` | Anthropic | ai-input | verifiedAi |\n| `Google-Extended` | Google | ai-training | 
verifiedAi |\n| `CCBot` | Common Crawl | ai-training | verifiedAi |\n| `PerplexityBot` | Perplexity | ai-search, ai-input | verifiedAi |\n| `Bytespider` | ByteDance | ai-training | verifiedAi |\n| `meta-externalagent` | Meta | ai-input, ai-training | verifiedAi |\n| `Amazonbot` | Amazon | ai-search, ai-input | verifiedAi |\n| `Applebot-Extended` | Apple | ai-training | verifiedAi |\n| `Googlebot` | Google | search | searchEngines |\n| `Bingbot` | Microsoft | search | searchEngines |\n\n## Extending the registry\n\nThe built-in registry covers the most common crawlers. To support bots not yet listed, use `extraBots`:\n\n```ts\ncrawlerPolicy({\n  extraBots: [\n    {\n      id: 'MyCustomBot',\n      provider: 'Acme Corp',\n      userAgents: ['MyCustomBot/1.0'],\n      categories: ['ai-training'],\n      verified: true\n    }\n  ],\n  bots: {\n    MyCustomBot: 'disallow'\n  }\n})\n```\n\nExtra bots participate in group rules, per-bot overrides, audit checks, and `llms.txt` output — the same as built-in bots.\n\n**Keeping the registry up to date:** The registry ships as part of the package. As new crawlers emerge, updates are released as patch versions. Run `npm update @casoon/astro-crawler-policy` to get the latest bot data. 
The `REGISTRY_VERSION` export contains the date of the last registry update.\n\n## Audit warnings\n\nThe plugin emits warnings and info messages during the build:\n\n| Code | Level | Condition |\n|---|---|---|\n| `MISSING_SITE_URL` | warn | No `site` set in Astro config |\n| `NO_SITEMAP` | info | No sitemaps configured |\n| `DUPLICATE_USER_AGENT_RULE` | warn | Two rules share the same User-agent |\n| `UNLOCKED_NON_PRODUCTION_ENVIRONMENT` | warn | Staging/preview not globally blocked |\n| `NON_STANDARD_DIRECTIVES` | info | Content signals may trigger GSC syntax warnings |\n| `AI_INPUT_WITHOUT_ALLOWED_BOTS` | warn | `aiInput` enabled but all AI bots blocked |\n| `UNKNOWN_BOT_ID` | warn | A bot ID in `bots: {}` is not in the registry |\n| `GROUP_BOT_OVERRIDE_CONFLICT` | info | Bot override contradicts its group rule |\n\nAudit settings:\n\n```ts\ncrawlerPolicy({\n  audit: {\n    warnOnMissingSitemap: true,  // default\n    warnOnConflicts: true        // default\n  }\n})\n```\n\n## Programmatic usage\n\nThe core modules are exported for use outside of the Astro integration:\n\n```ts\nimport {\n  compilePolicy,\n  renderRobotsTxt,\n  renderLlmsTxt,\n  auditPolicy,\n  defaultRegistry,\n  REGISTRY_VERSION\n} from '@casoon/astro-crawler-policy';\n\nconst policy = compilePolicy({\n  options: { preset: 'citationFriendly', sitemaps: ['/sitemap-index.xml'] },\n  site: 'https://example.com',\n  environment: 'production'\n});\n\nconst robotsTxt = renderRobotsTxt(policy);\nconst llmsTxt = renderLlmsTxt(policy, 'https://example.com');\nconst issues = auditPolicy(policy, { site: 'https://example.com', registry: defaultRegistry });\n```\n\n## Generated output examples\n\n### citationFriendly (default)\n\n```ts\ncrawlerPolicy({\n  preset: 'citationFriendly',\n  sitemaps: ['/sitemap-index.xml']\n})\n```\n\n```\n# Generated by @casoon/astro-crawler-policy\n# preset: citationFriendly\n\nUser-agent: *\nContent-Signal: search=yes, ai-input=yes, ai-train=no\nAllow: /\n\nUser-agent: 
GPTBot\nDisallow: /\n\nUser-agent: OAI-SearchBot\nAllow: /\n\nUser-agent: ClaudeBot\nAllow: /\n\nUser-agent: claude-web\nAllow: /\n\nUser-agent: Google-Extended\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n\nUser-agent: PerplexityBot\nAllow: /\n\nUser-agent: Bytespider\nDisallow: /\n\nUser-agent: meta-externalagent\nAllow: /\n\nUser-agent: Amazonbot\nAllow: /\n\nUser-agent: Applebot-Extended\nDisallow: /\n\nUser-agent: Googlebot\nAllow: /\n\nUser-agent: Bingbot\nAllow: /\n\nSitemap: https://example.com/sitemap-index.xml\n```\n\n### seoOnly\n\n```ts\ncrawlerPolicy({ preset: 'seoOnly' })\n```\n\n```\n# Generated by @casoon/astro-crawler-policy\n# preset: seoOnly\n\nUser-agent: *\nContent-Signal: search=yes, ai-input=no, ai-train=no\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n\nUser-agent: OAI-SearchBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n\nUser-agent: claude-web\nDisallow: /\n\nUser-agent: Google-Extended\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n\nUser-agent: PerplexityBot\nDisallow: /\n\nUser-agent: Bytespider\nDisallow: /\n\nUser-agent: meta-externalagent\nDisallow: /\n\nUser-agent: Amazonbot\nDisallow: /\n\nUser-agent: Applebot-Extended\nDisallow: /\n\nUser-agent: Googlebot\nAllow: /\n\nUser-agent: Bingbot\nAllow: /\n```\n\n### lockdown (staging/preview)\n\n```ts\ncrawlerPolicy({\n  env: {\n    staging: { preset: 'lockdown' },\n    preview: { preset: 'lockdown' }\n  }\n})\n```\n\nWhen `CONTEXT=staging` or `NODE_ENV=staging`:\n\n```\n# Generated by @casoon/astro-crawler-policy\n# preset: lockdown\n\nUser-agent: *\nContent-Signal: search=no, ai-input=no, ai-train=no\nDisallow: /\n```\n\n---\n\n\u003e This tool only works for crawlers and AI bots that actually respect robots.txt. 
Respect, however, is rare these days.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcasoon%2Fastro-crawler-policy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcasoon%2Fastro-crawler-policy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcasoon%2Fastro-crawler-policy/lists"}