https://github.com/casoon/astro-crawler-policy

Policy-first crawler control for Astro — generates robots.txt and llms.txt with presets, per-bot rules, AI crawler registry, and build-time audits.
Host: GitHub
URL: https://github.com/casoon/astro-crawler-policy
Owner: casoon
License: mit
Created: 2026-04-09T09:42:55.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-04-09T10:31:39.000Z (4 months ago)
Last Synced: 2026-04-09T11:25:36.639Z (4 months ago)
Topics: ai-crawler, astro, astro-integration, crawler, llms-txt, robots-txt, seo, typescript
Language: TypeScript
Size: 35.2 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

> **This package is no longer actively maintained.**
>
> It has been superseded by [@casoon/astro-site-files](https://github.com/casoon/astro-site-files), which bundles robots.txt, llms.txt, sitemap.xml, security.txt, and humans.txt in a single integration. New features and fixes will only be made there.

# @casoon/astro-crawler-policy

Policy-first crawler control for Astro. Generates `robots.txt` (and optionally `llms.txt`) from a typed configuration at build time.

## What it does

- Generates `robots.txt` from a typed configuration — no manual file editing required

- Applies one of five built-in **presets** covering the most common use cases

- Supports **content signals** (`search`, `ai-input`, `ai-train`) for newer crawler directives

- Includes a **bot registry** with 13 known crawlers for per-bot and group-based rules

- **Merges** the generated output with an existing `public/robots.txt` (replace / prepend / append)

- Runs **build-time audits** that warn about common misconfigurations

- Optionally generates **`llms.txt`** — a markdown summary of the AI content policy

- Supports **environment-specific overrides** (e.g. lockdown on staging)

This plugin renders crawler policy. It does not enforce blocking at the network, WAF, or edge layer.

## Installation

```sh
npm install @casoon/astro-crawler-policy
```

## Quick start

```ts
// astro.config.ts
import { defineConfig } from 'astro/config';
import crawlerPolicy from '@casoon/astro-crawler-policy';

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    crawlerPolicy({
      preset: 'citationFriendly',
      sitemaps: ['/sitemap-index.xml']
    })
  ]
});
```

The plugin hooks into `astro:build:done` and writes `dist/robots.txt`. With just these two options you get sensible defaults: search engines allowed, verified AI bots allowed for citation, AI training bots blocked.
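
As a rough illustration of the hook mechanism, here is a minimal sketch of an integration-shaped object that writes a file when `astro:build:done` fires. It is not the package's implementation: `robotsIntegration`, `IntegrationSketch`, and the injected `WriteFile` callback are hypothetical names, and the local interface only mirrors the small part of Astro's integration shape (a `name`, a `hooks` map, and a `dir` URL passed to the hook) that the sketch needs.

```ts
// Hypothetical sketch, not the package's source.
// Minimal structural type for the part of an Astro integration used here.
interface IntegrationSketch {
  name: string;
  hooks: { 'astro:build:done': (ctx: { dir: URL }) => Promise<void> };
}

// The writer is injected so the sketch stays free of filesystem imports.
type WriteFile = (target: URL, contents: string) => Promise<void>;

function robotsIntegration(contents: string, write: WriteFile): IntegrationSketch {
  return {
    name: 'robots-sketch',
    hooks: {
      'astro:build:done': async ({ dir }) => {
        // `dir` is the build output directory as a file URL,
        // so the target resolves to <output dir>/robots.txt.
        await write(new URL('robots.txt', dir), contents);
      }
    }
  };
}

// In-memory usage: record the target instead of touching the disk.
const written: string[] = [];
const integration = robotsIntegration('User-agent: *\nAllow: /\n', async (target) => {
  written.push(target.href);
});
void integration.hooks['astro:build:done']({ dir: new URL('file:///tmp/dist/') });
console.log(written); // contains "file:///tmp/dist/robots.txt"
```

In a real integration the writer could be `writeFile` from `node:fs/promises`, which accepts file URLs.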

## Presets

Presets are the primary way to express intent. Each preset sets default content signals and group-level rules.

| Preset | Search | AI citation | AI training | Unknown AI |
|---|---|---|---|---|
| `seoOnly` | allow | disallow | disallow | disallow |
| `citationFriendly` *(default)* | allow | allow | disallow | disallow |
| `openToAi` | allow | allow | allow | allow |
| `blockTraining` | allow | allow | disallow | disallow |
| `lockdown` | disallow | disallow | disallow | disallow |

`citationFriendly` allows bots that do citation or summarization but blocks bots whose only purpose is training data collection (GPTBot, Google-Extended, CCBot, Bytespider, Applebot-Extended). Bots with mixed roles like ClaudeBot are allowed.

`blockTraining` goes further and blocks every bot with any training category, including mixed bots like ClaudeBot and meta-externalagent.

`lockdown` adds a global `User-agent: * / Disallow: /` rule, overriding everything.
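
The difference between `citationFriendly` and `blockTraining` comes down to how a bot's categories are tested. The following is a minimal sketch of that distinction, assuming the category names used in the bot registry table; `blocksBot` and the type names are hypothetical and not part of the package API.

```ts
// Hypothetical sketch: how the two presets might treat a verified AI bot.
type Category = 'search' | 'ai-search' | 'ai-input' | 'ai-training' | 'unknown-ai';
type TrainingPreset = 'citationFriendly' | 'blockTraining';

function blocksBot(preset: TrainingPreset, categories: Category[]): boolean {
  const trains = categories.some((c) => c === 'ai-training');
  if (preset === 'blockTraining') {
    // Any training category is enough to block, even for mixed-role bots.
    return trains;
  }
  // citationFriendly: block only when training is the bot's sole purpose.
  return trains && categories.every((c) => c === 'ai-training');
}

const claudeBot: Category[] = ['ai-input', 'ai-training']; // mixed roles
const gptBot: Category[] = ['ai-training'];                // training only

console.log(blocksBot('citationFriendly', claudeBot)); // false
console.log(blocksBot('blockTraining', claudeBot));    // true
console.log(blocksBot('citationFriendly', gptBot));    // true
```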

## Content signals

Content signals are non-standard directives appended to the wildcard `User-agent: *` block:

```
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```

They communicate intent to crawlers that support them. The three signals map to:

| Config key | Directive key | Meaning |
|---|---|---|
| `search` | `search` | Indexing for traditional search engines |
| `aiInput` | `ai-input` | Using content as input for AI responses (citation, summarization) |
| `aiTrain` | `ai-train` | Using content as AI training data |

The directive name and signal keys follow the [contentsignals.org](https://contentsignals.org) specification (proposed IETF aipref standard). Google Search Console may flag them as unrecognised directives — the audit system emits an `info` message when they are present.

Each preset sets default values for all three signals. You can override them individually:

```ts
crawlerPolicy({
  preset: 'citationFriendly',
  contentSignals: {
    aiTrain: true  // override just this one; search and aiInput come from the preset
  }
})
```
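
To make the relationship between the camelCase config keys and the rendered directive concrete, here is a minimal sketch of the override-then-render step. It assumes the `citationFriendly` defaults shown in the generated output (`search=yes, ai-input=yes, ai-train=no`); `renderContentSignal` and `ContentSignals` are hypothetical names, not the package's exports.

```ts
// Hypothetical sketch: config object in, Content-Signal line out.
interface ContentSignals {
  search: boolean;
  aiInput: boolean;
  aiTrain: boolean;
}

function renderContentSignal(signals: ContentSignals): string {
  const yesNo = (value: boolean): string => (value ? 'yes' : 'no');
  const parts = [
    `search=${yesNo(signals.search)}`,
    `ai-input=${yesNo(signals.aiInput)}`,
    `ai-train=${yesNo(signals.aiTrain)}`
  ];
  return 'Content-Signal: ' + parts.join(', ');
}

// Preset defaults for citationFriendly, with one key overridden by the user.
const presetDefaults: ContentSignals = { search: true, aiInput: true, aiTrain: false };
const overrides: Partial<ContentSignals> = { aiTrain: true };
const resolved: ContentSignals = { ...presetDefaults, ...overrides };

console.log(renderContentSignal(resolved));
// Content-Signal: search=yes, ai-input=yes, ai-train=yes
```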

## Groups and per-bot rules

Rules are resolved in layers, from least to most specific:

1. **Preset** — sets group-level defaults

2. **`groups`** — overrides for entire bot categories

3. **`bots`** — overrides for individual bots by registry ID

A bot's final action is the most specific rule that applies to it. An explicit entry in `bots` always wins over a `groups` setting.

```ts
crawlerPolicy({
  preset: 'citationFriendly',

  // Override an entire group
  groups: {
    searchEngines: 'allow',  // default
    verifiedAi: 'allow',     // default
    unknownAi: 'disallow'    // default
  },

  // Override individual bots (takes precedence over groups)
  bots: {
    GPTBot: 'disallow',   // blocks this bot even if verifiedAi is 'allow'
    ClaudeBot: 'allow'    // allows this bot even if verifiedAi were 'disallow'
  }
})
```

The three groups are:

- **`searchEngines`** — bots with category `search` (Googlebot, Bingbot)

- **`verifiedAi`** — verified bots with AI categories (`ai-search`, `ai-input`, `ai-training`)

- **`unknownAi`** — unverified bots or bots with category `unknown-ai`

When a bot's action resolves to `'inherit'` (no group or preset covers it), the bot is omitted from the output.
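
The layered lookup described in this section can be sketched as a chain of fallbacks. This is a simplified illustration rather than the package's code: it treats the preset as a plain map of group defaults, and `resolveAction` and `Layers` are hypothetical names.

```ts
// Hypothetical sketch of most-specific-wins resolution.
type Action = 'allow' | 'disallow' | 'inherit';
type Group = 'searchEngines' | 'verifiedAi' | 'unknownAi';

interface Layers {
  presetGroups: Partial<Record<Group, Action>>; // defaults supplied by the preset
  groups?: Partial<Record<Group, Action>>;      // user overrides per group
  bots?: Record<string, Action>;                // user overrides per bot ID
}

function resolveAction(botId: string, group: Group, layers: Layers): Action {
  // Check the most specific layer first: bots, then groups, then the preset.
  return (
    layers.bots?.[botId] ??
    layers.groups?.[group] ??
    layers.presetGroups[group] ??
    'inherit' // nothing covers this bot, so it would be omitted from the output
  );
}

const layers: Layers = {
  presetGroups: { searchEngines: 'allow', verifiedAi: 'allow', unknownAi: 'disallow' },
  groups: { verifiedAi: 'disallow' },
  bots: { ClaudeBot: 'allow' }
};

console.log(resolveAction('ClaudeBot', 'verifiedAi', layers));     // allow (bots entry wins)
console.log(resolveAction('PerplexityBot', 'verifiedAi', layers)); // disallow (groups override)
console.log(resolveAction('Googlebot', 'searchEngines', layers));  // allow (preset default)
```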

## Custom rules

For anything not covered by the preset or registry, use `rules` to add raw robots.txt directives:

```ts
crawlerPolicy({
  rules: [
    {
      userAgent: '*',
      disallow: ['/admin/', '/internal/'],
      crawlDelay: 2
    },
    {
      userAgent: 'Slurp',
      disallow: ['/']
    }
  ]
})
```

A `userAgent: '*'` rule in `rules` is merged with the wildcard block that the preset generates — it does not create a second `User-agent: *` section.

Available fields per rule:

| Field | Type | Description |
|---|---|---|
| `userAgent` | `string \| string[]` | One or more User-agent values |
| `allow` | `string[]` | Paths to allow |
| `disallow` | `string[]` | Paths to disallow |
| `crawlDelay` | `number` | Crawl-delay in seconds |
| `comment` | `string` | Inline comment above the rule |
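
To show how these fields might translate into robots.txt text, here is a minimal rendering sketch. The field names come from the table above; the `renderRule` function, the `Rule` interface name, and the ordering of `Allow` before `Disallow` are assumptions made for illustration.

```ts
// Hypothetical sketch: one rule object in, one robots.txt group out.
interface Rule {
  userAgent: string | string[];
  allow?: string[];
  disallow?: string[];
  crawlDelay?: number;
  comment?: string;
}

function renderRule(rule: Rule): string {
  const lines: string[] = [];
  if (rule.comment) lines.push(`# ${rule.comment}`);
  // A rule may target several user agents; each gets its own line.
  const agents = Array.isArray(rule.userAgent) ? rule.userAgent : [rule.userAgent];
  for (const agent of agents) lines.push(`User-agent: ${agent}`);
  for (const path of rule.allow ?? []) lines.push(`Allow: ${path}`);
  for (const path of rule.disallow ?? []) lines.push(`Disallow: ${path}`);
  if (rule.crawlDelay !== undefined) lines.push(`Crawl-delay: ${rule.crawlDelay}`);
  return lines.join('\n');
}

console.log(renderRule({ userAgent: 'Slurp', disallow: ['/'] }));
// User-agent: Slurp
// Disallow: /
```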

## Merge strategy

When a `public/robots.txt` already exists, the merge strategy controls how it is combined with the generated output.

| Strategy | Result |
|---|---|
| `prepend` *(default)* | Generated output first, then existing file |
| `append` | Existing file first, then generated output |
| `replace` | Generated output only, existing file ignored |

```ts
crawlerPolicy({
  mergeStrategy: 'prepend'
})
```

Use `prepend` to let the generated policy take precedence. Use `append` to keep hand-written rules at the top. Use `replace` when you want full control from config and no manual overrides.
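
The three strategies amount to a choice of ordering. A minimal sketch, assuming the two inputs are plain strings and that the parts are separated by one blank line (the separator and the `mergeRobotsTxt` name are assumptions, not documented behavior):

```ts
// Hypothetical sketch of the three merge strategies.
type MergeStrategy = 'prepend' | 'append' | 'replace';

function mergeRobotsTxt(
  generated: string,
  existing: string | null, // null when public/robots.txt does not exist
  strategy: MergeStrategy
): string {
  if (existing === null || strategy === 'replace') return generated;
  const parts = strategy === 'prepend' ? [generated, existing] : [existing, generated];
  // Trim each part and join with one blank line so groups stay separated.
  return parts.map((part) => part.trim()).join('\n\n') + '\n';
}

const generated = 'User-agent: *\nAllow: /\n';
const existing = 'User-agent: Slurp\nDisallow: /\n';

console.log(mergeRobotsTxt(generated, existing, 'prepend')); // generated block first
console.log(mergeRobotsTxt(generated, existing, 'append'));  // existing block first
console.log(mergeRobotsTxt(generated, existing, 'replace')); // generated block only
```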

## Environment overrides

The plugin detects the current environment from these variables, in order:

1. `CONTEXT` (Netlify)

2. `DEPLOYMENT_ENVIRONMENT`

3. `NODE_ENV`

4. Falls back to `'production'`
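
The lookup order above can be sketched as a short fallback chain. This is an illustration only: `detectEnvironment` is a hypothetical name, and treating an empty string as unset is an assumption.

```ts
// Hypothetical sketch of the documented lookup order.
type EnvVars = Record<string, string | undefined>;

function detectEnvironment(env: EnvVars): string {
  // The first non-empty variable wins; 'production' is the fallback.
  return env['CONTEXT'] || env['DEPLOYMENT_ENVIRONMENT'] || env['NODE_ENV'] || 'production';
}

// In a build this would typically receive process.env.
console.log(detectEnvironment({ NODE_ENV: 'staging' }));                        // staging
console.log(detectEnvironment({ CONTEXT: 'preview', NODE_ENV: 'production' })); // preview
console.log(detectEnvironment({}));                                             // production
```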

Use `env` to apply different settings per environment:

```ts
crawlerPolicy({
  preset: 'citationFriendly',
  env: {
    staging: { preset: 'lockdown' },
    preview: { preset: 'lockdown' }
  }
})
```

Any option can be overridden per environment. Nested objects (`contentSignals`, `bots`, `groups`) are merged — not replaced — with the base config.
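
The merge behavior for nested objects can be illustrated with a small sketch. It assumes a one-level merge of `contentSignals`, `groups`, and `bots` and a plain overwrite for everything else; `applyEnvironment` and the option interfaces are hypothetical simplifications of the real option types.

```ts
// Hypothetical sketch: applying a per-environment override to the base config.
type Action = 'allow' | 'disallow' | 'inherit';

interface BaseOptions {
  preset?: string;
  contentSignals?: { search?: boolean; aiInput?: boolean; aiTrain?: boolean };
  groups?: Record<string, Action>;
  bots?: Record<string, Action>;
}

interface Options extends BaseOptions {
  env?: Record<string, BaseOptions>;
}

function applyEnvironment(options: Options, environment: string): BaseOptions {
  const { env, ...base } = options;
  const override = env?.[environment];
  if (!override) return base;
  return {
    ...base,
    ...override,
    // Nested objects are merged key by key rather than replaced wholesale.
    contentSignals: { ...base.contentSignals, ...override.contentSignals },
    groups: { ...base.groups, ...override.groups },
    bots: { ...base.bots, ...override.bots }
  };
}

const effective = applyEnvironment(
  {
    preset: 'citationFriendly',
    bots: { GPTBot: 'disallow' },
    env: { staging: { preset: 'lockdown', bots: { ClaudeBot: 'disallow' } } }
  },
  'staging'
);

console.log(effective.preset); // lockdown
console.log(effective.bots);   // keeps GPTBot from the base config and adds ClaudeBot
```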

## Output files

```ts
crawlerPolicy({
  output: {
    robotsTxt: true,  // default: writes dist/robots.txt
    llmsTxt: true     // opt-in: writes dist/llms.txt
  }
})
```

### llms.txt

When `output.llmsTxt: true` is set, the plugin generates `dist/llms.txt` alongside `robots.txt`. The file is a Markdown summary of the AI content policy — which crawlers are allowed or blocked, what signals are active, and where the sitemap is:

```md
# example.com

> AI content access policy for example.com.
> Generated by @casoon/astro-crawler-policy (preset: citationFriendly).

## Content Policy

- Search indexing: allowed
- AI citation and summarization: allowed
- AI training data collection: not allowed

## AI Systems

### Allowed

- OAI-SearchBot (OpenAI)
- ClaudeBot (Anthropic)
- claude-web (Anthropic)
- PerplexityBot (Perplexity)
- meta-externalagent (Meta)
- Amazonbot (Amazon)
- Googlebot (Google)
- Bingbot (Microsoft)

### Blocked

- GPTBot (OpenAI)
- Google-Extended (Google)
- CCBot (Common Crawl)
- Bytespider (ByteDance)
- Applebot-Extended (Apple)

## Sitemaps

- https://example.com/sitemap-index.xml
```

## Debug mode

Set `debug: true` to print the resolved configuration to the build log:

```ts
crawlerPolicy({ debug: true })
```

Build output:

```
[@casoon/astro-crawler-policy] [debug] registry version: 2026-04-09
[@casoon/astro-crawler-policy] [debug] environment: production
[@casoon/astro-crawler-policy] [debug] preset: citationFriendly
[@casoon/astro-crawler-policy] [debug] content signals: search=yes, aiInput=yes, aiTrain=no
[@casoon/astro-crawler-policy] [debug] bot: GPTBot → disallow
[@casoon/astro-crawler-policy] [debug] bot: OAI-SearchBot → allow
...
[@casoon/astro-crawler-policy] [debug] sitemap: https://example.com/sitemap-index.xml
```

## Bot registry

The following bots are known and can be referenced by ID in `bots: {}`:

| ID | Provider | Categories | Group |
|---|---|---|---|
| `GPTBot` | OpenAI | ai-training | verifiedAi |
| `OAI-SearchBot` | OpenAI | ai-search, ai-input | verifiedAi |
| `ClaudeBot` | Anthropic | ai-input, ai-training | verifiedAi |
| `claude-web` | Anthropic | ai-input | verifiedAi |
| `Google-Extended` | Google | ai-training | verifiedAi |
| `CCBot` | Common Crawl | ai-training | verifiedAi |
| `PerplexityBot` | Perplexity | ai-search, ai-input | verifiedAi |
| `Bytespider` | ByteDance | ai-training | verifiedAi |
| `meta-externalagent` | Meta | ai-input, ai-training | verifiedAi |
| `Amazonbot` | Amazon | ai-search, ai-input | verifiedAi |
| `Applebot-Extended` | Apple | ai-training | verifiedAi |
| `Googlebot` | Google | search | searchEngines |
| `Bingbot` | Microsoft | search | searchEngines |

## Extending the registry

The built-in registry covers the most common crawlers. To support bots not yet listed, use `extraBots`:

```ts
crawlerPolicy({
  extraBots: [
    {
      id: 'MyCustomBot',
      provider: 'Acme Corp',
      userAgents: ['MyCustomBot/1.0'],
      categories: ['ai-training'],
      verified: true
    }
  ],
  bots: {
    MyCustomBot: 'disallow'
  }
})
```

Extra bots participate in group rules, per-bot overrides, audit checks, and `llms.txt` output — the same as built-in bots.
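
How an extra bot ends up in a group can be sketched from the group definitions given earlier. The entry fields match the `extraBots` example; `groupOf` is a hypothetical helper, and checking the `search` category before the `verified` flag is an assumption about precedence that the README does not spell out.

```ts
// Hypothetical sketch: deriving a bot's group from its registry entry.
type Category = 'search' | 'ai-search' | 'ai-input' | 'ai-training' | 'unknown-ai';
type Group = 'searchEngines' | 'verifiedAi' | 'unknownAi';

interface BotEntry {
  id: string;
  provider: string;
  userAgents: string[];
  categories: Category[];
  verified: boolean;
}

const AI_CATEGORIES: Category[] = ['ai-search', 'ai-input', 'ai-training'];

function groupOf(bot: BotEntry): Group | undefined {
  const has = (c: Category): boolean => bot.categories.indexOf(c) !== -1;
  if (has('search')) return 'searchEngines';
  if (!bot.verified || has('unknown-ai')) return 'unknownAi';
  if (AI_CATEGORIES.some((c) => has(c))) return 'verifiedAi';
  return undefined; // no group applies
}

const custom: BotEntry = {
  id: 'MyCustomBot',
  provider: 'Acme Corp',
  userAgents: ['MyCustomBot/1.0'],
  categories: ['ai-training'],
  verified: true
};

console.log(groupOf(custom));                        // verifiedAi
console.log(groupOf({ ...custom, verified: false })); // unknownAi
```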

**Keeping the registry up to date:** The registry ships as part of the package. As new crawlers emerge, updates are released as patch versions. Run `npm update @casoon/astro-crawler-policy` to get the latest bot data. The `REGISTRY_VERSION` export contains the date of the last registry update.

## Audit warnings

The plugin emits warnings and info messages during the build:

| Code | Level | Condition |
|---|---|---|
| `MISSING_SITE_URL` | warn | No `site` set in Astro config |
| `NO_SITEMAP` | info | No sitemaps configured |
| `DUPLICATE_USER_AGENT_RULE` | warn | Two rules share the same User-agent |
| `UNLOCKED_NON_PRODUCTION_ENVIRONMENT` | warn | Staging/preview not globally blocked |
| `NON_STANDARD_DIRECTIVES` | info | Content signals may trigger GSC syntax warnings |
| `AI_INPUT_WITHOUT_ALLOWED_BOTS` | warn | `aiInput` enabled but all AI bots blocked |
| `UNKNOWN_BOT_ID` | warn | A bot ID in `bots: {}` is not in the registry |
| `GROUP_BOT_OVERRIDE_CONFLICT` | info | Bot override contradicts its group rule |

Audit settings:

```ts
crawlerPolicy({
  audit: {
    warnOnMissingSitemap: true,  // default
    warnOnConflicts: true        // default
  }
})
```
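
Two of the checks from the table can be sketched as small pure functions. The codes and levels come from the table; the `AuditIssue` shape, the function names, the message wording, and the case-insensitive comparison of user agents are assumptions made for illustration.

```ts
// Hypothetical sketch of two audit checks.
interface AuditIssue {
  code: string;
  level: 'warn' | 'info';
  message: string;
}

// UNKNOWN_BOT_ID: a key in `bots` that the registry does not know.
function auditBotIds(bots: Record<string, string>, registryIds: string[]): AuditIssue[] {
  return Object.keys(bots)
    .filter((id) => registryIds.indexOf(id) === -1)
    .map((id) => ({
      code: 'UNKNOWN_BOT_ID',
      level: 'warn' as const,
      message: `Bot "${id}" is not in the registry`
    }));
}

// DUPLICATE_USER_AGENT_RULE: the same User-agent appears in more than one rule.
function auditDuplicateUserAgents(userAgents: string[]): AuditIssue[] {
  const seen: Record<string, boolean> = {};
  const issues: AuditIssue[] = [];
  for (const agent of userAgents) {
    const key = agent.toLowerCase();
    if (seen[key]) {
      issues.push({
        code: 'DUPLICATE_USER_AGENT_RULE',
        level: 'warn',
        message: `More than one rule targets "${agent}"`
      });
    }
    seen[key] = true;
  }
  return issues;
}

console.log(auditBotIds({ GPTBot: 'disallow', GPTBott: 'disallow' }, ['GPTBot', 'ClaudeBot']));
console.log(auditDuplicateUserAgents(['*', 'Slurp', 'slurp']));
```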

## Programmatic usage

The core modules are exported for use outside of the Astro integration:

```ts
import {
  compilePolicy,
  renderRobotsTxt,
  renderLlmsTxt,
  auditPolicy,
  defaultRegistry,
  REGISTRY_VERSION
} from '@casoon/astro-crawler-policy';

const policy = compilePolicy({
  options: { preset: 'citationFriendly', sitemaps: ['/sitemap-index.xml'] },
  site: 'https://example.com',
  environment: 'production'
});

const robotsTxt = renderRobotsTxt(policy);
const llmsTxt = renderLlmsTxt(policy, 'https://example.com');
const issues = auditPolicy(policy, { site: 'https://example.com', registry: defaultRegistry });
```

## Generated output examples

### citationFriendly (default)

```ts
crawlerPolicy({
  preset: 'citationFriendly',
  sitemaps: ['/sitemap-index.xml']
})
```

```
# Generated by @casoon/astro-crawler-policy
# preset: citationFriendly

User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: claude-web
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://example.com/sitemap-index.xml
```

### seoOnly

```ts
crawlerPolicy({ preset: 'seoOnly' })
```

```
# Generated by @casoon/astro-crawler-policy
# preset: seoOnly

User-agent: *
Content-Signal: search=yes, ai-input=no, ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: claude-web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```

### lockdown (staging/preview)

```ts
crawlerPolicy({
  env: {
    staging: { preset: 'lockdown' },
    preview: { preset: 'lockdown' }
  }
})
```

When `CONTEXT=staging` or `NODE_ENV=staging`:

```
# Generated by @casoon/astro-crawler-policy
# preset: lockdown

User-agent: *
Content-Signal: search=no, ai-input=no, ai-train=no
Disallow: /
```

---

> This tool only works for crawlers and AI bots that actually respect robots.txt. Respect, however, is rare these days.