{"id":51222048,"url":"https://github.com/ipanalytics/ai-crawler-blocklist","last_synced_at":"2026-06-28T08:00:55.895Z","repository":{"id":365501502,"uuid":"1272360590","full_name":"ipanalytics/AI-Crawler-Blocklist","owner":"ipanalytics","description":"AI-Crawler-Blocklist publishes AI crawler blocklists and deployment-ready firewall snippets from official operator-published sources. It separates verified IP ranges, user-agent rules, robots.txt controls, and watch lists so site operators can choose the right enforcement level without mixing signal quality.","archived":false,"fork":false,"pushed_at":"2026-06-25T15:13:21.000Z","size":253,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-25T17:08:24.009Z","etag":null,"topics":["ai-bots","ai-crawlers","bot-blocklist","cloudflare","crawler-blocklist","firewall","ip-blocklist","llm","nginx","open-source-intelligence","robots-txt"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ipanalytics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-17T14:34:04.000Z","updated_at":"2026-06-25T15:13:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ipanalytics/AI-Crawler-Blocklist","commit_stats":null,"previous_names":["ipanalytics/ai-crawler-blocklist"],"tags_count":9,"template":false,"template_full_name":null,"pur
l":"pkg:github/ipanalytics/AI-Crawler-Blocklist","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FAI-Crawler-Blocklist","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FAI-Crawler-Blocklist/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FAI-Crawler-Blocklist/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FAI-Crawler-Blocklist/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ipanalytics","download_url":"https://codeload.github.com/ipanalytics/AI-Crawler-Blocklist/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FAI-Crawler-Blocklist/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34881384,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-bots","ai-crawlers","bot-blocklist","cloudflare","crawler-blocklist","firewall","ip-blocklist","llm","nginx","open-source-intelligence","robots-txt"],"created_at":"2026-06-28T08:00:52.966Z","updated_at":"2026-06-28T08:00:55.885Z","avatar_url":"https://github.com/ipanalytics.png","language":"Python","funding_links":[],"categories":[],
"sub_categories":[],"readme":"# AI-Crawler-Blocklist\n\nAI-Crawler-Blocklist publishes AI crawler blocklists and deployment-ready firewall snippets from official operator-published sources. It separates verified IP ranges, user-agent rules, robots.txt controls, and watch lists so site operators can choose the right enforcement level without mixing signal quality.\n\n\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/ipanalytics/AI-Crawler-Blocklist/actions/workflows/update.yml\"\u003e\u003cimg alt=\"Update\" src=\"https://img.shields.io/github/actions/workflow/status/ipanalytics/AI-Crawler-Blocklist/update.yml?branch=main\u0026label=update\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ipanalytics/AI-Crawler-Blocklist/actions/workflows/validate-pr.yml\"\u003e\u003cimg alt=\"CI\" src=\"https://img.shields.io/github/actions/workflow/status/ipanalytics/AI-Crawler-Blocklist/validate-pr.yml?branch=main\u0026label=ci\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ipanalytics/AI-Crawler-Blocklist/releases\"\u003e\u003cimg alt=\"Release\" src=\"https://img.shields.io/github/v/release/ipanalytics/AI-Crawler-Blocklist?display_name=tag\u0026sort=date\"\u003e\u003c/a\u003e\n  \u003cimg alt=\"Dataset\" src=\"https://img.shields.io/badge/dataset-generated%20dist-2f6fed\"\u003e\n  \u003cimg alt=\"Python\" src=\"https://img.shields.io/badge/python-3.12-3776ab\"\u003e\n  \u003cimg alt=\"License\" src=\"https://img.shields.io/badge/license-MIT-blue\"\u003e\n\u003c/p\u003e\n\n---\n\n## Links\n\n| Resource | URL |\n| --- | --- |\n| Generated artifacts | [`dist/`](./dist) |\n| Source policy | [`docs/source-policy.md`](./docs/source-policy.md) |\n| Firewall deployment notes | [`docs/firewalls.md`](./docs/firewalls.md) |\n| Operating modes | [`docs/modes.md`](./docs/modes.md) |\n| Source health report | [`dist/sources-report.md`](./dist/sources-report.md) |\n| Machine-readable metadata | [`dist/metadata.json`](./dist/metadata.json) |\n\n## 
Overview\n\nAI-Crawler-Blocklist is built for publishers, application operators, infrastructure teams, and security engineers who need repeatable controls for AI training crawlers, AI search bots, assistant fetchers, and related indexing systems.\n\nThe repository consumes curated source definitions from `config/sources.json`, validates the source policy, fetches official IP feeds where available, normalizes CIDRs, and renders platform-specific outputs under `dist/`. Source failures are recorded in metadata instead of failing the entire build, which keeps scheduled updates operational while preserving source health visibility.\n\n## System Behavior\n\n```text\nconfig/sources.json\n        |\n        v\nscripts/normalize_sources.py  -\u003e confidence, enforcement, source policy\n        |\n        v\nscripts/fetch_sources.py      -\u003e official JSON/text/embedded JSON/static prefixes\n        |\n        v\nscripts/build.py              -\u003e deterministic dist artifacts\n        |\n        v\ndist/metadata.json + firewall snippets + robots.txt + plain lists\n```\n\nEnforcement is derived from source quality:\n\n| Class | Source quality | Output behavior |\n| --- | --- | --- |\n| `verified-drop` | Official crawler-specific IP/CIDR feed | Eligible for IP hard drop |\n| `ua-only` | Documented user-agent without verified IP feed | User-agent block rules only |\n| `robots-only` | Robots token such as `Google-Extended` | robots.txt outputs only |\n| `static-watch` | Broad static ranges, CN/watch, platform ranges, weak signals | Observe, challenge, or rate-limit |\n\n## Features\n\n- Verified IPv4/IPv6 lists for official AI crawler IP feeds.\n- User-agent lists, regex lists, nginx maps, Apache SetEnvIf rules, and Cloudflare expressions.\n- robots.txt snippets for training opt-out, all AI bots, CN/watch bots, and search-safe AI opt-out.\n- iptables/ipset, nftables, pf/pfSense, Caddy, HAProxy, and Traefik outputs.\n- Deterministic builds with fixed timestamp support via 
`CRAWLERSCOPE_GENERATED_AT`.\n- Machine-readable metadata with counts, source health, confidence, enforcement, and failed sources.\n- Scheduled GitHub Actions update workflow and daily release workflow.\n\n## Quick Start\n\n```bash\ngit clone https://github.com/ipanalytics/AI-Crawler-Blocklist.git\ncd AI-Crawler-Blocklist\nmake install-dev\nmake build\nmake validate\nmake test\n```\n\nFor sandboxed environments where `uv` must keep all state inside the worktree:\n\n```bash\nUV_CACHE_DIR=.uv-cache UV_PYTHON_INSTALL_DIR=.uv-python \\\n  uv run --python 3.12 python scripts/build.py\n```\n\n## Installation\n\nThe generated files are intended to be consumed directly from GitHub raw URLs or vendored into your own configuration management.\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/metadata.json\n```\n\nRelease archives are recommended for controlled production rollout. The command below fetches the latest release; pin to a specific release tag where reproducibility matters:\n\n```bash\ncurl -fsSL https://github.com/ipanalytics/AI-Crawler-Blocklist/releases/latest/download/ai-crawler-blocklist-dist.tar.gz \\\n  -o ai-crawler-blocklist-dist.tar.gz\n```\n\n## Usage Examples\n\n### robots.txt\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/robots-ai-all-block.txt \\\n  -o /var/www/html/robots.txt\n```\n\n### nginx\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/nginx-ai-map.conf \\\n  -o /etc/nginx/snippets/nginx-ai-map.conf\n```\n\n```nginx\ninclude /etc/nginx/snippets/nginx-ai-map.conf;\n\nserver {\n    if ($ai_crawler) {\n        return 403;\n    }\n}\n```\n\n```bash\nnginx -t \u0026\u0026 systemctl reload nginx\n```\n\n### Apache\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/apache-ai-setenvif.conf \\\n  -o /etc/apache2/conf-available/ai-crawlers.conf\n\na2enconf ai-crawlers\napachectl configtest \u0026\u0026 systemctl reload apache2\n```\n\n### Cloudflare 
WAF\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/cloudflare-ai-expression.txt\n```\n\nUse the expression in a WAF Custom Rule. The output is UA-based and designed for review before deployment.\n\n### iptables\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/iptables-ai.sh \\\n  -o /usr/local/sbin/update-ai-iptables.sh\n\nchmod +x /usr/local/sbin/update-ai-iptables.sh\n/usr/local/sbin/update-ai-iptables.sh\n```\n\nThe generated script uses `ipset` for set-based matching.\n\n### nftables\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/nftables-ai.nft \\\n  -o /etc/nftables.d/ai-crawlers.nft\n\nnft -f /etc/nftables.d/ai-crawlers.nft\n```\n\n### Caddy\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/caddy-ai-block.caddy \\\n  -o /etc/caddy/snippets/ai-crawlers.caddy\n\ncaddy validate --config /etc/caddy/Caddyfile \u0026\u0026 systemctl reload caddy\n```\n\n### HAProxy\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/ipanalytics/AI-Crawler-Blocklist/main/dist/haproxy-ai-acl.cfg \\\n  -o /etc/haproxy/ai-crawlers.cfg\n\nhaproxy -c -f /etc/haproxy/haproxy.cfg \u0026\u0026 systemctl reload haproxy\n```\n\n## Outputs\n\n| Artifact | Purpose |\n| --- | --- |\n| `dist/ai-ips-verified-v4.txt` | Verified official IPv4 CIDRs |\n| `dist/ai-ips-verified-v6.txt` | Verified official IPv6 CIDRs |\n| `dist/ai-ips-verified-all.txt` | Combined verified CIDRs |\n| `dist/ai-ips-high-confidence-v4.txt` | IPv4 challenge/rate-limit candidates |\n| `dist/ai-ips-high-confidence-v6.txt` | IPv6 challenge/rate-limit candidates |\n| `dist/ai-user-agents.txt` | Plain AI crawler UA tokens |\n| `dist/ai-user-agents-regex.txt` | Escaped UA regex tokens |\n| `dist/ai-cn-user-agents-watch.txt` | CN/watch UA list |\n| `dist/robots-ai-all-block.txt` | robots.txt rules for AI bots and 
robots-only tokens |\n| `dist/cloudflare-ai-expression.txt` | Cloudflare WAF expression |\n| `dist/metadata.json` | Source health and counts |\n| `dist/sources-report.md` | Human-readable source report |\n\n\u003cdetails\u003e\n\u003csummary\u003ePlatform-specific files\u003c/summary\u003e\n\n| Artifact | Platform |\n| --- | --- |\n| `dist/nginx-ai-map.conf` | nginx |\n| `dist/nginx-ai-deny.conf` | nginx |\n| `dist/apache-ai-setenvif.conf` | Apache |\n| `dist/iptables-ai.sh` | iptables/ipset |\n| `dist/nftables-ai.nft` | nftables |\n| `dist/pf-ai-table.conf` | pf / pfSense |\n| `dist/caddy-ai-block.caddy` | Caddy |\n| `dist/haproxy-ai-acl.cfg` | HAProxy |\n| `dist/traefik-ai-middleware.yml` | Traefik |\n\n\u003c/details\u003e\n\n## Data Format\n\nAll generated text files include a header with project name, generation timestamp, source repository, policy, and review note.\n\n`dist/metadata.json` is the operational source of truth for current build state:\n\n```json\n{\n  \"generated_at\": \"2026-06-17T00:00:00Z\",\n  \"project\": \"AI-Crawler-Blocklist\",\n  \"policy\": \"official/operator-published sources only\",\n  \"counts\": {\n    \"verified_ipv4_prefixes\": 2261,\n    \"verified_ipv6_prefixes\": 1,\n    \"user_agent_patterns\": 24,\n    \"robots_tokens\": 26\n  },\n  \"failed_sources\": []\n}\n```\n\nSource definitions live in `config/sources.json`. 
The normalizer adds `confidence`, `enforcement`, `ipPolicy`, and `includeInAiOutputs` at build time.\n\n## Operational Notes\n\n- Use `verified-drop` artifacts for hard IP enforcement.\n- Use UA files for application-layer controls where IP ranges are unavailable.\n- Use watch lists for logging, challenge, bot-score adjustment, or rate limiting.\n- Treat `Google-Extended` and `Applebot-Extended` as robots.txt controls.\n- Review `dist/metadata.json` and `dist/sources-report.md` before rolling changes into production.\n\n## Project Scope\n\nThe project covers AI crawlers, AI search bots, assistant fetchers, training/indexing bots, and AI-adjacent archive sources such as CCBot. Generic search crawlers, SEO tools, uptime probes, ad verification crawlers, social preview bots, and security scanners are outside the generated AI output set unless explicitly classified by policy.\n\n## Use Cases\n\n- Publisher AI crawling controls.\n- WAF rule generation for known AI user agents.\n- Verified IP hard-drop lists for official crawler feeds.\n- Bot analytics enrichment from access logs.\n- Change-controlled distribution of crawler policy into infrastructure automation.\n\n## Limitations\n\n- User-agent strings can be spoofed.\n- robots.txt depends on crawler compliance.\n- Some assistant fetchers are user-triggered and may affect product visibility.\n- Broad cloud or platform ranges belong in observe/challenge workflows, not default hard drop.\n\n## Directory Structure\n\n```text\n.\n├── config/                 # source definitions, policy, output manifest, schema\n├── dist/                   # generated blocklists and platform artifacts\n├── docs/                   # operator documentation\n├── scripts/                # build, fetch, normalize, validate, render\n├── templates/              # Jinja templates for generated configs\n├── tests/                  # source policy, parsing, output, workflow tests\n└── .github/workflows/      # update, PR validation, daily 
release\n```\n\n## Deployment\n\nThe update workflow rebuilds `dist/` every six hours and commits changes when generated artifacts differ. The release workflow publishes a daily release containing the current `dist/` archive plus metadata and source report.\n\nProduction deployments should pin to a release tag or mirror `dist/` through internal configuration management. Direct raw URL consumption is suitable for simple hosts and lab environments.\n\n## License\n\nMIT. See [`LICENSE`](./LICENSE).\n\n## Disclaimer\n\nThis project provides defensive network and application-layer control data. Operators are responsible for testing enforcement impact in their own environment before blocking traffic.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipanalytics%2Fai-crawler-blocklist","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fipanalytics%2Fai-crawler-blocklist","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipanalytics%2Fai-crawler-blocklist/lists"}