{"id":50182080,"url":"https://github.com/ipanalytics/crawlerscope","last_synced_at":"2026-06-09T06:05:48.997Z","repository":{"id":358933514,"uuid":"1243777512","full_name":"ipanalytics/CrawlerScope","owner":"ipanalytics","description":"Interactive crawler IP intelligence dashboard for search, AI, and user-triggered fetchers.","archived":false,"fork":false,"pushed_at":"2026-05-19T20:06:01.000Z","size":337,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-19T20:11:52.373Z","etag":null,"topics":["ai-bots","ai-crawlers","bingbot","bot-detection","cidr","crawler","crawler-detection","data-visualization","github-pages","googlebot","gptbot","ip-ranges","nginx","open-data","osint","robots-txt","threat-intelligence","waf","web-security"],"latest_commit_sha":null,"homepage":"https://ipanalytics.github.io/CrawlerScope/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ipanalytics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-19T16:50:05.000Z","updated_at":"2026-05-19T20:06:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ipanalytics/CrawlerScope","commit_stats":null,"previous_names":["ipanalytics/crawlerscope"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ipanalytics/CrawlerScope","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FCrawlerScope","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FCrawlerScope/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FCrawlerScope/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FCrawlerScope/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ipanalytics","download_url":"https://codeload.github.com/ipanalytics/CrawlerScope/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipanalytics%2FCrawlerScope/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33464014,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-25T06:32:55.349Z","status":"ssl_error","status_checked_at":"2026-05-25T06:32:35.322Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-bots","ai-crawlers","bingbot","bot-detection","cidr","crawler","crawler-detection","data-visualization","github-pages","googlebot","gptbot","ip-ranges","nginx","open-data","osint","robots-txt","threat-intelligence","waf","web-security"],"created_at":"2026-05-25T07:04:49.796Z","updated_at":"2026-06-09T06:05:48.991Z","avatar_url":"https://github.com/ipanalytics.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CrawlerScope\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./public/assets/banner.png\" alt=\"CrawlerScope banner\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/ipanalytics/CrawlerScope/actions/workflows/crawler-scope.yml\"\u003e\u003cimg alt=\"CI\" src=\"https://img.shields.io/github/actions/workflow/status/ipanalytics/CrawlerScope/crawler-scope.yml?branch=main\u0026label=collector\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://ipanalytics.github.io/CrawlerScope/\"\u003e\u003cimg alt=\"GitHub Pages\" src=\"https://img.shields.io/badge/pages-online-brightgreen\"\u003e\u003c/a\u003e\n  \u003ca href=\"./LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/badge/license-MIT-blue\"\u003e\u003c/a\u003e\n  \u003cimg alt=\"Dataset\" src=\"https://img.shields.io/badge/dataset-43%20services-2f6fdd\"\u003e\n  \u003cimg alt=\"Prefixes\" src=\"https://img.shields.io/badge/CIDR-7%2C180%20prefixes-success\"\u003e\n  \u003cimg alt=\"Version\" src=\"https://img.shields.io/badge/schema-v1-informational\"\u003e\n\u003c/p\u003e\n\nCrawlerScope collects operator-published crawler, fetcher, monitoring, scanner, and preview-bot network ranges, normalizes them into deployable CIDR data, and publishes a static dashboard plus machine-readable artifacts for infrastructure and security teams.\n\n**Live dashboard:** [ipanalytics.github.io/CrawlerScope](https://ipanalytics.github.io/CrawlerScope/)  \n**Current dataset:** [data/current/crawlers.json](./data/current/crawlers.json)\n\n---\n\n## Overview\n\nCrawlerScope is a small, auditable data pipeline for bot network intelligence. It tracks published source health, separates authoritative IP feeds from documented user-agent-only identities, and emits artifacts suitable for WAF rules, reverse proxies, allowlists, deny controls, analytics enrichment, and incident triage.\n\nThe project intentionally keeps source definitions in data, not code. Collector behavior lives in [`scripts/update.py`](./scripts/update.py); operator sources live in [`config/sources.json`](./config/sources.json).\n\n## Current Dataset\n\nGenerated at `2026-05-26T12:01:22Z`.\n\n| Metric | Count |\n|---|---:|\n| Services | 43 |\n| Healthy sources | 43 |\n| Authoritative IP lists | 32 |\n| CIDR prefixes | 7,180 |\n| IPv4 prefixes | 6,705 |\n| IPv6 prefixes | 475 |\n| AI crawler/fetcher prefixes | 1,653 |\n\n| Category | Services |\n|---|---:|\n| AI crawlers | 13 |\n| Search crawlers | 9 |\n| Monitoring probes | 5 |\n| Social previews | 4 |\n| Fetchers | 3 |\n| SEO crawlers | 3 |\n| Ad verification | 2 |\n| Security scanners | 2 |\n| Archive | 1 |\n| Analytics crawlers | 1 |\n\n\u003cdetails\u003e\n\u003csummary\u003eTracked services\u003c/summary\u003e\n\n| Service | Category | Source type | Prefixes |\n|---|---|---|---:|\n| Google common crawlers | search | official_json | 69 |\n| Google special crawlers | search | official_json | 46 |\n| Google user-triggered fetchers | fetcher | official_json | 223 |\n| Bingbot | search | official_json | 28 |\n| DuckDuckBot | search | official_json | 334 |\n| DuckAssistBot | ai | official_json | 334 |\n| Applebot | search | official_json | 12 |\n| MojeekBot | search | official_json | 1 |\n| Naver Yeti | search | official_json | 36 |\n| YandexBot | search | known_static | 13 |\n| Baiduspider | search | known_static | 2 |\n| GPTBot | ai | official_json | 17 |\n| OAI-SearchBot | ai | official_json | 32 |\n| ChatGPT-User | ai | official_json | 214 |\n| OAI-AdsBot | ai | documented_user_agent | 0 |\n| PerplexityBot | ai | official_json | 8 |\n| Perplexity-User | ai | official_json | 4 |\n| ClaudeBot / Claude-SearchBot | ai | documented_user_agent | 0 |\n| Amazonbot | ai | official_embedded_json | 524 |\n| Amzn-SearchBot | ai | official_embedded_json | 512 |\n| Amzn-User | fetcher | official_embedded_json | 1,023 |\n| Meta-ExternalAgent / Meta-WebIndexer | ai | known_static | 4 |\n| Bytespider | ai | documented_user_agent | 0 |\n| MistralAI-User | ai | official_json | 4 |\n| AhrefsBot | seo | official_json | 51 |\n| Lumar crawler | seo | official_json | 66 |\n| SemrushBot | seo | documented_user_agent | 0 |\n| Censys scanners | security-scanner | known_static | 2 |\n| Shodan scanners | security-scanner | known_static | 9 |\n| Datadog Synthetics | monitoring | official_json | 113 |\n| IAS crawler | ad-verification | official_json | 14 |\n| TTD-Content crawler | ad-verification | official_text | 2,615 |\n| UptimeRobot | monitoring | official_text | 217 |\n| Pingdom probes | monitoring | official_text | 158 |\n| StatusCake probes | monitoring | official_json | 296 |\n| Better Stack probes | monitoring | official_text | 34 |\n| Common Crawl CCBot | archive | official_json | 6 |\n| Flipboard crawler | social | official_text | 136 |\n| Parse.ly crawler | analytics | official_json | 10 |\n| Pinterestbot | social | documented_user_agent | 0 |\n| LinkedInBot | social | documented_user_agent | 0 |\n| Telegram link preview | social | official_text | 11 |\n| RSS API feed parser | fetcher | official_text | 2 |\n\n\u003c/details\u003e\n\n---\n\n## Architecture\n\nCrawlerScope runs as a scheduled GitHub Actions collector and publishes static artifacts.\n\n```mermaid\nflowchart LR\n  A[\"config/sources.json\"] --\u003e B[\"scripts/update.py\"]\n  B --\u003e C[\"Fetch operator sources\"]\n  C --\u003e D[\"Normalize and collapse CIDR prefixes\"]\n  D --\u003e E[\"data/current/crawlers.json\"]\n  D --\u003e F[\"data/current/robots-ai.txt\"]\n  D --\u003e G[\"data/current/nginx-ai-map.conf\"]\n  D --\u003e H[\"data/snapshots/*.json\"]\n  E --\u003e I[\"Static dashboard\"]\n  H --\u003e J[\"GitHub Release artifacts\"]\n```\n\nSource types:\n\n| Type | Meaning |\n|---|---|\n| `official_json` | Operator-published machine-readable JSON feed |\n| `official_text` | Operator-published plain-text CIDR/IP feed |\n| `official_embedded_json` | Operator page with machine-readable ranges embedded in HTML |\n| `documented_user_agent` | Documented bot identity without a stable public IP list |\n| `known_static` | Useful static seed list, not treated as complete authority |\n\n## Features\n\n- Operator-published source collection with source health tracking.\n- IPv4/IPv6 normalization, CIDR coercion, and prefix collapsing.\n- Static dashboard with category, operator, source, service, and search filters.\n- Filtered exports for JSON, CSV, CIDR lists, `robots.txt`, and Nginx user-agent maps.\n- Snapshot retention and historical summary tracking.\n- GitHub Pages publication and automatic dataset releases.\n- Config-driven source inventory in [`config/sources.json`](./config/sources.json).\n\n## Quick Start\n\nRun the collector and serve the dashboard locally:\n\n```bash\npython3 scripts/update.py\npython3 -m http.server 8080\n```\n\nOpen:\n\n```text\nhttp://127.0.0.1:8080/public/\n```\n\nWhen serving from `public/`, the app reads data from `../data/current`. For GitHub Pages deployment, the workflow copies `public/` and `data/` into the Pages artifact.\n\n## Installation\n\nCrawlerScope has no runtime dependency outside the Python standard library for data collection.\n\n```bash\ngit clone https://github.com/ipanalytics/CrawlerScope.git\ncd CrawlerScope\npython3 scripts/update.py\n```\n\nOptional environment controls:\n\n```bash\nexport CRAWLER_SCOPE_USER_AGENT=\"CrawlerScope/0.1 (+https://example.org/contact)\"\nexport CRAWLER_SCOPE_SNAPSHOT_RETENTION=168\nexport CRAWLER_SCOPE_HISTORY_RETENTION=720\npython3 scripts/update.py\n```\n\n## Usage Examples\n\nExport all current CIDRs:\n\n```bash\njq -r '.services[].prefixes | .ipv4[], .ipv6[]' data/current/crawlers.json\n```\n\nExport AI crawler CIDRs:\n\n```bash\njq -r '.services[] | select(.category == \"ai\") | .prefixes | .ipv4[], .ipv6[]' data/current/crawlers.json\n```\n\nList sources that are documented but do not publish IP ranges:\n\n```bash\njq -r '.services[] | select(.sourceType == \"documented_user_agent\") | [.id, .service, .sourceUrl] | @tsv' data/current/crawlers.json\n```\n\nGenerate an Nginx include from the current dataset:\n\n```bash\ncp data/current/nginx-ai-map.conf /etc/nginx/conf.d/crawler-scope-ai-map.conf\nnginx -t\n```\n\n## Outputs\n\n| Path | Description |\n|---|---|\n| [`data/current/crawlers.json`](./data/current/crawlers.json) | Full normalized dataset |\n| [`data/current/robots-ai.txt`](./data/current/robots-ai.txt) | Generated AI crawler `robots.txt` block |\n| [`data/current/nginx-ai-map.conf`](./data/current/nginx-ai-map.conf) | Nginx `map` for AI crawler user-agents |\n| [`data/history/summary.csv`](./data/history/summary.csv) | Historical summary rows |\n| [`data/snapshots/*.json`](./data/snapshots) | Timestamped dataset snapshots |\n| [`config/sources.json`](./config/sources.json) | Source inventory and classification config |\n\n## Data Format\n\nEach service record includes source metadata, user-agent patterns, reverse-DNS hints, health status, prefix counts, and split IPv4/IPv6 arrays.\n\n```json\n{\n  \"id\": \"openai-gptbot\",\n  \"service\": \"GPTBot\",\n  \"operator\": \"OpenAI\",\n  \"category\": \"ai\",\n  \"sourceType\": \"official_json\",\n  \"sourceOk\": true,\n  \"ipListAuthoritative\": true,\n  \"userAgentPatterns\": [\"GPTBot\"],\n  \"counts\": {\n    \"prefixes\": 17,\n    \"ipv4\": 17,\n    \"ipv6\": 0\n  },\n  \"prefixes\": {\n    \"ipv4\": [\"20.42.10.176/28\"],\n    \"ipv6\": []\n  }\n}\n```\n\n## Operational Notes\n\n- Treat `sourceOk=false` as a collection failure for that run. The collector falls back to the previous cached prefixes when available.\n- IP ranges identify published infrastructure, not intent. Use user-agent, reverse DNS, request behavior, and application context where enforcement risk matters.\n- Static and documented-only sources are included because they are operationally useful, but authoritative flags remain separate.\n- Release artifacts are generated by GitHub Actions after collection and attached to timestamped dataset releases.\n\n## Project Scope\n\nCrawlerScope tracks public crawler, fetcher, monitoring, scanner, analytics, and preview-bot infrastructure that is useful for request classification and network policy. It prioritizes primary operator-published sources. Aggregator repositories may be reviewed for discovery, but their URLs are not used as dataset sources.\n\n## Use Cases\n\n- WAF allow/deny policy design for crawler traffic.\n- Search and AI crawler visibility audits.\n- Security logging enrichment and bot attribution.\n- Monitoring probe allowlisting.\n- Fraud/risk triage for automated traffic.\n- Change tracking for published crawler infrastructure.\n\n## Limitations\n\n- Some operators publish user-agent documentation but no stable IP feed.\n- Cloud-hosted crawlers may share network space with unrelated workloads.\n- CIDR lists can change without notice; scheduled collection reduces but does not remove that latency.\n\n## Directory Structure\n\n```text\n.\n├── config/\n│   └── sources.json\n├── data/\n│   ├── current/\n│   ├── history/\n│   └── snapshots/\n├── public/\n│   ├── assets/\n│   └── index.html\n├── scripts/\n│   └── update.py\n└── .github/\n    └── workflows/\n```\n\n## Deployment\n\nThe included workflow runs every six hours and can be triggered manually:\n\n```yaml\non:\n  schedule:\n    - cron: \"23 */6 * * *\"\n  workflow_dispatch:\n```\n\nThe workflow:\n\n1. Runs `scripts/update.py`.\n2. Commits updated `data/` and `config/` changes.\n3. Publishes a timestamped GitHub Release with dataset artifacts.\n4. Deploys the static dashboard to GitHub Pages.\n\n## License\n\nCrawlerScope is released under the [MIT License](./LICENSE).\n\n## Disclaimer\n\nCrawlerScope publishes normalized data from public operator sources. Review upstream terms and validate enforcement logic before using the dataset in production controls.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipanalytics%2Fcrawlerscope","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fipanalytics%2Fcrawlerscope","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipanalytics%2Fcrawlerscope/lists"}