{"id":50505442,"url":"https://github.com/djeshkov/nginx-autoblock","last_synced_at":"2026-06-02T15:31:08.769Z","repository":{"id":358054240,"uuid":"1238897585","full_name":"djeshkov/nginx-autoblock","owner":"djeshkov","description":"Behavioral subnet autoblocker for Nginx — composite scoring + free IP reputation","archived":false,"fork":false,"pushed_at":"2026-05-15T13:48:13.000Z","size":59,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-15T14:28:00.763Z","etag":null,"topics":["bot-blocker","cloudflare","ip-reputation","nginx","rate-limiting","security","web-scraping-protection"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/djeshkov.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-14T15:00:23.000Z","updated_at":"2026-05-15T13:40:21.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/djeshkov/nginx-autoblock","commit_stats":null,"previous_names":["djeshkov/nginx-autoblock"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/djeshkov/nginx-autoblock","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djeshkov%2Fnginx-autoblock","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djeshkov%2Fnginx-autoblock/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djeshkov%2Fnginx-autoblock/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djeshkov%2Fnginx-autoblock/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/djeshkov","download_url":"https://codeload.github.com/djeshkov/nginx-autoblock/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djeshkov%2Fnginx-autoblock/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33829340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bot-blocker","cloudflare","ip-reputation","nginx","rate-limiting","security","web-scraping-protection"],"created_at":"2026-06-02T15:31:07.697Z","updated_at":"2026-06-02T15:31:08.759Z","avatar_url":"https://github.com/djeshkov.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# nginx-autoblock\n\nBehavioral autoblocker for Nginx. Detects bot crawlers by **composite scoring** across multiple signals (UA diversity, request patterns, IP reputation, behavioral fingerprint) and adds offending **subnets, individual IPs and UA-clusters** to nginx's block-list with TTL.\n\nDesigned for three threat classes that per-IP rate-limiting (`limit_req_zone $binary_remote_addr`) misses:\n- **Concentrated botnets** — same /24 producing 100+ req/h, each IP individually below per-IP limits (subnet pass, default).\n- **Distributed scraping** — hundreds of cloud IPs from many ASNs, 1-2 requests each, mass-scraping public URLs harvested from sitemaps or catalog/product pages (per-IP pass, opt-in since v1.1).\n- **Distributed botnets** — hundreds of IPs making ~1 request each while rotating a tiny pool of User-Agent strings; no IP and no /24 stands out, but the shared UA does (UA-cluster pass, opt-in since v1.2).\n\n```\n                ┌─────────────────────────────────────────────────┐\n   nginx logs ──┤  autoblock (every 10 min via cron)              │\n                │                                                 │\n                │   Subnet pass (default):                        │\n                │     group requests by /24 or /64                │\n                │     score 0-11 against 5 behavioral signals     │\n                │     enrich via ip-api.com (free)                │\n                │     block /24 if score ≥ 7                      │\n                │                                                 │\n                │   Per-IP pass (opt-in since v1.1):              │\n                │     score each IP 0-14 (path-agnostic)          │\n                │     catches distributed scrapers (1 req/IP)     │\n                │     block /32 if score ≥ 9                      │\n                │                                                 │\n                │   UA-cluster pass (opt-in since v1.2):          │\n                │     group by User-Agent, score the cluster      │\n                │     catches distributed botnets (shared UA)     │\n                │     block member /32s if score ≥ 7              │\n                └────────────────┬────────────────────────────────┘\n                                 │\n                                 ▼\n            /etc/nginx/blocked-subnets.conf      (subnet pass)\n            /etc/nginx/blocked-ips.conf          (per-IP pass)\n            /etc/nginx/blocked-ua-clusters.conf  (UA-cluster pass)\n                                 │\n                                 ▼\n                       nginx returns 444 to bot\n```\n\n## Why this exists\n\nPer-IP rate limits (`limit_req_zone $binary_remote_addr`) don't catch distributed crawls: a bot operator with 25 IPs inside one /24 emits 1.5 req/min per IP — far below the per-IP threshold, but ~38 req/min from the subnet in aggregate, with one User-Agent and identical request patterns.\n\nExisting tools occupy adjacent niches:\n\n| Tool | Approach | Limitation |\n|------|----------|------------|\n| [nginx-ultimate-bad-bot-blocker](https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker) | Static UA/referrer/IP block-lists + fail2ban | Not adaptive — won't catch new bots without list updates |\n| [fail2ban-subnets](https://github.com/XaF/fail2ban-subnets) / [recidive-subnet](https://github.com/ruppel/fail2ban-recidive-subnet) | Escalate per-IP bans to /24 when enough hits | Counter-only — no behavioral analysis; depends on per-IP bans firing first |\n| [Cloudflare Bot Management](https://developers.cloudflare.com/bots/concepts/bot-score/) | ML scoring 1-99 | Paid, vendor lock-in |\n\n`nginx-autoblock` sits in the middle: **adaptive behavioral scoring with free reputation data**, no fail2ban dependency.\n\n## How it works\n\nFor each `/24` (IPv4) or `/64` (IPv6) seen in the last 30 minutes, score against 5 signals (max 11 points). Block if score ≥ 7.\n\n| Signal | Points | What it detects |\n|--------|--------|-----------------|\n| `≤ 2` unique User-Agents | **+2** | Homogeneous bot farm |\n| Target paths ≥ 50% / ≥ 80% of requests | **+1 / +1** additional | Focused API or search hammering |\n| Top-3 URLs ≥ 50% / ≥ 80% of requests | **+1 / +1** additional | Low URL diversity (bot vs human browsing) |\n| Referer rate \u003c 30% / \u003c 10% | **+1 / +1** additional | Real browsers send referer on link clicks |\n| ip-api.com `hosting=true` OR `proxy=true` | **+3** | Datacenter / proxy origin |\n| ip-api.com `mobile=true` | **-1** | Mobile carrier — likely real users |\n\n**Gates:**\n- Subnet must have ≥ `min_requests` (default 200) in the window — below this, not evaluated.\n- Whitelist hits (search engines, AI bots, your own IPs) are skipped before scoring.\n\n**Static-asset ratio is NOT a signal.** Behind a CDN, static files (CSS/JS/images) are served from the edge cache — only ~5% of static traffic reaches origin nginx, so this ratio is similar between humans and bots at origin and provides no discrimination.\n\n**ip-api.com batch enrichment** queries up to 100 IPs in one HTTP request, free, no signup. Results cached for 7 days per subnet. Falls back to offline ASN keyword matching (via `iptoasn.com` database) if the API is unreachable.\n\n## Per-IP scoring (distributed scraping)\n\nThe subnet pass has an architectural limit: when bot operators spread requests across **many cloud IPs, 1-2 requests each**, no /24 accumulates enough volume to trip. Since **v1.1**, an opt-in second pass scores each IP on its own behavioral fingerprint.\n\n```ini\n# /etc/nginx-autoblock/config.env\nper_ip_enabled=true\nper_ip_threshold=9\ninternal_ref_hosts=example.com,www.example.com   # for noref/extref signal\nself_ips=203.0.113.1                              # your origin IP(s)\n```\n\nThen either let the regular cron run pick it up (subnet pass runs first, then per-IP pass), or invoke it directly:\n\n```bash\nsudo autoblock --show-per-ip   # diagnostic — top 50 candidates, read-only\nsudo autoblock --per-ip --dry-run   # what would be blocked\nsudo autoblock --per-ip   # actually block\n```\n\nOutput goes to `/etc/nginx/blocked-ips.conf` — separate from the subnet file. Both are included in the same `geo $blocked_subnet` block (see `nginx/blacklist.conf`).\n\n### Signal set (path-agnostic)\n\n| Signal | Trigger | Points | Min req |\n|--------|---------|--------|---------|\n| **noassets** | Asset-loading ratio \u003c 5% | +3 | N ≥ 3 |\n| **noref** | No-referer ratio \u003e 80% | +2 | N ≥ 2 |\n| **extref** | External-referer ratio \u003e 50% | +1 | has-ref ≥ 3 |\n| **4xx** | 4xx-response ratio \u003e 30% | +1 | N ≥ 5 |\n| **upath** | Unique-paths ratio ≥ 95% | +2 | N ≥ 5 |\n| **cloud** | ASN description matches hosting/cloud keywords | +3 | — |\n| **ua:oldchrome** | Chrome major version \u003c threshold (default 142) | +2 | — |\n| **ua:headless** | UA matches HeadlessChrome / Puppeteer / Selenium / Scrapy | +3 | — |\n| **ua:short** | UA length \u003c 20 | +2 | — |\n\n**Maximum score: 14.** Default threshold: 9. Whitelisted UAs (Privacy Preserving Prefetch Proxy, imgix, monitoring services, claimed search-engine bots) skip scoring entirely.\n\nThe first 3 path-volume signals (noassets/noref/upath) require multiple requests to fire. The cloud/UA signals work at N=1 — they're what catches single-hit distributed scrapers.\n\n### When to enable\n\nEnable the per-IP pass when you observe **either**:\n- Your access log shows many distinct cloud IPs each hitting one specific endpoint (e.g., `/reservation/\u003cUUID\u003e`, `/product/\u003cID\u003e`, `/profile/\u003cUSER\u003e`) once each.\n- Session-recording or analytics tools show short bot-like sessions (\u003c 5s, 0 clicks) from many countries / IPs — but `--show-scores` (the subnet pass) finds nothing because no /24 is hot enough.\n\nBacktest details and signal calibration: [docs/SCORING.md § Per-IP pass](docs/SCORING.md#per-ip-pass-opt-in).\nReal-world first-hour results from a Laravel-fronted reference site: [docs/CASE-STUDY.md](docs/CASE-STUDY.md).\n\n## UA-cluster scoring (distributed botnets)\n\nBoth the subnet and per-IP passes score IPs **in isolation**. A distributed\nbotnet defeats both by design — hundreds of IPs, ~1 request each, every IP\nindividually innocent. But the botnet rotates a **tiny pool of User-Agent\nstrings** across its whole fleet. One UA shared by 250 datacenter IPs is not\nsomething a real browser population produces. Since **v1.2**, an opt-in third\npass groups requests by User-Agent and scores the cluster.\n\n```ini\n# /etc/nginx-autoblock/config.env\nua_cluster_enabled=true\nua_cluster_min_ips=30       # min distinct IPs sharing a UA to evaluate it\nua_cluster_threshold=7\nua_cluster_min_hosting=0.5  # hosting-ratio gate (see note below)\n```\n\nRun it after the regular cron passes, or directly:\n\n```bash\nsudo autoblock --show-ua-cluster      # diagnostic — flagged clusters, read-only\nsudo autoblock --ua-cluster --dry-run # what would be blocked\nsudo autoblock --ua-cluster           # actually block\n```\n\nOutput goes to `/etc/nginx/blocked-ua-clusters.conf` — a confirmed botnet\ncluster contributes all its member IPs as `/32` bans.\n\n### Signal set\n\n| Signal | Trigger | Points |\n|--------|---------|--------|\n| **host** / **host+** | Cluster hosting-ASN ratio ≥ 50% / ≥ 80% | +2 / +2 additional |\n| **noassets** | Cluster asset-loading ratio \u003c 5% | +3 |\n| **noref** | Cluster no-referer ratio \u003e 80% | +2 |\n| **4xx** | Cluster 4xx-response ratio \u003e 30% | +1 |\n| **ua:headless / oldchrome / short** | UA is a headless tool, old Chrome, or thin | +3 / +2 / +2 |\n\n**Maximum score: 13.** Default threshold: 7. The discriminator is **hosting-ASN\nratio and behavior — never raw IP count**: a current Chrome UA shared by\nthousands of residential users scores 0, while a botnet UA shared by 250\ndatacenter IPs scores 9. Whitelisted and claimed search-engine UAs skip scoring.\n\n**Hosting-ratio gate** (`ua_cluster_min_hosting`, default 0.5): a cluster whose\nIPs are less than that fraction on hosting/datacenter ASNs is never blocked,\nregardless of score. Behind a CDN, static assets are edge-cached so the\n`noassets` signal fires on real-user clusters too — the gate makes hosting-ASN\nratio a necessary condition. Set to 0 only on origins not behind a CDN.\n\nSignal calibration, the hosting-ratio gate, and the May 2026 reference incident: [docs/SCORING.md § UA-cluster pass](docs/SCORING.md#ua-cluster-pass-opt-in).\n\n## Quick install\n\n```bash\ngit clone https://github.com/djeshkov/nginx-autoblock.git\ncd nginx-autoblock\nsudo ./scripts/install.sh\n```\n\nThe installer:\n- Copies `autoblock` to `/usr/local/bin/`\n- Creates `/etc/nginx-autoblock/config.env` from the template\n- Creates `/etc/nginx/blocked-subnets.conf` (empty) and `/etc/nginx/autoblock-whitelist.conf` (template)\n- Installs `/etc/nginx/conf.d/blacklist.conf` (the `geo $blocked_subnet` map)\n- Fetches the ASN database (~9 MB) to `/var/lib/nginx-autoblock/`\n- Installs cron schedule\n\n**Manual nginx step:** add this inside your `server { }` block:\n\n```nginx\nif ($blocked_subnet) {\n    return 444;\n}\n```\n\n(See `nginx/server-snippet.conf`. `444` closes the connection without sending a response — cheapest possible block.)\n\nThen:\n\n```bash\nsudo nginx -t \u0026\u0026 sudo nginx -s reload\nsudo /usr/local/bin/autoblock --dry-run     # see what would block\nsudo /usr/local/bin/autoblock --show-scores # diagnostic — top 30 with score breakdown\n```\n\n## Configuration\n\nEdit `/etc/nginx-autoblock/config.env`. Most important settings:\n\n```ini\naccess_log=/var/log/nginx/access.log\n\n# Tune target_paths to your application — bots hammer specific endpoints.\n# For a typical web app: APIs and search are common targets.\ntarget_paths=/api/,/search\n\n# Exclude paths that look like targets but are legitimate (admin panels, etc.)\nexcluded_paths=/api/admin/\n\n# Volume gate — raise if you have a lot of organic traffic from active power users.\nmin_requests=200\n\n# Score threshold for blocking (max 11).\n# 7 = balanced (default). 8-9 = more conservative (fewer blocks, fewer false positives).\nscore_threshold=7\n\nttl_days=7\n```\n\nFull reference: see `config.example.env`.\n\n## Whitelist\n\n`/etc/nginx/autoblock-whitelist.conf` — CIDRs that are **never** auto-blocked.\n\nThe default template includes:\n- Major search engines (Google, Bing, Yandex, Baidu, DuckDuckGo)\n- AI bots that benefit your AI search visibility (OpenAI ChatGPT-User/GPTBot/SearchBot, Anthropic ClaudeBot)\n- Social crawlers (Facebook, Twitter)\n- Cloudflare ranges (defense-in-depth: if your real_ip module ever breaks, origin sees CF IPs — don't auto-block all your users)\n\n**Always add your own IPs:** office, monitoring services (UptimeRobot, Pingdom), partner API clients, VPN exits used by your team.\n\nTo keep AI bot ranges current, run periodically:\n\n```bash\nsudo ./scripts/refresh-ai-whitelist.sh\n```\n\n## Operating\n\n```bash\n# Default mode (run by cron)\nsudo autoblock\n\n# Dry run — log what would be blocked, don't write\nsudo autoblock --dry-run\n\n# Diagnostic — show top 30 scored subnets with full breakdown\nsudo autoblock --show-scores\n\n# Remove expired bans (runs nightly via cron)\nsudo autoblock --cleanup\n\n# Alternative config\nsudo autoblock --config /path/to/config.env\n```\n\n**Log:** `/var/log/nginx-autoblock.log` — one line per decision (`BLOCK`, `EXTEND`, `UNBLOCK`).\n\n**Unblock a false positive:**\n\n```bash\nsudo vim /etc/nginx/blocked-subnets.conf  # delete the offending line\nsudo nginx -t \u0026\u0026 sudo nginx -s reload\n```\n\nManual entries (lines without an `# auto added=...` comment) are **never** touched by the cleanup job, so you can add permanent bans by hand.\n\n## Known limitations \u0026 risks\n\n- **VPN power users.** A single human using NordVPN/ExpressVPN can match `hosting/proxy + 1 UA`, scoring near the threshold. Realistically rare for most sites, but if your audience is privacy-conscious tech users, monitor `--show-scores` for VPN exits in the score-5 to score-6 range and consider raising `score_threshold` to 8.\n\n- **Mobile app traffic.** A native mobile app sends ONE User-Agent and hits APIs almost exclusively — that's exactly the bot signature. If you have a mobile app, whitelist its backend IPs or the carrier ranges it uses.\n\n- **Partner integrations / cron clients hitting your API.** Same pattern as a bot — one UA, all API. Always whitelist these by IP.\n\n- **Microsoft Azure as a whole** is NOT flagged as hosting by default. This is intentional — many legitimate AI bots (ChatGPT-User, GPTBot) live on Azure, and we'd rather let them through than block ChatGPT. The trade-off: less-known bots from generic Azure subnets are caught only if `ip-api` flags them specifically.\n\n- **Single Cloudflare-fronted setup tested.** The static-ratio caveat assumes a CDN cache in front. For direct-to-origin nginx, you might benefit from re-adding a static-asset-ratio signal — or enable the per-IP pass which uses asset-ratio at the individual-IP level.\n\n- **Per-IP pass trusts claimed-bot UAs without PTR verification** as of v1.1. If a scraper spoofs `Googlebot` in its User-Agent, the per-IP pass currently skips it. Full PTR + forward-DNS verification is implementation-ready and tracked for v1.2. Until then, the subnet pass still catches concentrated spoofers, and the UA whitelist for AI bots is separately verified via published IP ranges (`scripts/refresh-ai-whitelist.sh`).\n\n## Data sources\n\n- **ip2asn-combined.tsv.gz** — from [iptoasn.com](https://iptoasn.com/), free, no signup, daily updates. ~700k entries (522k IPv4 + 176k IPv6 ranges).\n- **ip-api.com** — free tier, 45 batch requests/min, no signup. Used for `proxy`/`hosting`/`mobile` flags on candidate subnets.\n- **OpenAI bot ranges** — official JSON at `openai.com/chatgpt-user.json` (and similar for GPTBot, OAI-SearchBot).\n- **Cloudflare ranges** — official at `cloudflare.com/ips-v4` and `ips-v6`.\n\nAll data fetched at runtime / install time. No vendor secrets, no API keys required for default operation.\n\n## Contributing\n\nContributions welcome — bug reports, feature ideas, code, docs improvements. See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, code style, and what kinds of contributions are most useful.\n\n- **Bugs**: open an [issue](https://github.com/djeshkov/nginx-autoblock/issues/new?template=bug_report.yml).\n- **Feature ideas / new signals**: open an [issue](https://github.com/djeshkov/nginx-autoblock/issues/new?template=feature_request.yml).\n- **Questions / tuning advice / sharing configs**: open a [Discussion](https://github.com/djeshkov/nginx-autoblock/discussions).\n- **Security vulnerabilities**: see [SECURITY.md](SECURITY.md) — please do **not** file public issues.\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n\n## Acknowledgements\n\nInspired by frustration with distributed bot crawls slipping past `limit_req_zone $binary_remote_addr` and observation that headless-Chrome bots show up in Google Analytics as \"real users\" while staying nearly invisible in server-log top-IP statistics.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjeshkov%2Fnginx-autoblock","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdjeshkov%2Fnginx-autoblock","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjeshkov%2Fnginx-autoblock/lists"}