https://github.com/djeshkov/nginx-autoblock

Behavioral subnet autoblocker for Nginx — composite scoring + free IP reputation

Host: GitHub
URL: https://github.com/djeshkov/nginx-autoblock
Owner: djeshkov
License: mit
Created: 2026-05-14T15:00:23.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-05-15T13:48:13.000Z (2 months ago)
Last Synced: 2026-05-15T14:28:00.763Z (2 months ago)
Topics: bot-blocker, cloudflare, ip-reputation, nginx, rate-limiting, security, web-scraping-protection
Language: Python
Size: 57.6 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md


# nginx-autoblock

Behavioral autoblocker for Nginx. Detects bot crawlers by **composite scoring** across multiple signals (UA diversity, request patterns, IP reputation, behavioral fingerprint) and adds offending **subnets, individual IPs and UA-clusters** to nginx's block-list with TTL.

Designed for three threat classes that per-IP rate-limiting (`limit_req_zone $binary_remote_addr`) misses:
- **Concentrated botnets** — same /24 producing 100+ req/h, each IP individually below per-IP limits (subnet pass, default).
- **Distributed scraping** — hundreds of cloud IPs from many ASNs, 1-2 requests each, mass-scraping public URLs harvested from sitemaps or catalog/product pages (per-IP pass, opt-in since v1.1).
- **Distributed botnets** — hundreds of IPs making ~1 request each while rotating a tiny pool of User-Agent strings; no IP and no /24 stands out, but the shared UA does (UA-cluster pass, opt-in since v1.2).

```
             ┌─────────────────────────────────────────────────┐
nginx logs ──┤ autoblock (every 10 min via cron)               │
             │                                                 │
             │ Subnet pass (default):                          │
             │   group requests by /24 or /64                  │
             │   score 0-11 against 5 behavioral signals       │
             │   enrich via ip-api.com (free)                  │
             │   block /24 if score ≥ 7                        │
             │                                                 │
             │ Per-IP pass (opt-in since v1.1):                │
             │   score each IP 0-14 (path-agnostic)            │
             │   catches distributed scrapers (1 req/IP)       │
             │   block /32 if score ≥ 9                        │
             │                                                 │
             │ UA-cluster pass (opt-in since v1.2):            │
             │   group by User-Agent, score the cluster        │
             │   catches distributed botnets (shared UA)       │
             │   block member /32s if score ≥ 7                │
             └────────────────┬────────────────────────────────┘
                              │
                              ▼
          /etc/nginx/blocked-subnets.conf      (subnet pass)
          /etc/nginx/blocked-ips.conf          (per-IP pass)
          /etc/nginx/blocked-ua-clusters.conf  (UA-cluster pass)
                              │
                              ▼
                   nginx returns 444 to bot
```

## Why this exists

Per-IP rate limits (`limit_req_zone $binary_remote_addr`) don't catch distributed crawls: a bot operator with 25 IPs inside one /24 emits 1.5 req/min per IP — far below the per-IP threshold, but ~38 req/min from the subnet in aggregate, with one User-Agent and identical request patterns.
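The arithmetic in that example can be checked with a few lines of Python. This is only an illustrative sketch: `subnet_key` is a hypothetical helper name, the addresses come from a documentation range, and the /24 and /64 group sizes are the ones this README uses.

```python
import ipaddress
from collections import Counter

def subnet_key(ip: str) -> str:
    """Map an address to its /24 (IPv4) or /64 (IPv6) network."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 64
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

# 25 bot IPs in one /24, each sending 1.5 req/min over a 30-minute window.
requests_per_ip = int(1.5 * 30)  # 45 requests per IP
hits = Counter()
for host in range(1, 26):
    hits[subnet_key(f"198.51.100.{host}")] += requests_per_ip

subnet, total = hits.most_common(1)[0]
per_minute = total / 30  # aggregate rate for the whole /24
print(subnet, total, per_minute)
```

Each IP stays at 1.5 req/min, while the single /24 bucket accumulates 1125 requests in the window, or 37.5 req/min.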

Existing tools occupy adjacent niches:

| Tool | Approach | Limitation |
|------|----------|------------|
| [nginx-ultimate-bad-bot-blocker](https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker) | Static UA/referrer/IP block-lists + fail2ban | Not adaptive — won't catch new bots without list updates |
| [fail2ban-subnets](https://github.com/XaF/fail2ban-subnets) / [recidive-subnet](https://github.com/ruppel/fail2ban-recidive-subnet) | Escalate per-IP bans to /24 when enough hits | Counter-only — no behavioral analysis; depends on per-IP bans firing first |
| [Cloudflare Bot Management](https://developers.cloudflare.com/bots/concepts/bot-score/) | ML scoring 1-99 | Paid, vendor lock-in |

`nginx-autoblock` sits in the middle: **adaptive behavioral scoring with free reputation data**, no fail2ban dependency.

## How it works

For each `/24` (IPv4) or `/64` (IPv6) seen in the last 30 minutes, score it against 5 signals (max 11 points) and subtract 1 if ip-api.com marks it as a mobile carrier. Block if score ≥ 7.

| Signal | Points | What it detects |
|--------|--------|-----------------|
| `≤ 2` unique User-Agents | **+2** | Homogeneous bot farm |
| Target paths ≥ 50% / ≥ 80% of requests | **+1 / +1** additional | Focused API or search hammering |
| Top-3 URLs ≥ 50% / ≥ 80% of requests | **+1 / +1** additional | Low URL diversity (bot vs human browsing) |
| Referer rate < 30% / < 10% | **+1 / +1** additional | Real browsers send referer on link clicks |
| ip-api.com `hosting=true` OR `proxy=true` | **+3** | Datacenter / proxy origin |
| ip-api.com `mobile=true` | **-1** | Mobile carrier — likely real users |

**Gates:**
- Subnet must have ≥ `min_requests` (default 200) in the window — below this, not evaluated.
- Whitelist hits (search engines, AI bots, your own IPs) are skipped before scoring.
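The signal table and the volume gate can be read as a small scoring function. The sketch below is an interpretation of the table, not the project's code: `SubnetStats`, `tiered`, and `score_subnet` are hypothetical names, and clamping at 0 is an assumption based on the "score 0-11" range quoted earlier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubnetStats:
    requests: int          # total requests in the window
    unique_uas: int        # distinct User-Agent strings
    target_ratio: float    # share of requests hitting target_paths
    top3_ratio: float      # share of requests going to the 3 most-hit URLs
    referer_ratio: float   # share of requests carrying a Referer header
    hosting: bool = False  # ip-api.com hosting flag
    proxy: bool = False    # ip-api.com proxy flag
    mobile: bool = False   # ip-api.com mobile flag

def tiered(value: float, low: float, high: float) -> int:
    """+1 at the first tier, +1 more at the second."""
    return int(value >= low) + int(value >= high)

def score_subnet(s: SubnetStats, min_requests: int = 200) -> Optional[int]:
    """Return the score, or None when the volume gate is not met."""
    if s.requests < min_requests:
        return None
    score = 2 if s.unique_uas <= 2 else 0
    score += tiered(s.target_ratio, 0.5, 0.8)
    score += tiered(s.top3_ratio, 0.5, 0.8)
    # Referer tiers fire when the rate is BELOW 30% and below 10%.
    score += int(s.referer_ratio < 0.3) + int(s.referer_ratio < 0.1)
    if s.hosting or s.proxy:
        score += 3
    if s.mobile:
        score -= 1
    return max(score, 0)  # assumption: scores are clamped at 0

bot = SubnetStats(1125, 1, 0.9, 0.85, 0.02, hosting=True)
human = SubnetStats(400, 40, 0.2, 0.3, 0.6, mobile=True)
print(score_subnet(bot), score_subnet(human))
```

With the default threshold of 7, the `bot` subnet would be blocked and the `human` one would not; a subnet under 200 requests is never scored at all.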

**Static-asset ratio is NOT a signal.** Behind a CDN, static files (CSS/JS/images) are served from the edge cache — only ~5% of static traffic reaches origin nginx, so this ratio is similar between humans and bots at origin and provides no discrimination.

**ip-api.com batch enrichment** queries up to 100 IPs in one HTTP request, free, no signup. Results cached for 7 days per subnet. Falls back to offline ASN keyword matching (via `iptoasn.com` database) if the API is unreachable.
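A batch lookup of this kind might look like the sketch below. The endpoint URL and field names are taken from ip-api.com's public documentation as I understand it and should be re-checked there; the function names are hypothetical, and the 7-day cache and the offline ASN fallback described above are left out.

```python
import json
import urllib.request

# Assumed endpoint and fields; verify against the ip-api.com docs before use.
BATCH_URL = "http://ip-api.com/batch?fields=status,query,proxy,hosting,mobile"
BATCH_SIZE = 100  # per-request limit stated above

def chunk(ips, size=BATCH_SIZE):
    """Split a list of IPs into batches of at most `size`."""
    return [ips[i:i + size] for i in range(0, len(ips), size)]

def build_batch_request(ips):
    """Build (but do not send) one POST request for up to 100 IPs."""
    if len(ips) > BATCH_SIZE:
        raise ValueError("batch accepts at most 100 queries")
    body = json.dumps(ips).encode("utf-8")
    return urllib.request.Request(
        BATCH_URL, data=body, headers={"Content-Type": "application/json"}
    )

def enrich(ips, timeout=10):
    """Send every batch and return {ip: flags}. Raises on network errors."""
    flags = {}
    for batch in chunk(ips):
        request = build_batch_request(batch)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            for row in json.load(resp):
                if row.get("status") == "success":
                    flags[row["query"]] = {
                        key: bool(row.get(key)) for key in ("proxy", "hosting", "mobile")
                    }
    return flags
```

A caller would wrap `enrich` in a `try`/`except OSError` and switch to the offline ASN lookup when the request fails.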

## Per-IP scoring (distributed scraping)

The subnet pass has an architectural limit: when bot operators spread requests across **many cloud IPs, 1-2 requests each**, no /24 accumulates enough volume to trip. Since **v1.1**, an opt-in second pass scores each IP on its own behavioral fingerprint.

```ini
# /etc/nginx-autoblock/config.env
per_ip_enabled=true
per_ip_threshold=9
internal_ref_hosts=example.com,www.example.com # for noref/extref signal
self_ips=203.0.113.1 # your origin IP(s)
```

Then either let the regular cron run pick it up (subnet pass runs first, then per-IP pass), or invoke it directly:

```bash
sudo autoblock --show-per-ip # diagnostic — top 50 candidates, read-only
sudo autoblock --per-ip --dry-run # what would be blocked
sudo autoblock --per-ip # actually block
```

Output goes to `/etc/nginx/blocked-ips.conf` — separate from the subnet file. Both are included in the same `geo $blocked_subnet` block (see `nginx/blacklist.conf`).

### Signal set (path-agnostic)

| Signal | Trigger | Points | Min req |
|--------|---------|--------|---------|
| **noassets** | Asset-loading ratio < 5% | +3 | N ≥ 3 |
| **noref** | No-referer ratio > 80% | +2 | N ≥ 2 |
| **extref** | External-referer ratio > 50% | +1 | has-ref ≥ 3 |
| **4xx** | 4xx-response ratio > 30% | +1 | N ≥ 5 |
| **upath** | Unique-paths ratio ≥ 95% | +2 | N ≥ 5 |
| **cloud** | ASN description matches hosting/cloud keywords | +3 | — |
| **ua:oldchrome** | Chrome major version < threshold (default 142) | +2 | — |
| **ua:headless** | UA matches HeadlessChrome / Puppeteer / Selenium / Scrapy | +3 | — |
| **ua:short** | UA length < 20 | +2 | — |

**Maximum score: 14.** Default threshold: 9. Whitelisted UAs (Privacy Preserving Prefetch Proxy, imgix, monitoring services, claimed search-engine bots) skip scoring entirely.

The five behavioral signals (noassets, noref, extref, 4xx, upath) need several requests from the same IP before they can fire (see the Min req column). The cloud and UA signals work at N=1; they are what catches single-hit distributed scrapers.
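As a rough illustration of the per-IP table, the sketch below adds up every signal whose trigger and minimum-request condition hold. How the real tool combines overlapping signals is not stated in this README, so the sketch simply caps the sum at the documented maximum of 14; `score_ip` and its parameter names are hypothetical.

```python
import re

CHROME_RE = re.compile(r"Chrome/(\d+)\.")
HEADLESS_RE = re.compile(r"HeadlessChrome|Puppeteer|Selenium|Scrapy", re.IGNORECASE)
MAX_SCORE = 14  # documented maximum for the per-IP pass

def score_ip(n, asset_ratio, noref_ratio, extref_ratio, n_with_ref,
             ratio_4xx, unique_path_ratio, is_cloud, ua, chrome_threshold=142):
    """Return (score, fired), where fired maps signal name to points."""
    fired = {}
    if n >= 3 and asset_ratio < 0.05:
        fired["noassets"] = 3
    if n >= 2 and noref_ratio > 0.80:
        fired["noref"] = 2
    if n_with_ref >= 3 and extref_ratio > 0.50:
        fired["extref"] = 1
    if n >= 5 and ratio_4xx > 0.30:
        fired["4xx"] = 1
    if n >= 5 and unique_path_ratio >= 0.95:
        fired["upath"] = 2
    if is_cloud:
        fired["cloud"] = 3
    chrome = CHROME_RE.search(ua)
    if chrome and int(chrome.group(1)) < chrome_threshold:
        fired["ua:oldchrome"] = 2
    if HEADLESS_RE.search(ua):
        fired["ua:headless"] = 3
    if len(ua) < 20:
        fired["ua:short"] = 2
    # Assumption: plain sum, capped at the documented maximum.
    return min(sum(fired.values()), MAX_SCORE), fired
```

Under these assumptions, a cloud IP that sends two referer-less requests with a `HeadlessChrome/120` User-Agent reaches the default threshold of 9 without any of the higher-volume signals firing.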

### When to enable

Enable the per-IP pass when you observe **either**:
- Your access log shows many distinct cloud IPs each hitting one specific endpoint (e.g., `/reservation/`, `/product/`, `/profile/`) once each.
- Session-recording or analytics tools show short bot-like sessions (< 5s, 0 clicks) from many countries / IPs — but `--show-scores` (the subnet pass) finds nothing because no /24 is hot enough.

Backtest details and signal calibration: [docs/SCORING.md § Per-IP pass](docs/SCORING.md#per-ip-pass-opt-in).
Real-world first-hour results from a Laravel-fronted reference site: [docs/CASE-STUDY.md](docs/CASE-STUDY.md).

## UA-cluster scoring (distributed botnets)

Both the subnet and per-IP passes score each subnet or IP **in isolation**. A distributed
botnet defeats both by design — hundreds of IPs, ~1 request each, every IP
individually innocent. But the botnet rotates a **tiny pool of User-Agent
strings** across its whole fleet. One UA shared by 250 datacenter IPs is not
something a real browser population produces. Since **v1.2**, an opt-in third
pass groups requests by User-Agent and scores the cluster.

```ini
# /etc/nginx-autoblock/config.env
ua_cluster_enabled=true
ua_cluster_min_ips=30 # min distinct IPs sharing a UA to evaluate it
ua_cluster_threshold=7
ua_cluster_min_hosting=0.5 # hosting-ratio gate (see note below)
```

Run it after the regular cron passes, or directly:

```bash
sudo autoblock --show-ua-cluster # diagnostic — flagged clusters, read-only
sudo autoblock --ua-cluster --dry-run # what would be blocked
sudo autoblock --ua-cluster # actually block
```

Output goes to `/etc/nginx/blocked-ua-clusters.conf` — a confirmed botnet
cluster contributes all its member IPs as `/32` bans.

### Signal set

| Signal | Trigger | Points |
|--------|---------|--------|
| **host** / **host+** | Cluster hosting-ASN ratio ≥ 50% / ≥ 80% | +2 / +2 additional |
| **noassets** | Cluster asset-loading ratio < 5% | +3 |
| **noref** | Cluster no-referer ratio > 80% | +2 |
| **4xx** | Cluster 4xx-response ratio > 30% | +1 |
| **ua:headless / oldchrome / short** | UA is a headless tool, old Chrome, or thin | +3 / +2 / +2 |

**Maximum score: 13.** Default threshold: 7. The discriminator is **hosting-ASN
ratio and behavior — never raw IP count**: a current Chrome UA shared by
thousands of residential users scores 0, while a botnet UA shared by 250
datacenter IPs scores 9. Whitelisted and claimed search-engine UAs skip scoring.

**Hosting-ratio gate** (`ua_cluster_min_hosting`, default 0.5): a cluster whose
IPs are less than that fraction on hosting/datacenter ASNs is never blocked,
regardless of score. Behind a CDN, static assets are edge-cached so the
`noassets` signal fires on real-user clusters too — the gate makes hosting-ASN
ratio a necessary condition. Set to 0 only on origins not behind a CDN.
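The cluster score and the hosting-ratio gate can be sketched as one function. This is an interpretation of the table above, not the project's code: `score_ua_cluster` is a hypothetical name, and `ua_points` stands for the single strongest UA signal (3, 2, or 0), which is what a maximum of 13 implies.

```python
def score_ua_cluster(n_ips, hosting_ratio, asset_ratio, noref_ratio, ratio_4xx,
                     ua_points=0, min_ips=30, threshold=7, min_hosting=0.5):
    """Return (score, block) for one User-Agent cluster.

    score is None when the cluster has too few distinct IPs to evaluate.
    """
    if n_ips < min_ips:
        return None, False
    # host (+2 at 50%) and host+ (+2 more at 80%)
    score = 2 * (int(hosting_ratio >= 0.5) + int(hosting_ratio >= 0.8))
    score += 3 if asset_ratio < 0.05 else 0
    score += 2 if noref_ratio > 0.8 else 0
    score += 1 if ratio_4xx > 0.3 else 0
    score += ua_points
    # The gate makes hosting-ASN ratio a necessary condition for blocking.
    block = score >= threshold and hosting_ratio >= min_hosting
    return score, block

botnet = score_ua_cluster(250, 0.95, 0.0, 0.9, 0.1)
residential = score_ua_cluster(5000, 0.02, 0.6, 0.2, 0.01)
gated = score_ua_cluster(400, 0.3, 0.0, 0.9, 0.4, ua_points=2)
print(botnet, residential, gated)
```

The `gated` cluster shows why the gate exists: it scores 8 on behavior alone, but with only 30% of its IPs on hosting ASNs it is not blocked.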

Signal calibration, the hosting-ratio gate, and the May 2026 reference incident: [docs/SCORING.md § UA-cluster pass](docs/SCORING.md#ua-cluster-pass-opt-in).

## Quick install

```bash
git clone https://github.com/djeshkov/nginx-autoblock.git
cd nginx-autoblock
sudo ./scripts/install.sh
```

The installer:
- Copies `autoblock` to `/usr/local/bin/`
- Creates `/etc/nginx-autoblock/config.env` from the template
- Creates `/etc/nginx/blocked-subnets.conf` (empty) and `/etc/nginx/autoblock-whitelist.conf` (template)
- Installs `/etc/nginx/conf.d/blacklist.conf` (the `geo $blocked_subnet` map)
- Fetches the ASN database (~9 MB) to `/var/lib/nginx-autoblock/`
- Installs cron schedule

**Manual nginx step:** add this inside your `server { }` block:

```nginx
if ($blocked_subnet) {
    return 444;
}
```

(See `nginx/server-snippet.conf`. `444` closes the connection without sending a response — cheapest possible block.)

Then:

```bash
sudo nginx -t && sudo nginx -s reload
sudo /usr/local/bin/autoblock --dry-run # see what would block
sudo /usr/local/bin/autoblock --show-scores # diagnostic — top 30 with score breakdown
```

## Configuration

Edit `/etc/nginx-autoblock/config.env`. Most important settings:

```ini
access_log=/var/log/nginx/access.log

# Tune target_paths to your application — bots hammer specific endpoints.
# For a typical web app: APIs and search are common targets.
target_paths=/api/,/search

# Exclude paths that look like targets but are legitimate (admin panels, etc.)
excluded_paths=/api/admin/

# Volume gate — raise if you have a lot of organic traffic from active power users.
min_requests=200

# Score threshold for blocking (max 11).
# 7 = balanced (default). 8-9 = more conservative (fewer blocks, fewer false positives).
score_threshold=7

ttl_days=7
```

Full reference: see `config.example.env`.
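For scripts that need to read the same settings, a `key=value` file with `#` comments can be parsed in a few lines. This is a minimal sketch, not the tool's own loader: `parse_env` and `as_list` are hypothetical helpers, and the parser assumes values never contain a literal `#`.

```python
def parse_env(text: str) -> dict:
    """Parse key=value lines; '#' starts a comment, blank lines are skipped."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config

def as_list(value: str) -> list:
    """Split a comma-separated setting such as target_paths."""
    return [item.strip() for item in value.split(",") if item.strip()]

sample = """
# Volume gate
min_requests=200
target_paths=/api/,/search
self_ips=203.0.113.1   # your origin IP(s)
"""
config = parse_env(sample)
print(int(config["min_requests"]), as_list(config["target_paths"]))
```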

## Whitelist

`/etc/nginx/autoblock-whitelist.conf` — CIDRs that are **never** auto-blocked.

The default template includes:
- Major search engines (Google, Bing, Yandex, Baidu, DuckDuckGo)
- AI bots that benefit your AI search visibility (OpenAI ChatGPT-User/GPTBot/OAI-SearchBot, Anthropic ClaudeBot)
- Social crawlers (Facebook, Twitter)
- Cloudflare ranges (defense-in-depth: if your real_ip module ever breaks, origin sees CF IPs — don't auto-block all your users)

**Always add your own IPs:** office, monitoring services (UptimeRobot, Pingdom), partner API clients, VPN exits used by your team.
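Before relying on the whitelist, it can help to check that a given address really falls inside one of its CIDRs. The sketch below uses the standard-library `ipaddress` module; the file layout it accepts (one CIDR per line, optional trailing value and `;`, `#` comments) is an assumption, and both function names are hypothetical.

```python
import ipaddress

def load_cidrs(lines):
    """Parse CIDR entries, ignoring comments, blank lines and a trailing ';'."""
    nets = []
    for raw in lines:
        entry = raw.split("#", 1)[0].strip().rstrip(";").strip()
        if entry:
            # Keep only the first token in case a value follows the CIDR.
            nets.append(ipaddress.ip_network(entry.split()[0], strict=False))
    return nets

def is_whitelisted(ip, nets):
    """True when the address belongs to any whitelisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in nets)

nets = load_cidrs(["# office", "203.0.113.0/24;", "2001:db8::/32  # lab"])
print(is_whitelisted("203.0.113.77", nets), is_whitelisted("198.51.100.1", nets))
```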

To keep AI bot ranges current, run periodically:

```bash
sudo ./scripts/refresh-ai-whitelist.sh
```

## Operating

```bash
# Default mode (run by cron)
sudo autoblock

# Dry run — log what would be blocked, don't write
sudo autoblock --dry-run

# Diagnostic — show top 30 scored subnets with full breakdown
sudo autoblock --show-scores

# Remove expired bans (runs nightly via cron)
sudo autoblock --cleanup

# Alternative config
sudo autoblock --config /path/to/config.env
```

**Log:** `/var/log/nginx-autoblock.log` — one line per decision (`BLOCK`, `EXTEND`, `UNBLOCK`).

**Unblock a false positive:**

```bash
sudo vim /etc/nginx/blocked-subnets.conf # delete the offending line
sudo nginx -t && sudo nginx -s reload
```

Manual entries (lines without an `# auto added=...` comment) are **never** touched by the cleanup job, so you can add permanent bans by hand.
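The distinction between auto-managed and hand-written entries can be illustrated with a short helper. It only checks for the presence of the `# auto added=` marker mentioned above and does not interpret whatever follows it; `split_entries` and the sample lines are hypothetical.

```python
AUTO_MARKER = "# auto added="

def split_entries(lines):
    """Separate auto-managed ban lines from hand-written ones."""
    auto, manual = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        (auto if AUTO_MARKER in line else manual).append(line)
    return auto, manual

sample = [
    "198.51.100.0/24 1; # auto added=...",
    "",
    "203.0.113.0/24 1; # permanent ban, added by hand",
]
auto, manual = split_entries(sample)
print(len(auto), len(manual))
```

A cleanup job built this way would only ever rewrite the `auto` list and write the `manual` lines back unchanged.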

## Known limitations & risks

- **VPN power users.** A single human using NordVPN/ExpressVPN can match `hosting/proxy + 1 UA`, scoring near the threshold. Realistically rare for most sites, but if your audience is privacy-conscious tech users, monitor `--show-scores` for VPN exits in the score-5 to score-6 range and consider raising `score_threshold` to 8.

- **Mobile app traffic.** A native mobile app sends ONE User-Agent and hits APIs almost exclusively — that's exactly the bot signature. If you have a mobile app, whitelist its backend IPs or the carrier ranges it uses.

- **Partner integrations / cron clients hitting your API.** Same pattern as a bot — one UA, all API. Always whitelist these by IP.

- **Microsoft Azure as a whole** is NOT flagged as hosting by default. This is intentional — many legitimate AI bots (ChatGPT-User, GPTBot) live on Azure, and we'd rather let them through than block ChatGPT. The trade-off: less-known bots from generic Azure subnets are caught only if `ip-api` flags them specifically.

- **Single Cloudflare-fronted setup tested.** The static-ratio caveat assumes a CDN cache in front. For direct-to-origin nginx, you might benefit from re-adding a static-asset-ratio signal — or enable the per-IP pass which uses asset-ratio at the individual-IP level.

- **Per-IP pass trusts claimed-bot UAs without PTR verification** as of v1.1. If a scraper spoofs `Googlebot` in its User-Agent, the per-IP pass currently skips it. Full PTR + forward-DNS verification is implementation-ready and tracked for v1.2. Until then, the subnet pass still catches concentrated spoofers, and the UA whitelist for AI bots is separately verified via published IP ranges (`scripts/refresh-ai-whitelist.sh`).
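
For readers unfamiliar with PTR + forward-DNS verification, the general technique looks like the sketch below. It is a generic illustration, not the project's planned implementation: the function names are hypothetical, the hostname suffixes are the ones Google documents for Googlebot, and the resolvers are injectable so the logic can be exercised without network access.

```python
import socket

GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def reverse_lookup(ip):
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]

def forward_lookup(host):
    """Forward lookup: hostname to the set of addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}

def is_verified_bot(ip, suffixes=GOOGLEBOT_SUFFIXES,
                    reverse=reverse_lookup, forward=forward_lookup):
    """True only if the PTR name has a trusted suffix AND resolves back to ip."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The forward step matters because anyone who controls the reverse zone for their own IP range can publish a PTR record ending in `googlebot.com`; only the real operator controls what that hostname resolves to.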

## Data sources

- **ip2asn-combined.tsv.gz** — from [iptoasn.com](https://iptoasn.com/), free, no signup, daily updates. ~700k entries (522k IPv4 + 176k IPv6 ranges).
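- **ip-api.com** — free tier, no signup. Its documented rate limits are 45 requests/min for single lookups and 15 requests/min for the batch endpoint (up to 100 IPs per request). Used for `proxy`/`hosting`/`mobile` flags on candidate subnets.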
- **OpenAI bot ranges** — official JSON at `openai.com/chatgpt-user.json` (and similar for GPTBot, OAI-SearchBot).
- **Cloudflare ranges** — official at `cloudflare.com/ips-v4` and `ips-v6`.

All data fetched at runtime / install time. No vendor secrets, no API keys required for default operation.

## Contributing

Contributions welcome — bug reports, feature ideas, code, docs improvements. See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, code style, and what kinds of contributions are most useful.

- **Bugs**: open an [issue](https://github.com/djeshkov/nginx-autoblock/issues/new?template=bug_report.yml).
- **Feature ideas / new signals**: open an [issue](https://github.com/djeshkov/nginx-autoblock/issues/new?template=feature_request.yml).
- **Questions / tuning advice / sharing configs**: open a [Discussion](https://github.com/djeshkov/nginx-autoblock/discussions).
- **Security vulnerabilities**: see [SECURITY.md](SECURITY.md) — please do **not** file public issues.

## License

MIT. See [LICENSE](LICENSE).

## Acknowledgements

Inspired by frustration with distributed bot crawls slipping past `limit_req_zone $binary_remote_addr` and by the observation that headless-Chrome bots show up in Google Analytics as "real users" while staying nearly invisible in server-log top-IP statistics.
