https://github.com/dgl/haphash
Anti-scraper challenge for haproxy to stop naughty AI bots.
- Host: GitHub
- URL: https://github.com/dgl/haphash
- Owner: dgl
- License: 0BSD
- Created: 2025-04-28T01:19:13.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-08-02T01:41:42.000Z (6 months ago)
- Last Synced: 2025-10-11T02:42:23.772Z (4 months ago)
- Topics: haproxy, waf
- Language: HTML
- Homepage: https://dgl.cx/2025/04/using-haproxy-to-stop-scrapers
- Size: 11.7 KB
- Stars: 59
- Watchers: 3
- Forks: 1
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: COPYING
# haphash: Anti-scraper for haproxy
This is a simple anti-scraper solution for [haproxy](https://www.haproxy.org),
using a "hashcash" challenge similar to the one
[anubis](https://xeiaso.net/blog/2025/anubis/) uses. The goal is to be as
simple as possible, so it can be implemented alongside other haproxy rules to
control traffic.
## Overview
AI crawlers keep [breaking the
web](https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/).
I have mostly avoided this problem by having very lightweight pages, but lately
I've noticed some scrapers being particularly obnoxious. Many solutions to
this problem involve adding another proxy component, but I'm already running
haproxy in most places; it's a perfectly fine reverse proxy, and I don't
want to make things more complex if I can avoid it.
This uses a haproxy ["stick
table"](https://www.haproxy.com/blog/introduction-to-haproxy-stick-tables) to
store details of IP addresses. It works by simply allowing individual IP
addresses, rather than setting cookies. As the IP address is stored in memory
and there's no cookie, this likely does not add to any GDPR obligations (this
is not legal advice).
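As a rough sketch of the idea (the table and counter names here are
illustrative, not necessarily what the shipped `haproxy.conf` uses), a stick
table keyed on source IP can record which addresses have already solved the
challenge:
```
# Sketch only: one entry per client IP; gpc0 acts as a "challenge
# solved" flag, and entries expire a day after their last update.
backend per_ip
    stick-table type ip size 100k expire 24h store gpc0
```
Once a client submits a valid answer, incrementing `gpc0` on its tracked entry
marks the IP as allowed, and an ACL like `sc0_get_gpc0 gt 0` lets it through on
later requests.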
It is expected this will be combined with haproxy's IP-based [rate
limiting](https://www.haproxy.com/blog/four-examples-of-haproxy-rate-limiting),
with the benefit that this doesn't add another component to the system.
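For instance, a second stick table tracking request rates can sit alongside the
allow table. This snippet follows the pattern from the linked article and is
not taken from this repository:
```
# Sketch: refuse clients making more than 100 requests per 10 seconds.
backend per_ip_rates
    stick-table type ip size 100k expire 10m store http_req_rate(10s)

frontend www
    http-request track-sc1 src table per_ip_rates
    http-request deny deny_status 429 if { sc1_http_req_rate gt 100 }
```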
If you want to try it out, my [contact](https://dgl.cx/contact) page is always
protected by it.
## The moving parts
[`challenge.html`](challenge.html) is the HTML served to clients, templated via
haproxy. (Because it is templated, you can't just open it in your browser --
note the double percent signs.)
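One plausible way to serve such a template from haproxy is an `lf-file` rule,
which expands `%[...]` expressions in the file and treats `%%` as a literal
percent sign; the actual rule in `haproxy.conf` may differ:
```
backend challenge
    # lf-file parses the body as a log-format string, so sample
    # expressions such as %[src] are expanded at request time.
    # (The status code here is arbitrary, chosen for illustration.)
    http-request return status 403 content-type "text/html" lf-file /etc/haproxy/challenge.html
```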
[`haproxy.conf`](haproxy.conf) is a haproxy config snippet that makes use of
this. It's expected that you'll adjust it for your deployment. The "challenge"
backend is where the majority of the logic lives and should only need tiny
changes.
This is small:
```console
$ wc -l haproxy.conf challenge.html
38 haproxy.conf
94 challenge.html
132 total
```
## Set-up
Copy [`challenge.html`](challenge.html) to `/etc/haproxy/challenge.html` (or
other suitable location).
From [`haproxy.conf`](haproxy.conf), add the `challenge` backend to your haproxy
configuration, then add the relevant lines from `frontend www` to your frontend
section.
To start with, it is recommended you protect a single path for testing purposes.
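The frontend side likely ends up shaped something like this (paths, backend
names, and the tracked table are illustrative; copy the real lines from
`haproxy.conf`):
```
frontend www
    bind :80
    # Track the client IP in the allow table from the earlier sketch.
    http-request track-sc0 src table per_ip
    acl protected path_beg /contact
    acl solved sc0_get_gpc0 gt 0
    # Unsolved clients on the protected path get the challenge page.
    use_backend challenge if protected !solved
    default_backend site
```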
Restarting haproxy will clear the stick table (configure
[peers](https://www.haproxy.com/documentation/haproxy-configuration-tutorials/proxying-essentials/custom-rules/stick-tables/#synchronize-stick-tables-across-peers)
to make the allowed IP addresses persist).
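A minimal peers section looks like this (peer names and addresses are
placeholders); pointing the stick table at it makes entries survive reloads and
replicate between instances:
```
peers haphash_peers
    peer lb1 192.0.2.10:10000
    peer lb2 192.0.2.11:10000

backend per_ip
    stick-table type ip size 100k expire 24h store gpc0 peers haphash_peers
```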
The difficulty is set in both the HTML and the haproxy config; it defaults to 4
(which is pretty fast).
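Assuming an anubis-style scheme where a SHA-256 digest must begin with
`difficulty` zero hex digits, difficulty 4 means about 16^4 ≈ 65,000 hash
attempts on average, which a browser works through in roughly a second or less.
On the haproxy side, verification could then be a single rule built from the
`sha2` and `hex` converters; this is a hypothetical sketch, and the parameter
name and exact check are assumptions, not the shipped config:
```
# Hypothetical: mark the tracked IP as solved when the submitted
# answer hashes to a value with four leading zero hex digits.
http-request sc-inc-gpc0(0) if { url_param(answer),sha2(256),hex,lower -m beg 0000 }
```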
## License
© [David Leadbeater](https://一.st) 2025; [0BSD](https://dgl.cx/0bsd), see
[COPYING](COPYING).
## Alternatives
* [haproxy enterprise](https://www.haproxy.com/documentation/haproxy-configuration-tutorials/security/enterprise-features/)
* [tedu's anticrawl](https://flak.tedunangst.com/post/anticrawl)
* [anubis](https://anubis.techaro.lol/)
* [go-away](https://git.gammaspectra.live/git/go-away)
* [haproxy-protection](https://gitgud.io/fatchan/haproxy-protection/)
## Credits
* [Xe Iaso](https://xeiaso.net/) for anubis.