https://github.com/six-two/mkdocs-anti-ai-scraper-plugin

It has AI in the name, so dear VC people give me lots of money ;) Just kidding, this plugin tries to tell bots not to steal your pages' contents
https://github.com/six-two/mkdocs-anti-ai-scraper-plugin

anti-scraping mkdocs-plugin work-in-progress

Last synced: 9 months ago
JSON representation

It has AI in the name, so dear VC people give me lots of money ;) Just kidding, this plugin tries to tell bots not to steal your pages' contents

Host: GitHub
URL: https://github.com/six-two/mkdocs-anti-ai-scraper-plugin
Owner: six-two
Created: 2025-08-27T17:14:39.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-08-28T18:55:53.000Z (9 months ago)
Last Synced: 2025-08-29T00:56:56.547Z (9 months ago)
Topics: anti-scraping, mkdocs-plugin, work-in-progress
Language: Python
Homepage:
Size: 24.4 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # MkDocs Anti AI Scraper Plugin

This plugin tries to prevent AI scrapers from easily ingesting your website's contents.

It is probably implemented pretty badly and by design it can be bypassed by anyone that invests a bit of time, but it is probably better than nothing.

## Installation

Install the plugin with `pip`:

```bash

pip install mkdocs-anti-ai-scraper-plugin

```

Then add the plugin to your `mkdocs.yml`:

```yaml

plugins:

- search

- anti_ai_scraper

```

Or with all config options:

```yaml

plugins:

- search

- anti_ai_scraper:

    robots_txt: True

    sitemap_xml: True

    encode_html: True

    debug: False

```

## Implemented Techniques

Technique | Scraper Protection | Impact on human visitors | Enabled by default

--- | --- | --- | ---

Add robots.txt | weak | none | yes

Remove sitemap.xml | very weak | none | yes

Encode HTML | only against simple HTML parser based scrapers | slows down page loading, may break page events | true

### Add robots.txt

This technique is enabled by default, and can be disabled by setting the option `robots_txt: False` in `mkdocs.yml`.

If enabled, it adds a `robots.txt` with the following contents to the output directory:

```

User-agent: *

Disallow: /

```

This hints to crawlers that they should not crawl your site.

This technique does not hinder normal users from using the site at all.

However, the `robots.txt` is not enforcing anything.

It just tells well-behaved bots how you would like them to behave.

Many AI bots may just ignore it ([Source](https://www.tomshardware.com/tech-industry/artificial-intelligence/several-ai-companies-said-to-be-ignoring-robots-dot-txt-exclusion-scraping-content-without-permission-report)).

### Remove sitemap.xml

This technique is enabled by default, and can be disabled by setting the option `robots_txt: False` in `mkdocs.yml`.

If enabled, it removes the `sitemap.xml` and `sitemap.xml.gz` files.

This prevents leaking the paths to pages not referenced by your navigation.

### Encode HTML

This technique is enabled by default, and can be disabled by setting the option `robots_txt: False` in `mkdocs.yml`.

If enabled, it encodes (zip + ASCII85) each page's contents and will decode it in the user's browser with JavaScript.

This obscures the page contents to simple scrapers that just download and parse your HTML.

It will not work against any bots that use remote controlled browsers (using selenium or other tech).

The decoding takes some time and will result in browser events (like `onload`) being fired before the page is decoded.

This may break some functionality, that listens to these events and expects them to happen.

## Planned Techniques

- Encrypt page contents and adding client side "CAPTCHA" to generate the key: Should help against primitive browser based bots.

    It would probably make sense to just let the user solve the CAPTCHA once and cache the key as a cookie or in `localStorage`.

- Bot detection JS: Will be a cat and mouse game, but should help against badly written crawlers

Suggestions welcome: If you know bot detection mechanisms, that can be used with static websites, feel free to open an issue :D

## Problems and Considerations

- Similar to the [encryption plugin](https://github.com/unverbuggt/mkdocs-encryptcontent-plugin), the encryption of the search index is hard.

    So best disable search to prevent anyone from accessing its index.

- Obviously, to protect your contents from scraping, you should not have their source code hosted in public repos ;D

- By blocking bots, you also prevent search engines like Google from properly indexing your site.

## Notable changes

### Version 0.1.0

- Added `encode_html` option

- Added `sitemap_xml` option

### Version 0.0.1

- Added `robots_txt` option

## Development Commands

This repo is managed using [poetry](https://github.com/python-poetry/poetry?tab=readme-ov-file).

You can install `poetry` with `pip install poetry` or `pipx install poetry`.

Clone repo:

```bash

git clone git@github.com:six-two/mkdocs-anti-ai-scraper-plugin.git

```

Install/update extension locally:

```bash

poetry install

```

Build test site:

```bash

poetry run mkdocs build

```

Serve test site:

```bash

poetry run mkdocs serve

```

### Release

Set PyPI API token (only needed once):

```bash

poetry config pypi-token.pypi YOUR_PYPI_TOKEN_HERE

```

Build extension:

```bash

poetry build

```

Upload extension:

```bash

poetry publish

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/six-two/mkdocs-anti-ai-scraper-plugin

Awesome Lists containing this project

README