https://github.com/hubzero/a11y-catscan
Multi-Engine WCAG Compliance Crawler
- Host: GitHub
- URL: https://github.com/hubzero/a11y-catscan
- Owner: hubzero
- License: other
- Created: 2026-04-20T22:38:03.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-04T19:47:04.000Z (about 1 month ago)
- Last Synced: 2026-05-04T21:33:13.470Z (about 1 month ago)
- Language: Python
- Size: 1.11 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Multi-engine accessibility scans that survive real crawls.
a11y-catscan crawls a website with Playwright and runs four
accessibility engines — axe-core, Siteimprove Alfa, IBM Equal
Access, and HTML_CodeSniffer — sharing one Chromium instance.
Findings are deduped across engines, streamed to JSONL/HTML/JSON
reports, and exposed as MCP tools so an LLM can analyze them
directly.
**Status: beta.** Production-shaped and currently exercised in
development; the recovery cycle and worker pool work end-to-end on
multi-thousand-page authenticated crawls. Architecture and per-module
design notes live in [DESIGN.md](DESIGN.md). The site handbook is
rendered to GitHub Pages from `docs-src/`; see the
[documentation index](#documentation) below.
## What's shipped
- **Four scan engines.** axe-core (Deque), Siteimprove Alfa
(ACT-rules native), IBM Equal Access, HTML_CodeSniffer. Run
one or combine them — `--engine axe,alfa,ibm,htmlcs` — all
sharing one Chromium so a multi-engine scan isn't 4× the
page loads. Each finding carries an `engine` attribution.
- **Cross-engine dedup.** Findings sharing
`(selector, primary-tag, outcome)` collapse into one entry
with `engines: {axe: ..., ibm: ...}` and per-engine impact
upgraded to the worst severity. EARL outcomes
(`failed` / `cantTell` / `passed` / `inapplicable`) are the
internal vocabulary.
- **Streaming reports.** JSONL is written one page per line so
memory stays flat across 5000-page crawls; HTML and the
LLM-friendly markdown summary stream from disk on demand.
- **Sliding-window async crawler.** N-worker pool with one
Chromium, periodic browser restart for memory hygiene
(`restart_every`), atomic state save (`--resume`), graceful
shutdown on SIGTERM/SIGINT, on-demand snapshot via SIGUSR1.
- **Authenticated scans with mid-scan session recovery.** A
Python login plugin authenticates once, the saved session
state shortcuts subsequent starts, and if the session expires
mid-crawl the scanner drains workers, re-logs-in, bans
detected logout-trap URLs, and resumes. Persistent re-login
failure trips a circuit breaker so the crawl exits instead
of looping.
- **Allowlist with engine + outcome filters.** YAML allowlist
suppresses known-acceptable findings by rule, URL, target,
engine, and outcome — all AND'd. O(1) average lookup via a
rule-id index.
- **MCP server.** `--mcp` exposes
`scan_page` / `analyze_report` / `find_issues` / `check_page`
/ `compare_scans` / `manage_scans` / `lookup_wcag` /
`list_engines` as Claude Code tools. URL-scheme validated to
http(s).
- **Diff and rescan workflows.** `--diff PREV.jsonl` shows
fixed/new/remaining findings; `--rescan PREV.jsonl` re-scans
only pages that previously had issues; `--violations-from`
/ `--incompletes-from` extract specific URL sets from prior
reports.
- **Group-by analysis.** `--group-by {rule, selector, color,
reason, wcag, level, engine, bp}` prints a sorted summary
with per-group page counts and one example.
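- **Niceness + OOM-resistance.** Defaults to `nice 10` and
`oom_score_adj=1000`, so the scanner yields CPU to other work and
is the first process the kernel OOM killer picks under memory
pressure, instead of starving production services on shared hosts.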
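The cross-engine dedup rule described above can be sketched in a few lines of Python. This is an illustration only, not the project's implementation: the field names (`selector`, `tag`, `outcome`, `impact`, `engine`, `rule`) and the severity ordering are assumptions chosen for the example.

```python
# Assumed severity scale, least to most severe (hypothetical for this sketch).
SEVERITY = ["minor", "moderate", "serious", "critical"]


def dedup(findings):
    """Collapse findings that share (selector, primary tag, outcome)."""
    merged = {}
    for f in findings:
        key = (f["selector"], f["tag"], f["outcome"])
        entry = merged.get(key)
        if entry is None:
            # First engine to report this key creates the merged entry.
            entry = {
                "selector": f["selector"],
                "tag": f["tag"],
                "outcome": f["outcome"],
                "impact": f["impact"],
                "engines": {},
            }
            merged[key] = entry
        # Record which engine reported it and under which rule id.
        entry["engines"][f["engine"]] = f["rule"]
        # Upgrade the merged impact to the worst severity seen so far.
        if SEVERITY.index(f["impact"]) > SEVERITY.index(entry["impact"]):
            entry["impact"] = f["impact"]
    return list(merged.values())


if __name__ == "__main__":
    sample = [
        {"engine": "axe", "rule": "rule-a", "selector": "#nav a",
         "tag": "sc-1.4.3", "outcome": "failed", "impact": "serious"},
        {"engine": "ibm", "rule": "rule-b", "selector": "#nav a",
         "tag": "sc-1.4.3", "outcome": "failed", "impact": "critical"},
    ]
    for entry in dedup(sample):
        print(entry["selector"], entry["impact"], sorted(entry["engines"]))
```

Keying on a tuple keeps the merge a single dictionary lookup per finding, which matters when several engines report on thousands of pages.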
## Quick start
Requires Python 3.12 and Node.js 18+.
```sh
pip install -e . # installs playwright, pyyaml, mcp
playwright install chromium
npm install # bundles the four engines
```
Scan one URL:
```sh
./a11y-catscan.py --page https://example.com/
```
Crawl with all four engines and write an LLM-friendly report:
```sh
./a11y-catscan.py --engine all --max-pages 500 --llm \
https://example.com/
```
Compare against last week's baseline:
```sh
./a11y-catscan.py --diff baseline.jsonl --max-pages 500 \
https://example.com/
```
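Conceptually, a fixed/new/remaining comparison is a set difference over finding identities. The sketch below shows that idea against two JSONL files with one page per line. The record layout (`url`, `findings`, `rule`, `selector`) is a hypothetical schema for illustration; the actual report format is documented in `docs-src/reports.md`, and the tool's own `--diff` logic may differ.

```python
import json


def load_findings(path):
    """Stream a JSONL report and return a set of (url, rule, selector) keys."""
    keys = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:  # one page per line, so memory stays flat
            line = line.strip()
            if not line:
                continue
            page = json.loads(line)
            # "findings", "rule" and "selector" are assumed field names.
            for finding in page.get("findings", []):
                keys.add((page["url"], finding["rule"], finding["selector"]))
    return keys


def diff_reports(prev_path, curr_path):
    """Classify findings as fixed, new, or remaining between two scans."""
    prev = load_findings(prev_path)
    curr = load_findings(curr_path)
    return {
        "fixed": prev - curr,      # present before, gone now
        "new": curr - prev,        # absent before, present now
        "remaining": prev & curr,  # present in both scans
    }
```

Calling `diff_reports("baseline.jsonl", "current.jsonl")` would return three sets whose sizes give the headline counts for a regression check.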
Full setup walkthrough in [`docs-src/getting-started.md`](docs-src/getting-started.md).
## Documentation
Site handbook (rendered to
[hubzero.github.io/a11y-catscan](https://hubzero.github.io/a11y-catscan/)
from these sources):
| Topic | Source |
|---|---|
| Getting started — install, first scan, exit codes | [`docs-src/getting-started.md`](docs-src/getting-started.md) |
| Configuration — every YAML setting + CLI override | [`docs-src/configuration.md`](docs-src/configuration.md) |
| Scan workflows — crawl, page, urls, rescan, diff, resume | [`docs-src/scan-workflows.md`](docs-src/scan-workflows.md) |
| Reports — JSON, JSONL, HTML, LLM markdown formats | [`docs-src/reports.md`](docs-src/reports.md) |
| Authentication — login plugin, session recovery, logout traps | [`docs-src/authentication.md`](docs-src/authentication.md) |
| MCP server — tool surface for Claude Code | [`docs-src/mcp.md`](docs-src/mcp.md) |
| Troubleshooting | [`docs-src/troubleshooting.md`](docs-src/troubleshooting.md) |
| FAQ | [`docs-src/faq.md`](docs-src/faq.md) |
Internal references:
- [DESIGN.md](DESIGN.md) — current-state design specification
- [CHANGELOG.md](CHANGELOG.md) — date-organized log of changes
## Engines
| Engine | Flag | Type | License |
|---|---|---|---|
| [axe-core](https://github.com/dequelabs/axe-core) (Deque) | `--engine axe` | Browser injection (default) | MPL-2.0 |
| [Siteimprove Alfa](https://github.com/Siteimprove/alfa) | `--engine alfa` | Node.js subprocess via CDP | MIT |
| [IBM Equal Access](https://github.com/IBMa/equal-access) | `--engine ibm` | Browser injection | Apache-2.0 |
| [HTML_CodeSniffer](https://github.com/squizlabs/HTML_CodeSniffer) | `--engine htmlcs` | Browser injection | BSD-3 |
`--engine all` runs all four; with an explicit list, engines that
aren't listed are skipped. axe-core, IBM, and HTML_CodeSniffer inject
JavaScript into the live page and run in-browser. Alfa's TypeScript
engine runs as a Node.js subprocess and connects to the shared
Chromium via CDP, so there is no second page load.
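The selection rule for `--engine` values can be pictured as follows. This is a minimal sketch assuming a comma-separated value, `all` as shorthand for every engine, and axe as the default (as the table indicates); `parse_engines` is a hypothetical helper, not a function from the codebase.

```python
# Engine names as they appear in the table above.
ENGINES = ("axe", "alfa", "ibm", "htmlcs")


def parse_engines(value="axe"):
    """Turn an --engine style value into an ordered list of engine names."""
    if value.strip().lower() == "all":
        return list(ENGINES)
    requested = [name.strip().lower() for name in value.split(",") if name.strip()]
    unknown = [name for name in requested if name not in ENGINES]
    if unknown:
        raise ValueError(f"unknown engine(s): {', '.join(unknown)}")
    # Keep canonical order and drop duplicates; unlisted engines are skipped.
    return [name for name in ENGINES if name in requested]
```

Rejecting unknown names early, as this sketch does, is one possible design; a real CLI might instead warn and continue.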
## Local development
The full test suite runs against the bundled fixtures:
```sh
pip install -e '.[dev]'
pytest # 368 tests, ~70s with browser
pytest -m "not browser" # 285 fast tests, <10s
```
Coverage is configured in `pyproject.toml`; see
[`tests/`](tests/) for the layout (`test_engine_normalizers.py`,
`test_crawl_loop.py`, `test_mcp_tools.py`, etc.).
## License
MIT. See [LICENSE](LICENSE).
Engine licenses: axe-core (MPL-2.0), Siteimprove Alfa (MIT),
IBM Equal Access (Apache-2.0), HTML_CodeSniffer (BSD-3). The
four engines are vendored via npm and ship under their own
licenses; this repo wraps them.