https://github.com/Dicklesworthstone/markdown_web_browser
Renders any URL via headless Chrome, tiles screenshots into OCR slices, and streams structured Markdown + provenance back to AI agents and pipelines
https://github.com/Dicklesworthstone/markdown_web_browser
ai-agents markdown ocr python web-scraping
Last synced: 1 day ago
JSON representation
Renders any URL via headless Chrome, tiles screenshots into OCR slices, and streams structured Markdown + provenance back to AI agents and pipelines
- Host: GitHub
- URL: https://github.com/Dicklesworthstone/markdown_web_browser
- Owner: Dicklesworthstone
- License: other
- Created: 2025-11-07T17:30:25.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-07-02T02:16:58.000Z (3 days ago)
- Last Synced: 2026-07-02T04:15:25.742Z (3 days ago)
- Topics: ai-agents, markdown, ocr, python, web-scraping
- Language: Python
- Size: 6.62 MB
- Stars: 170
- Watchers: 2
- Forks: 27
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Agents: AGENTS.md
Awesome Lists containing this project
README
# Markdown Web Browser πβπ
> **Transform any website into clean, proveable Markdown with full OCR accuracy**
Render any URL with deterministic Chrome-for-Testing, tile screenshots into OCR-friendly slices, and stream structured Markdown + provenance back to AI agents, web apps, and automation pipelines.
**Two ways to use it:**
1. **π Browser UI** (`/browser`) - Interactive web browsing with navigation history, address bar, and dual markdown views
2. **βοΈ CLI + API** - Programmatic capture for automation, batch processing, and agent workflows
## π― Example: finviz.com (Financial Market Dashboard)
**The Challenge:** Finviz.com is protected by **Cloudflare bot detection** that blocks traditional web scrapers. Our system bypasses this with comprehensive stealth techniques.
### BEFORE: Original Website

### AFTER: Clean Markdown (Two Views)
**Rendered Markdown** - Beautiful GitHub-styled formatting:

**Raw Markdown** - Syntax-highlighted source with full provenance:

### Sample Output (truncated - actual output is 223 lines)
```markdown
This screenshot from Finviz shows a financial visualization dashboard with various stock market indices and tickers, as well as a color-coded sector heatmap. Here's a breakdown of the main sections:
1. **Indices and Charts:**
- **DOW:** Nov 7, +74.80 (0.16%), 46987.1
- **NASDAQ:** Nov 7, -49.46 (0.21%), 23004.5
- **S&P 500:** Nov 7, +8.48 (0.13%), 6728.80
2. **Advancing vs Declining Stocks:**
- Advancing: 56.0% (3116)
- Declining: 40.5% (2254)
- New High: 19.5% (110)
- New Low: 19.5% (110)
3. **Top Gainers and Top Losers:**
- Top Gainers:
- MSGM: 70.78%
- BKYI: 51.57%
- GIFJ: 49.68%
- ORGO: 44.73%
- Top Losers:
- DTCK: -77.93%
- ENGS: -54.55%
- ELDN: -49.76%
- MEHA: -46.93%
4. **Sector Heatmap:**
- Shows color-coded sectors such as Technology, Consumer Cyclical, Communication Services, Industrials, etc.
5. **Headlines:**
- 05:15PM What private data says about America's job engine
- 03:39PM The 'buy everything' rally now feels like an uphill battle, putting bull market to the test
- Nov-07 Stock Market News, Nov. 7, 2025: Nasdaq Has Its Worst Week Since April
### Insider Trading:
- OSIS: Morben Paul Keith (PRES., OPTOELEC) sold 416 shares at $279.10, valued at $116,106.
- OSIS: HAWKINS JAMES B (Director) sold 1,500 shares at $283.15, valued at $424,725.
- EL: Leonard A. Lauder 20 (10% Owner) sold 2,786,040 shares at $89.70, valued at $249,907,788.
### Futures Prices:
- Crude Oil: Last 59.78, Change +0.03 (+0.05%)
- Natural Gas: Last 4.4530, Change -0.1380 (+3.20%)
- Gold: Last 4019.50, Change +9.70 (+0.24%)
- Dow: Last 47250.00, Change +165.00 (+0.35%)
- S&P 500: Last 6786.50, Change +32.75 (+0.48%)
- Nasdaq 100: Last 25345.50, Change +179.25 (+0.71%)
```
### What makes this work:
- β
**Bypasses Cloudflare bot detection** - Chrome's `--headless=new` mode (undetectable) + 60+ lines of stealth JavaScript
- β
**Comprehensive fingerprint masking** - navigator.webdriver, plugins, permissions API, hardware specs
- β
**Captures 4 tiles** with overlapping regions for seamless stitching
- β
**95%+ OCR accuracy** - Extracts all stock tickers, prices, and percentages
- β
**Full provenance** - Every section links back to exact pixel coordinates
- β
**Works on protected sites** - finviz.com (Cloudflare), financial dashboards, SPAs
## π Browser UI - Interactive Web Browsing
**NEW:** A complete browser-like interface for viewing web pages as clean, readable markdown in real-time.
### Access the Browser
Once installed, navigate to:
```
http://localhost:8000/browser
```
### Features
- **π Smart Address Bar**: Enter any URL or search term (auto-detects and searches Google)
- **β¬
οΈβ‘οΈ Navigation History**: Back/forward buttons with full browsing history
- **π Refresh**: Force reload to bypass cache
- **ποΈ Dual View Modes**:
- **Rendered**: Beautiful GitHub-styled markdown with proper formatting
- **Raw**: Syntax-highlighted markdown source (Prism.js)
- **β‘ Smart Caching**: Pages cached for 1 hour for instant repeat visits
- **π Real-time Progress**: Live tile processing updates with progress bars
- **β¨οΈ Keyboard Shortcuts**:
- `Alt+Left/Right` - Navigate back/forward
- `Ctrl+R` - Refresh page
- `Ctrl+U` - Toggle rendered/raw view
- `Ctrl+L` - Focus address bar
### How It Works
1. Enter a URL (e.g., `https://news.ycombinator.com`) or search term
2. Backend captures page as tiled screenshots via headless Chrome
3. OCR extracts text from each tile with 95%+ accuracy
4. Markdown is stitched together with provenance tracking
5. View rendered markdown OR syntax-highlighted raw source
6. Navigate back/forward like a real browser
**Perfect for:**
- Reading web content without distractions
- AI agents browsing websites without vision models
- Archiving web pages as clean markdown
- Research with full history and navigation
π **Full documentation**: See `docs/BROWSER_UI.md` for detailed features, keyboard shortcuts, and troubleshooting.
---
## π Why This Matters
### π€ Perfect for AI Agents
- **Deterministic output**: Same input = same markdown every time
- **Verifiable provenance**: Every sentence links back to exact pixel coordinates
- **Rich metadata**: Links, headings, tables extracted from both DOM and visuals
- **OCR + DOM fusion**: Catches content missed by traditional scrapers
### π vs. Traditional Solutions
| Method | Visual Accuracy | Provenance | Deterministic | Complex Layouts |
|--------|----------------|-------------|---------------|-----------------|
| **Markdown Web Browser** | β
95%+ | β
Pixel-level | β
Chrome-for-Testing | β
OCR + DOM |
| Puppeteer + Readability | β 60% | β None | β οΈ Browser variance | β DOM-only |
| BeautifulSoup | β 40% | β None | β
Yes | β No visuals |
| Selenium screenshots | β
90% | β None | β Driver variance | β οΈ Manual OCR |
### π― Real-World Use Cases
**AI Research & Analysis**
- Process 10,000+ financial reports/day with 95% accuracy
- Extract data from PDFs, SPAs, and interactive dashboards
- Archive regulatory filings with full audit trails
**Content Intelligence**
- Monitor competitor websites with pixel-perfect change detection
- Extract structured data from news sites, forums, and social platforms
- Generate documentation from live web applications
**Compliance & Legal**
- Create admissible evidence with cryptographic provenance
- Archive website states for regulatory submissions
- Track website changes with timestamped, verifiable records
## π Quick Install (One Command)
Get started in under 2 minutes with our automated installer:
```bash
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- --yes
```
**What this installer does for you:**
1. **Checks your system** - Detects your OS (Ubuntu/Debian, macOS, RHEL, Arch)
2. **Installs uv package manager** - The modern Python package manager from Astral
3. **Installs system dependencies** - Automatically installs libvips (image processing library)
4. **Clones the repository** - Downloads the latest Markdown Web Browser code
5. **Sets up Python 3.13 environment** - Creates isolated virtual environment with all dependencies
6. **Installs Playwright browsers** - Downloads Chrome for Testing with bot detection evasion built-in
7. **Configures environment** - Sets up `.env` file with default settings
8. **Runs verification tests** - Ensures everything is working correctly
9. **Creates launcher script** - Provides a convenient `mdwb` command for CLI usage
For interactive installation or custom options:
```bash
# Interactive mode (prompts for each step)
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash
# Custom directory with OCR API key
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- \
--dir=/opt/mdwb --ocr-key=sk-YOUR-API-KEY
# See all options
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- --help
```
## Why it exists
- **Screenshot-first:** Captures exactly what users seeβno PDF/print CSS surprises.
- **Deterministic + auditable:** Every run emits tiles, `out.md`, `links.json`, and `manifest.json` (with CfT label/build, Playwright version, screenshot style hash, warnings, and timings).
- **Agent-friendly extras:** DOM-derived `links.json`, sqlite-vec embeddings, SSE/NDJSON feeds, and CLI helpers so builders can consume Markdown immediately.
- **Ops-ready:** Python 3.13 + FastAPI + Playwright with uv packaging, structured settings via `python-decouple`, telemetry hooks, and smoke/latency automation.
## Architecture at a glance
1. **User Interface Layer**:
- Browser UI (`/browser`) - Interactive browsing with history, view toggling, and real-time progress
- Job Dashboard (`/`) - HTMX-based monitoring with SSE live updates
- CLI + API - Programmatic access for automation and agents
2. FastAPI `/jobs` endpoint enqueues a capture via the `JobManager`.
3. Playwright (Chromium CfT, viewport 1280Γ2000, DPR 2, reduced motion) performs a deterministic viewport sweep.
4. `pyvips` slices sweeps into β€1288 px tiles with β120 px overlap; each tile carries offsets, DPR, hashes.
5. The OCR client submits tiles (HTTP/2) to hosted or local olmOCR, with retries + concurrency autotune.
6. Stitcher merges Markdown, aligns headings with the DOM outline, trims overlaps via SSIM + fuzzy text comparisons, injects provenance comments (with tile metadata + highlight links), and builds the Links Appendix.
7. `Store` writes artifacts under a content-addressed path and updates sqlite + sqlite-vec metadata for embeddings search.
8. `/jobs/{id}`, `/jobs/{id}/stream`, `/jobs/{id}/events`, `/jobs/{id}/links.json`, etc., feed all UI layers (Browser UI, Dashboard, CLI) with consistent data.
9. The job dashboard relies on the HTMX SSE extension for real-time updates (state, manifest, warning pills), while the Browser UI uses JavaScript polling for simplicity.
See `PLAN_TO_IMPLEMENT_MARKDOWN_WEB_BROWSER_PROJECT.md` Β§Β§2β5, 19 for the full breakdown.
## π― Your First Successful Capture (5 minutes)
### Option A: Browser UI (Easiest)
1. **Start the server:**
```bash
mdwb demo stream
```
2. **Open your browser:**
```
http://localhost:8000/browser
```
3. **Enter a URL or search term:**
- Try: `https://example.com` or just search `markdown tutorial`
- Watch the progress bar as tiles are processed
- Toggle between rendered and raw markdown views
**β
Success indicators:**
- Clean markdown appears in ~30-60 seconds
- Navigation buttons work (back/forward)
- Toggle switches between rendered/raw views
### Option B: CLI (For Automation)
**Step 1: Verify Setup**
```bash
# Test the installation
mdwb demo stream
```
**β
Success indicators:**
- Fake job runs with progress bars
- No import or dependency errors
- Server responds on localhost:8000
**Tip:** Any `mdwb` CLI command that supports `--json` also accepts `--format toon` for TOON output (falls back to JSON if `tru` is unavailable).
**Step 2: Capture a Real Page**
```bash
# Start with a simple page
mdwb fetch https://example.com --watch
```
**β
What you should see:**
```
π Job abc123 submitted successfully
πΈ Screenshots: ββββββββββββ 100% (2/2 tiles)
π€ OCR Processing: ββββββββββββ 100% (completed in 12.4s)
π§΅ Stitching: ββββββββββββ 100% (completed in 0.3s)
β
Job completed successfully in 15.2s
π Markdown saved to: /cache/example.com/abc123/out.md
π Links extracted: /cache/example.com/abc123/links.json
π Full manifest: /cache/example.com/abc123/manifest.json
```
**Step 3: Validate Output Quality**
```bash
# Check the generated markdown
cat /cache/example.com/abc123/out.md
# Verify provenance comments are included
grep "source: tile_" /cache/example.com/abc123/out.md
# Check extracted links
cat /cache/example.com/abc123/links.json | jq '.anchors | length'
```
**β
Quality indicators:**
- Markdown contains readable text (not OCR gibberish)
- Provenance comments show ``
- Links.json contains discovered anchors
- No "ERROR" or "FAILED" in manifest.json
**Step 4: Try a Complex Page with Bot Detection**
```bash
# Test with a real-world site that has Cloudflare protection
mdwb fetch https://finviz.com --watch
```
---
## Manual Installation
1. **Install prerequisites**
- Python 3.13, uv β₯0.8, and the system deps Playwright requires.
- Install system dependencies: `sudo apt-get install libvips-dev` (Ubuntu/Debian) or `brew install vips` (macOS).
- Install the CfT build Playwright expects: `playwright install chromium --with-deps --channel=cft`.
- Create/sync the env: `uv venv --python 3.13 && uv sync`.
- Optional (GPU/olmOCR power users): run `scripts/setup_olmocr_cuda12.sh` to provision CUDAΒ 12.6 + the local vLLM toolchain described in `docs/olmocr_cli_tool_documentation.md`.
2. **Configure environment**
- Copy `.env.example` β `.env`.
- Fill in OCR creds, `API_BASE_URL`, CfT label/build, screenshot style hash overrides, webhook secret, etc.
- Settings are loaded exclusively via `python-decouple` (`app/settings.py`), so keep `.env` private.
3. **Run the API/UI**
- `scripts/dev_run.sh` (defaults to uvicorn with reload). Open `http://localhost:8000` for the HTMX/Alpine interface.
- For production-style smoke, flip to Granian: `SERVER_IMPL=granian UVICORN_RELOAD=false HOST=0.0.0.0 PORT=8000 scripts/dev_run.sh --workers 4 --granian-runtime-threads 2`. This wraps `scripts/run_server.py`, so the same flags work in CI or systemd units.
4. **Trigger a capture**
- UI Run button posts `/jobs`.
- CLI example: `uv run python scripts/mdwb_cli.py fetch https://example.com --watch`
### Persistent Chromium profiles
- The UI profile dropdown and CLI `--profile ` flag reuse login/storage state under `CACHE_ROOT/profiles//storage_state.json`. Pick distinct IDs for red/blue teams or authenticated personas.
- Profiles are recorded in `manifest.profile_id`, surfaced via `/jobs/{id}`/SSE/CLI diagnostics, and stored in `runs.db` so ops can audit which captures used which credentials.
- Storage directories are slugged automatically (`[A-Za-z0-9._-]`), so feel free to pass human-friendly names (e.g., `agent.alpha`).
### Links tab (domain grouping + actions)
- Links now stream into domain-grouped sections so it is easy to scan anchors/forms per host (relative URLs and fragments fall into `(relative)` / `(fragment)` buckets).
- Coverage badges highlight whether a link came from the DOM, OCR, or both, and raise warnings for text mismatches; attribute badges summarize `target`/`rel` metadata, which is useful when triaging overlays or sandbox issues.
- Each row exposes inline actions:
- **Open in new job** populates the toolbar URL field and immediately triggers a capture run.
- **Copy Markdown** copies the Markdown anchor (or best-effort fallback) to the clipboard.
- **Mark crawled** toggles a local badge + dimmed state so agents can keep track of which URLs they have already followed; the selection persists in `localStorage`.
### OCR concurrency autotune
- The OCR client now starts at `OCR_MIN_CONCURRENCY` and automatically scales up toward `OCR_MAX_CONCURRENCY` when latency is healthy, or throttles when responses turn slow/errored. The live Events tab and Manifest view both stream these adjustments so you can see when the controller steps in.
- Manifests (`ocr_autotune`) and CLI commands (`mdwb diag`, `mdwb jobs ocr-metrics`) include the initial/peak/final limits plus a short history of adjustments. Use `MDWB_SERVER_IMPL=granian` + higher worker counts when you want the autotune headroom to matter.
### Cache reuse
- `POST /jobs` now deduplicates captures using a content-address (`url + CfT + viewport + DSF + OCR model + profile`). By default the CLI enables this, so identical requests return immediately with `cache_hit=true` and reuse existing artifacts.
- Disable reuse with `mdwb fetch --no-cache` (or `reuse_cache=false` in the API payload) when you need a fresh capture even if nothing changed.
- Manifests, `/jobs/{id}` snapshots, SSE logs, and `mdwb diag` all expose `cache_hit` so downstream tooling can tell whether a job ran or reused cached output.
## CLI cheatsheet (`scripts/mdwb_cli.py`)
- `fetch [--watch]` β enqueue + optionally stream Markdown as tiles finish (percent/ETA shown unless `--no-progress`; add `--reuse-session` to keep one HTTP/2 client alive across submit + stream).
- `fetch --no-cache` β force a fresh capture even if an identical cache entry exists.
- `fetch --resume [--resume-root path]` β skip URLs already recorded in `done_flags/` (optionally `work_index_list.csv.zst`) under the chosen root; the CLI auto-enables `--watch` so completed jobs write their flag/index entries. Override locations via `--resume-index/--resume-done-dir`.
- `fetch --webhook-url https://... [--webhook-event DONE --webhook-event FAILED]` β register callbacks right after the job is created.
- `show [--ocr-metrics]` β dump the latest job snapshot, optionally with OCR batch/quota telemetry.
- `stream ` β follow the SSE feed.
- `watch ` / `events --follow --since ` β tail the `/jobs/{id}/events` NDJSON log (use `--on EVENT=COMMAND` for hooks; add `--no-progress` to suppress the percent/ETA overlay, `--reuse-session` to keep a single HTTP client). DOM-assist events now print counts/reasons so you immediately see when hybrid recovery patched a tile.
- `diag ` β print CfT/Playwright metadata, capture/OCR timings, warnings, and blocklist hits for incident triage.
- `jobs replay manifest ` β resubmit a stored manifest via `/replay` with validation/JSON output support.
- `jobs embeddings search --vector-file vector.json [--top-k 5]` β search sqlite-vec section embeddings for a run (supports inline `--vector` strings and `--json` output).
- `jobs agents bead-summary ` β convert a markdown checklist into bead-ready summaries (mirrors the intra-agent tracker described in PLAN Β§21).
- `warnings --count 50` β tail `ops/warnings.jsonl` for capture/blocklist incidents.
- `dom links --job-id ` β render the stored `links.json` (anchors/forms/headings/meta).
- `jobs ocr-metrics [--json]` β summarize OCR batch latency, request IDs, and quota usage from the manifest.
- `resume status --root path [--limit 10 --pending --json]` β inspect the resume state; `--pending` shows outstanding URLs, `--json` emits `completed_entries` + `pending_entries` for automation.
- `demo snapshot|stream|events` β exercise the demo endpoints without hitting a live pipeline.
The CLI reads `API_BASE_URL` + `MDWB_API_KEY` from `.env`; override with `--api-base` when targeting staging. For CUDA/vLLM workflows, see `docs/olmocr_cli_tool_documentation.md` and `docs/olmocr_cli_integration.md` for detailed setup + merge notes.
## Agent starter scripts (`scripts/agents/`)
- `uv run python -m scripts.agents.summarize_article summarize --url https://example.com [--out summary.txt]` β submit (or reuse via `--job-id`) and print/save a short summary (defaults to `--reuse-session`).
- `uv run python -m scripts.agents.generate_todos todos --job-id [--json] [--out todos.json]` β extract TODO-style bullets (JSON when `--json`, newline text otherwise); accepts `--url` to run a fresh capture and also defaults to `--reuse-session`.
Both helpers reuse the CLIβs auth + HTTP plumbing, accept the same `--api-base/--http2` flags, fall back to existing jobs when you only need post-processing, and now support `--out` so automations can ingest the results directly.
## Prerequisites & environment
- **Chrome for Testing pin:** Set `CFT_VERSION` + `CFT_LABEL` in `.env` so manifests and ops dashboards stay consistent. Re-run `playwright install` whenever the label/build changes.
- **Transport + viewport:** Defaults (`PLAYWRIGHT_TRANSPORT=cdp`, viewport 1280Γ2000, DPR 2) live in `app/settings.py` and must align with PLAN Β§Β§3, 19.
- **OCR credentials:** `OLMOCR_SERVER`, `OLMOCR_API_KEY`, and `OLMOCR_MODEL` are required unless you point at `OCR_LOCAL_URL`.
- **Warning log + blocklist:** Keep `WARNING_LOG_PATH` and `BLOCKLIST_PATH` writable so scroll/overlay incidents are persisted (`docs/config.md` documents every field).
- **System packages:** Install libvips 8.15+ so the pyvips-based tiler works (`sudo apt-get install libvips` on Debian/Ubuntu, `brew install vips` on macOS). `scripts/run_checks.sh` checks for `pyvips` and fails fast with install instructions unless you explicitly set `SKIP_LIBVIPS_CHECK=1` (for targeted CLI/unit runs on machines without libvips).
## Testing & quality gates
Run these before pushing or shipping capture-facing changes:
```bash
uv run ruff check --fix --unsafe-fixes
uvx ty check
npx playwright test --config=playwright.config.mjs # or PLAYWRIGHT_BIN=/path/to/playwright-test β¦
```
`./scripts/run_checks.sh` wraps the same sequence for CI. Set `PLAYWRIGHT_BIN=/path/to/playwright-test`
if you need to invoke the Node-based runner; otherwise the script prefers `npx playwright test --config=playwright.config.mjs`
(which inherits the defaults from PLAN/AGENTS: viewport 1280Γ2000, DPR 2, reduced motion, light scheme, mask selectors, CDP/BiDi transport via `PLAYWRIGHT_TRANSPORT`). When Node Playwright isnβt installed it falls back to `uv run playwright test` and prints a warning if the Python CLI lacks `test`.
When you already know libvips isnβt available in a minimal container, export `SKIP_LIBVIPS_CHECK=1` to bypass the preflight warning. Optional toggles inside `scripts/run_checks.sh`:
- `MDWB_CHECK_METRICS=1` (optionally `CHECK_METRICS_TIMEOUT=`) appends the Prometheus health check after pytest/Playwright.
- `MDWB_RUN_E2E=1` runs the lightweight placeholder suite in `tests/test_e2e_small.py` so CI can keep a fast E2E sentinel without invoking FlowLogger.
- `MDWB_RUN_E2E_RICH=1` runs the full FlowLogger scenarios in `tests/test_e2e_cli.py`; transcript artifacts are copied to `tmp/rich_e2e_cli/` (override via `RICH_E2E_ARTIFACT_DIR=/path/to/dir`) so operators can review the panels/tables/progress output without hunting through pytest temp dirs.
- `MDWB_RUN_E2E_GENERATED=1` runs the generative guardrail suite (`tests/test_e2e_generated.py`). Point `MDWB_GENERATED_E2E_CASES=/path/to/cases.json` at a bespoke cases file when you need to refresh or extend the Markdown baselines.
Grab the resulting `tmp/rich_e2e_cli/*.log|*.html` files in CI for postmortems.
- The bundled pytest targets now include the store/manifest persistence suite (`tests/test_store_manifest.py`, `tests/test_manifest_contract.py`),
the Prometheus CLI health checks (`tests/test_check_metrics.py`), and the ops regressions for `show_latest_smoke`/`update_smoke_pointers` in addition
to the CLI coverage. This keeps RunRecord fields, smoke pointer tooling, and metrics hooks under CI without needing a live API server.
- Playwright defaults to the Chrome for Testing build. Leave `PLAYWRIGHT_CHANNEL` unset (or set it to `cft`) so local smoke runs match the capture pipeline; if you have to fall back to stock Chromium, set `PLAYWRIGHT_CHANNEL=chromium` or use a comma-separated preference such as `PLAYWRIGHT_CHANNEL="chromium,cft"`. Likewise, keep `PLAYWRIGHT_TRANSPORT=cdp` unless you are explicitly exercising WebDriver BiDiβwhen you do, a value like `PLAYWRIGHT_TRANSPORT="bidi,cdp"` makes the preferred/fallback order obvious to anyone reading CI metadata.
- Every `run_checks` invocation now emits `tmp/pytest_report.xml` and `tmp/pytest_summary.json`
(override with `PYTEST_JUNIT_PATH`/`PYTEST_SUMMARY_PATH`). The JSON digest lists totals and the first few
failing test names, so CI/Agent Mail can quote failures without re-running pytest.
Also run `uv run python scripts/check_env.py` whenever `.env` changesβCI and nightly smokes depend on it to confirm CfT pins + OCR secrets.
Additional expectations (per PLAN Β§Β§14, 19.10, 22):
- Keep nightly smokes green via `uv run python scripts/run_smoke.py --date $(date -u +%Y-%m-%d)`.
- Refresh `benchmarks/production/weekly_summary.json` (generated automatically by the smoke script) for Monday ops reports.
- Run `uv run python scripts/check_metrics.py --check-weekly` (with the default `benchmarks/production/weekly_summary.json`) before handoff so we fail fast when capture/OCR SLO p99 values exceed their 2Γp95 budgets.
- Tail `ops/warnings.jsonl` or `mdwb warnings` for canvas/video/overlay spikes.
## Day-to-day workflow
- **Reserve + communicate:** Before editing, reserve files and announce the pickup via Agent Mail (cite the bead id). Keep PLAN sections annotated with `_Status β ` entries so the written record matches reality.
- **Track via beads:** Use `bd list/show` to pick the next unblocked issue, add comments for status updates, and close with findings/tests noted.
- **Run the required checks:** `ruff`, `ty`, Playwright smoke, `scripts/check_env.py`, plus any bead-specific tests (e.g., sqlite-vec search or CLI watch). Never skip the capture smoke after touching Playwright/OCR code.
- **Sync docs:** README, PLAN, `docs/config.md`, and `docs/ops.md` must stay consistent; update them alongside code changes so ops can trust the written guidance.
- **Ops handoff:** For capture/OCR fixes, capture job ids + manifest paths in your bead comment and Mail thread so others can reproduce issues quickly.
## Operations & automation
- `scripts/run_smoke.py` β nightly URL set capture + manifest/latency aggregation.
- `scripts/show_latest_smoke.py` β quick pointers to the latest smoke outputs; manifest rows now include overlap ratios, validation failure counts, and seam marker/hash counts so regressions stand out. The `--weekly` view prints seam marker percentiles plus capture/OCR SLO status (p99 vs 2Γp95) using the data generated by the nightly smoke script, and `--slo` renders the aggregated `latest_slo_summary.json` table (counts, p50/p95 capture/OCR/total, budget breaches). It now fails fast when `latest.txt` exists but is empty, so rerun `scripts/update_smoke_pointers.py ` whenever the pointer guard triggers.
- `scripts/olmocr_cli.py` + `docs/olmocr_cli.md` β hosted olmOCR orchestration/diagnostics.
- `scripts/analyze_stitch.py` β lightweight research helper that reads a manifest index and reports seam counts/hash diversity plus hyphen-break dom-assist incidents per run (optionally `--json`). Handy for bd-0jc overlap/SSIM and hyphen guard experiments.
- Weekly seam telemetry β run `uv run python scripts/show_latest_smoke.py --weekly --json` (or parse `benchmarks/production/weekly_summary.json`) to pull `seam_markers.count`/`hashes`/`events` p50/p95 for every category. Feed those numbers straight into Grafana/Prometheus so seam regressions (or fallback spikes) page operators alongside capture/OCR SLO breaches, and archive `weekly_slo.json`/`.prom` for the rolling capture/OCR SLO window.
- `mdwb jobs replay manifest ` β re-run a job with a stored manifest via `POST /replay` (accepts `--api-base`, `--http2`, `--json`); keep `scripts/replay_job.sh` around for legacy automation until everything points at the CLI.
- `mdwb jobs show ` β inspect the latest snapshot plus sweep stats/validation issues in one table. When manifests are missing (cached jobs, trimmed SSE payloads), the CLI still prints stored seam counts (`Seam markers: X (unique hashes: Y)`) so you can spot duplicate sweeps without spelunking manifests. `mdwb diag --ocr-metrics` shows the detailed seam marker table when manifests are available.
- `scripts/update_smoke_pointers.py [--root path]` β refresh `latest_summary.md`, `latest_manifest_index.json`, and `latest_metrics.json` after ad-hoc smoke runs so dashboards point at the right data (defaults to `MDWB_SMOKE_ROOT` unless `--root` is provided; add `--weekly-source` when overriding the rolling summary). The command now computes `latest_slo_summary.json` by default using the manifest index + PLAN Β§22 budget file; pass `--no-compute-slo` to skip or `--budget-file` to point at an alternate budget definition.
- `scripts/check_metrics.py` β ping `/metrics` plus the exporter; supports `--api-base`, `--exporter-url`, `--json`, and now `--check-weekly` (validates `benchmarks/production/weekly_summary.json` so release builds fail fast if the rolling SLOs are blown). When you pass `--check-weekly --json`, the CLI always emits a `weekly` block (status, `summary_path`, `failures`) even if the summary file is missing/unreadable, which makes automation logs self-explanatory. `scripts/prom_scrape_check.py` remains as a compatibility wrapper but simply re-exports the same Typer CLI.
- `scripts/compute_slo.py` β consumes the latest `latest_manifest_index.json` (or any manifest index) plus the benchmark budget file to produce capture/OCR SLO summaries. The CLI writes a JSON report (`--out benchmarks/production/latest_slo_summary.json`) and optionally emits Prometheus textfile metrics via `--prom-output tmp/mdwb_slo.prom`, enabling dashboards/alerts to track per-category p95 totals, breach ratios, and overall SLO status. `scripts/run_smoke.py` invokes this automatically after each smoke run so `benchmarks/production/latest_slo_summary.json` and `latest_slo.prom` stay fresh; rerun manually when you need ad-hoc SLO snapshots.
- Prometheus metrics now cover capture/OCR/stitch durations, warning/blocklist counts, job completions, and SSE heartbeats via `prometheus-fastapi-instrumentator`. Scrape `/metrics` on the API port or hit the background exporter on `PROMETHEUS_PORT` (default 9000); docs/ops.md lists the metric names + alert hooks.
- Set `MDWB_CHECK_METRICS=1` (optionally `CHECK_METRICS_TIMEOUT=`) when running `scripts/run_checks.sh` to include the Prometheus smoke (`scripts/check_metrics.py`) alongside the usual lint/type/pytest/Playwright stack.
### Handy commands
```bash
# Validate env
uv run python scripts/check_env.py
# Run cli demo job
uv run python scripts/mdwb_cli.py demo stream
# Replay an existing manifest
uv run python scripts/mdwb_cli.py jobs replay manifest cache/example.com/.../manifest.json
# Search embeddings for a run (vector as JSON array)
uv run python scripts/mdwb_cli.py jobs embeddings search JOB_ID --vector "[0.12, 0.04, ...]" --top-k 3
# Tail warning log via CLI
uv run python scripts/mdwb_cli.py warnings --count 25
# Download a job's tar bundle (tiles + markdown + manifest)
uv run python scripts/mdwb_cli.py jobs bundle --out path/to/bundle.tar.zst
# Run nightly smoke for docs/articles only (dry run)
uv run python scripts/run_smoke.py --date $(date -u +%Y-%m-%d) --category docs_articles --dry-run
```
## Artifacts you should expect per job
- `artifact/tiles/tile_*.png` β viewport-sweep tiles (β€1288 px long side) with overlap + SHA metadata.
- `/jobs/{id}/artifact/highlight?tile=β¦&y0=β¦&y1=β¦` β quick HTML viewer that overlays the region referenced by each provenance comment (handy for code reviews and incident reports).
- `out.md` β final Markdown with DOM-guided heading normalization plus provenance comments (``) and Links Appendix.
- `links.json` β anchors/forms/headings/meta harvested from the DOM snapshot.
- `manifest.json` β CfT label/build, Playwright version, screenshot style hash, warnings, sweep stats, timings, and the new `seam_marker_events` list whenever seam hints were required to align tiles.
- `dom_snapshot.html` β raw DOM capture used for link diffs and hybrid recovery (when enabled).
- `bundle.tar.zst` β optional tarball for incidents/export (`Store.build_bundle`).
- Markdown output now includes seam markers (``) and enriched provenance comments (`viewport_y`, `overlap_px`, highlight links) plus detailed `` breadcrumbs so reviewers can jump straight to stitched regions.
Use `mdwb jobs bundle β¦` or `mdwb jobs artifacts manifest β¦` (or `/jobs/{id}/artifact/...`) to reproduce a job locally and fetch its artifacts for debugging.
## Communication & task tracking
- **Beads** (`bd ...`) track every feature/bug (map bead IDs to Plan sections in Agent Mail threads).
- **Agent Mail** (MCP) is the coordination channelβreserve files before editing, summarize work in the relevant bead thread, and note Plan updates inline (_see Β§Β§10β11 for example status notes_).
## Further reading
- `AGENTS.md` β ground rules (no destructive git cmds, uv usage, capture policies).
- `PLAN_TO_IMPLEMENT_MARKDOWN_WEB_BROWSER_PROJECT.md` β canonical spec + incremental upgrades.
- `docs/architecture.md` β best practices + data flow diagrams.
- `docs/blocklist.md`, `docs/config.md`, `docs/models.yaml`, `docs/ops.md`, `docs/olmocr_cli.md` β supporting specs.
- `docs/release_checklist.md` β step-by-step release & regression runbook (CfT/Playwright/model toggles, smoke commands, artifact list).
## β FAQ & Troubleshooting
### Common Issues
**Q: Why is OCR slow/failing?**
```bash
# Check OCR quota and performance
mdwb jobs ocr-metrics job123
mdwb warnings --count 20 | grep -i "ocr\|quota"
# Reduce concurrency if hitting rate limits
export OCR_MAX_CONCURRENCY=5
```
**A: Common causes:**
- OCR API rate limiting (reduce `OCR_MAX_CONCURRENCY`)
- Network latency to OCR service (consider local olmOCR)
- Complex images requiring more processing time
**Q: Poor OCR accuracy on my content?**
```bash
# Check image quality in tiles
ls cache/your-site.com/job123/artifact/tiles/
# View highlight links for problematic sections
curl "http://localhost:8000/jobs/job123/artifact/highlight?tile=5&y0=100&y1=200"
```
**A: Optimization strategies:**
- Increase viewport size for better text rendering
- Use authenticated profiles for login-walled content
- Check for overlay/popup interference in manifest warnings
**Q: Missing content compared to browser view?**
```bash
# Check DOM snapshot vs final markdown
curl "http://localhost:8000/jobs/job123/artifact/dom_snapshot.html"
mdwb dom links --job-id job123
```
**A: Common causes:**
- JavaScript-heavy SPAs (content loads after initial render)
- Authentication required (use `--profile` with logged-in state)
- Overlays/popups blocking content (check blocklist configuration)
### Performance Optimization
**Slow jobs:**
1. **Check tile count**: `mdwb diag job123` - High tile counts increase OCR time
2. **Review warnings**: `mdwb warnings` - Canvas/scroll issues affect performance
3. **Monitor concurrency**: OCR auto-tune may be throttling due to latency
**Memory issues:**
1. **Large pages**: Set `TILE_MAX_SIZE` lower to reduce memory per tile
2. **Concurrent jobs**: Limit active jobs with `MAX_ACTIVE_JOBS`
3. **Cache cleanup**: Implement retention policy for old artifacts
**Network problems:**
1. **OCR connectivity**: Test with `curl ${OLMOCR_SERVER}/health`
2. **Firewall issues**: Ensure outbound HTTPS access
3. **Proxy configuration**: Set HTTP_PROXY/HTTPS_PROXY if needed
### Error Code Reference
| Error Code | Meaning | Solution |
|------------|---------|----------|
| `OCR_QUOTA_EXCEEDED` | Hit API rate limits | Wait or increase quota |
| `SCREENSHOT_TIMEOUT` | Page load too slow | Increase timeout, check URL |
| `TILE_PROCESSING_FAILED` | Image processing error | Check libvips installation |
| `MANIFEST_VALIDATION_FAILED` | Corrupt job state | Restart job, check disk space |
| `DOM_SNAPSHOT_FAILED` | Can't save DOM | Check write permissions |