{"id":50897716,"url":"https://github.com/Dicklesworthstone/markdown_web_browser","last_synced_at":"2026-07-03T16:01:20.172Z","repository":{"id":323049737,"uuid":"1091900863","full_name":"Dicklesworthstone/markdown_web_browser","owner":"Dicklesworthstone","description":"Renders any URL via headless Chrome, tiles screenshots into OCR slices, and streams structured Markdown + provenance back to AI agents and pipelines","archived":false,"fork":false,"pushed_at":"2026-07-02T02:16:58.000Z","size":6937,"stargazers_count":170,"open_issues_count":0,"forks_count":27,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-07-02T04:15:25.742Z","etag":null,"topics":["ai-agents","markdown","ocr","python","web-scraping"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Dicklesworthstone.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-11-07T17:30:25.000Z","updated_at":"2026-07-02T02:16:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"fdaa1620-e907-45f7-82b4-7e79db8e8b52","html_url":"https://github.com/Dicklesworthstone/markdown_web_browser","commit_stats":null,"previous_names":["dicklesworthstone/markdown_web_browser"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Dicklesworthstone/markdown_web_browser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dicklesworthstone%2Fmarkdown_web_b
rowser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dicklesworthstone%2Fmarkdown_web_browser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dicklesworthstone%2Fmarkdown_web_browser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dicklesworthstone%2Fmarkdown_web_browser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Dicklesworthstone","download_url":"https://codeload.github.com/Dicklesworthstone/markdown_web_browser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dicklesworthstone%2Fmarkdown_web_browser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35092185,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-03T02:00:05.635Z","response_time":110,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","markdown","ocr","python","web-scraping"],"created_at":"2026-06-16T01:31:30.076Z","updated_at":"2026-07-03T16:01:20.164Z","avatar_url":"https://github.com/Dicklesworthstone.png","language":"Python","funding_links":[],"categories":["ai-agents"],"sub_categories":[],"readme":"# Markdown Web Browser 🌐→📝\n\n\u003e **Transform any website into clean, provable Markdown with full OCR accuracy**\n\nRender any URL with deterministic 
Chrome-for-Testing, tile screenshots into OCR-friendly slices, and stream structured Markdown + provenance back to AI agents, web apps, and automation pipelines.\n\n**Two ways to use it:**\n1. **🌐 Browser UI** (`/browser`) - Interactive web browsing with navigation history, address bar, and dual markdown views\n2. **⚙️ CLI + API** - Programmatic capture for automation, batch processing, and agent workflows\n\n## 🎯 Example: finviz.com (Financial Market Dashboard)\n\n**The Challenge:** Finviz.com is protected by **Cloudflare bot detection** that blocks traditional web scrapers. Our system bypasses this with comprehensive stealth techniques.\n\n### BEFORE: Original Website\n\n\u003cimg src=\"docs/images/finviz_before.png\" alt=\"finviz.com original website\" width=\"800\"\u003e\n\n### AFTER: Clean Markdown (Two Views)\n\n**Rendered Markdown** - Beautiful GitHub-styled formatting:\n\n\u003cimg src=\"docs/images/finviz_after_rendered.png\" alt=\"Rendered markdown view in browser UI\" width=\"800\"\u003e\n\n**Raw Markdown** - Syntax-highlighted source with full provenance:\n\n\u003cimg src=\"docs/images/finviz_after_raw.png\" alt=\"Raw markdown view with syntax highlighting\" width=\"800\"\u003e\n\n### Sample Output (truncated - actual output is 223 lines)\n\n```markdown\n\u003c!-- source: tile_0000, y=0, height=1288, sha256=557720698e6ee5e6474e69abc8305307d9e080198ab89cdccb0f7cfbe5e176dc, scale=0.50, viewport_y=0, overlap_px=120, path=artifact/tiles/tile_0000.png, highlight=/jobs/690fec5fca24499c901305d38bc85b6f/artifact/highlight?tile=artifact%2Ftiles%2Ftile_0000.png\u0026y0=0\u0026y1=1288 --\u003e\n\nThis screenshot from Finviz shows a financial visualization dashboard with various stock market indices and tickers, as well as a color-coded sector heatmap. Here's a breakdown of the main sections:\n\n1. 
**Indices and Charts:**\n   - **DOW:** Nov 7, +74.80 (0.16%), 46987.1\n   - **NASDAQ:** Nov 7, -49.46 (0.21%), 23004.5\n   - **S\u0026P 500:** Nov 7, +8.48 (0.13%), 6728.80\n\n2. **Advancing vs Declining Stocks:**\n   - Advancing: 56.0% (3116)\n   - Declining: 40.5% (2254)\n   - New High: 19.5% (110)\n   - New Low: 19.5% (110)\n\n3. **Top Gainers and Top Losers:**\n   - Top Gainers:\n     - MSGM: 70.78%\n     - BKYI: 51.57%\n     - GIFJ: 49.68%\n     - ORGO: 44.73%\n   - Top Losers:\n     - DTCK: -77.93%\n     - ENGS: -54.55%\n     - ELDN: -49.76%\n     - MEHA: -46.93%\n\n4. **Sector Heatmap:**\n   - Shows color-coded sectors such as Technology, Consumer Cyclical, Communication Services, Industrials, etc.\n\n5. **Headlines:**\n   - 05:15PM What private data says about America's job engine\n   - 03:39PM The 'buy everything' rally now feels like an uphill battle, putting bull market to the test\n   - Nov-07 Stock Market News, Nov. 7, 2025: Nasdaq Has Its Worst Week Since April\n\n\u003c!-- source: tile_0001, y=1440, height=1288, sha256=9a04a7f422964951f8b411e11790ca476389c777614d5085e05008b750eb90bf, scale=0.50, viewport_y=0, overlap_px=120, path=artifact/tiles/tile_0001.png, highlight=/jobs/690fec5fca24499c901305d38bc85b6f/artifact/highlight?tile=artifact%2Ftiles%2Ftile_0001.png\u0026y0=0\u0026y1=1288 --\u003e\n\n### Insider Trading:\n  - OSIS: Morben Paul Keith (PRES., OPTOELEC) sold 416 shares at $279.10, valued at $116,106.\n  - OSIS: HAWKINS JAMES B (Director) sold 1,500 shares at $283.15, valued at $424,725.\n  - EL: Leonard A. Lauder 20 (10% Owner) sold 2,786,040 shares at $89.70, valued at $249,907,788.\n\n### Futures Prices:\n  - Crude Oil: Last 59.78, Change +0.03 (+0.05%)\n  - Natural Gas: Last 4.4530, Change -0.1380 (+3.20%)\n  - Gold: Last 4019.50, Change +9.70 (+0.24%)\n  - Dow: Last 47250.00, Change +165.00 (+0.35%)\n  - S\u0026P 500: Last 6786.50, Change +32.75 (+0.48%)\n  - Nasdaq 100: Last 25345.50, Change +179.25 (+0.71%)\n\n\u003c!-- ... 
2 more tiles with earnings releases, forex data, and full market overview ... --\u003e\n```\n\n### What makes this work:\n- ✅ **Bypasses Cloudflare bot detection** - Chrome's `--headless=new` mode (undetectable) + 60+ lines of stealth JavaScript\n- ✅ **Comprehensive fingerprint masking** - navigator.webdriver, plugins, permissions API, hardware specs\n- ✅ **Captures 4 tiles** with overlapping regions for seamless stitching\n- ✅ **95%+ OCR accuracy** - Extracts all stock tickers, prices, and percentages\n- ✅ **Full provenance** - Every section links back to exact pixel coordinates\n- ✅ **Works on protected sites** - finviz.com (Cloudflare), financial dashboards, SPAs\n\n## 🌐 Browser UI - Interactive Web Browsing\n\n**NEW:** A complete browser-like interface for viewing web pages as clean, readable markdown in real-time.\n\n### Access the Browser\n\nOnce installed, navigate to:\n```\nhttp://localhost:8000/browser\n```\n\n### Features\n\n- **🔍 Smart Address Bar**: Enter any URL or search term (auto-detects and searches Google)\n- **⬅️➡️ Navigation History**: Back/forward buttons with full browsing history\n- **🔄 Refresh**: Force reload to bypass cache\n- **👁️ Dual View Modes**:\n  - **Rendered**: Beautiful GitHub-styled markdown with proper formatting\n  - **Raw**: Syntax-highlighted markdown source (Prism.js)\n- **⚡ Smart Caching**: Pages cached for 1 hour for instant repeat visits\n- **📊 Real-time Progress**: Live tile processing updates with progress bars\n- **⌨️ Keyboard Shortcuts**:\n  - `Alt+Left/Right` - Navigate back/forward\n  - `Ctrl+R` - Refresh page\n  - `Ctrl+U` - Toggle rendered/raw view\n  - `Ctrl+L` - Focus address bar\n\n### How It Works\n\n1. Enter a URL (e.g., `https://news.ycombinator.com`) or search term\n2. Backend captures page as tiled screenshots via headless Chrome\n3. OCR extracts text from each tile with 95%+ accuracy\n4. Markdown is stitched together with provenance tracking\n5. View rendered markdown OR syntax-highlighted raw source\n6. 
Navigate back/forward like a real browser\n\n**Perfect for:**\n- Reading web content without distractions\n- AI agents browsing websites without vision models\n- Archiving web pages as clean markdown\n- Research with full history and navigation\n\n📖 **Full documentation**: See `docs/BROWSER_UI.md` for detailed features, keyboard shortcuts, and troubleshooting.\n\n---\n\n## 🚀 Why This Matters\n\n### 🤖 Perfect for AI Agents\n- **Deterministic output**: Same input = same markdown every time\n- **Verifiable provenance**: Every sentence links back to exact pixel coordinates\n- **Rich metadata**: Links, headings, tables extracted from both DOM and visuals\n- **OCR + DOM fusion**: Catches content missed by traditional scrapers\n\n### 📊 vs. Traditional Solutions\n\n| Method | Visual Accuracy | Provenance | Deterministic | Complex Layouts |\n|--------|----------------|-------------|---------------|-----------------|\n| **Markdown Web Browser** | ✅ 95%+ | ✅ Pixel-level | ✅ Chrome-for-Testing | ✅ OCR + DOM |\n| Puppeteer + Readability | ❌ 60% | ❌ None | ⚠️ Browser variance | ❌ DOM-only |\n| BeautifulSoup | ❌ 40% | ❌ None | ✅ Yes | ❌ No visuals |\n| Selenium screenshots | ✅ 90% | ❌ None | ❌ Driver variance | ⚠️ Manual OCR |\n\n### 🎯 Real-World Use Cases\n\n**AI Research \u0026 Analysis**\n- Process 10,000+ financial reports/day with 95% accuracy\n- Extract data from PDFs, SPAs, and interactive dashboards\n- Archive regulatory filings with full audit trails\n\n**Content Intelligence**\n- Monitor competitor websites with pixel-perfect change detection\n- Extract structured data from news sites, forums, and social platforms\n- Generate documentation from live web applications\n\n**Compliance \u0026 Legal**\n- Create admissible evidence with cryptographic provenance\n- Archive website states for regulatory submissions\n- Track website changes with timestamped, verifiable records\n\n## 🚀 Quick Install (One Command)\n\nGet started in under 2 minutes with our automated 
installer:\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- --yes\n```\n\n**What this installer does for you:**\n\n1. **Checks your system** - Detects your OS (Ubuntu/Debian, macOS, RHEL, Arch)\n2. **Installs uv package manager** - The modern Python package manager from Astral\n3. **Installs system dependencies** - Automatically installs libvips (image processing library)\n4. **Clones the repository** - Downloads the latest Markdown Web Browser code\n5. **Sets up Python 3.13 environment** - Creates isolated virtual environment with all dependencies\n6. **Installs Playwright browsers** - Downloads Chrome for Testing with bot detection evasion built-in\n7. **Configures environment** - Sets up `.env` file with default settings\n8. **Runs verification tests** - Ensures everything is working correctly\n9. **Creates launcher script** - Provides a convenient `mdwb` command for CLI usage\n\nFor interactive installation or custom options:\n```bash\n# Interactive mode (prompts for each step)\ncurl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash\n\n# Custom directory with OCR API key\ncurl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- \\\n  --dir=/opt/mdwb --ocr-key=sk-YOUR-API-KEY\n\n# See all options\ncurl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- --help\n```\n\n## Why it exists\n- **Screenshot-first:** Captures exactly what users see—no PDF/print CSS surprises.\n- **Deterministic + auditable:** Every run emits tiles, `out.md`, `links.json`, and `manifest.json` (with CfT label/build, Playwright version, screenshot style hash, warnings, and timings).\n- **Agent-friendly extras:** DOM-derived `links.json`, sqlite-vec embeddings, SSE/NDJSON feeds, and CLI helpers so builders can consume Markdown immediately.\n- 
**Ops-ready:** Python 3.13 + FastAPI + Playwright with uv packaging, structured settings via `python-decouple`, telemetry hooks, and smoke/latency automation.\n\n## Architecture at a glance\n1. **User Interface Layer**:\n   - Browser UI (`/browser`) - Interactive browsing with history, view toggling, and real-time progress\n   - Job Dashboard (`/`) - HTMX-based monitoring with SSE live updates\n   - CLI + API - Programmatic access for automation and agents\n2. FastAPI `/jobs` endpoint enqueues a capture via the `JobManager`.\n3. Playwright (Chromium CfT, viewport 1280×2000, DPR 2, reduced motion) performs a deterministic viewport sweep.\n4. `pyvips` slices sweeps into ≤1288 px tiles with ≈120 px overlap; each tile carries offsets, DPR, hashes.\n5. The OCR client submits tiles (HTTP/2) to hosted or local olmOCR, with retries + concurrency autotune.\n6. Stitcher merges Markdown, aligns headings with the DOM outline, trims overlaps via SSIM + fuzzy text comparisons, injects provenance comments (with tile metadata + highlight links), and builds the Links Appendix.\n7. `Store` writes artifacts under a content-addressed path and updates sqlite + sqlite-vec metadata for embeddings search.\n8. `/jobs/{id}`, `/jobs/{id}/stream`, `/jobs/{id}/events`, `/jobs/{id}/links.json`, etc., feed all UI layers (Browser UI, Dashboard, CLI) with consistent data.\n9. The job dashboard relies on the HTMX SSE extension for real-time updates (state, manifest, warning pills), while the Browser UI uses JavaScript polling for simplicity.\n\nSee `PLAN_TO_IMPLEMENT_MARKDOWN_WEB_BROWSER_PROJECT.md` §§2–5, 19 for the full breakdown.\n\n## 🎯 Your First Successful Capture (5 minutes)\n\n### Option A: Browser UI (Easiest)\n\n1. **Start the server:**\n   ```bash\n   mdwb demo stream\n   ```\n\n2. **Open your browser:**\n   ```\n   http://localhost:8000/browser\n   ```\n\n3. 
**Enter a URL or search term:**\n   - Try: `https://example.com` or just search `markdown tutorial`\n   - Watch the progress bar as tiles are processed\n   - Toggle between rendered and raw markdown views\n\n**✅ Success indicators:**\n- Clean markdown appears in ~30-60 seconds\n- Navigation buttons work (back/forward)\n- Toggle switches between rendered/raw views\n\n### Option B: CLI (For Automation)\n\n**Step 1: Verify Setup**\n```bash\n# Test the installation\nmdwb demo stream\n```\n**✅ Success indicators:**\n- Fake job runs with progress bars\n- No import or dependency errors\n- Server responds on localhost:8000\n\n**Tip:** Any `mdwb` CLI command that supports `--json` also accepts `--format toon` for TOON output (falls back to JSON if `tru` is unavailable).\n\n**Step 2: Capture a Real Page**\n```bash\n# Start with a simple page\nmdwb fetch https://example.com --watch\n```\n\n**✅ What you should see:**\n```\n🔄 Job abc123 submitted successfully\n📸 Screenshots: ████████████ 100% (2/2 tiles)\n🔤 OCR Processing: ████████████ 100% (completed in 12.4s)\n🧵 Stitching: ████████████ 100% (completed in 0.3s)\n✅ Job completed successfully in 15.2s\n\n📄 Markdown saved to: /cache/example.com/abc123/out.md\n🔗 Links extracted: /cache/example.com/abc123/links.json\n📊 Full manifest: /cache/example.com/abc123/manifest.json\n```\n\n**Step 3: Validate Output Quality**\n```bash\n# Check the generated markdown\ncat /cache/example.com/abc123/out.md\n\n# Verify provenance comments are included\ngrep \"source: tile_\" /cache/example.com/abc123/out.md\n\n# Check extracted links\ncat /cache/example.com/abc123/links.json | jq '.anchors | length'\n```\n\n**✅ Quality indicators:**\n- Markdown contains readable text (not OCR gibberish)\n- Provenance comments show `\u003c!-- source: tile_X --\u003e`\n- Links.json contains discovered anchors\n- No \"ERROR\" or \"FAILED\" in manifest.json\n\n**Step 4: Try a Complex Page with Bot Detection**\n```bash\n# Test with a real-world site that has 
Cloudflare protection\nmdwb fetch https://finviz.com --watch\n```\n\n---\n\n## Manual Installation\n1. **Install prerequisites**\n   - Python 3.13, uv ≥0.8, and the system deps Playwright requires.\n   - Install system dependencies: `sudo apt-get install libvips-dev` (Ubuntu/Debian) or `brew install vips` (macOS).\n   - Install the CfT build Playwright expects: `playwright install chromium --with-deps --channel=cft`.\n   - Create/sync the env: `uv venv --python 3.13 \u0026\u0026 uv sync`.\n   - Optional (GPU/olmOCR power users): run `scripts/setup_olmocr_cuda12.sh` to provision CUDA 12.6 + the local vLLM toolchain described in `docs/olmocr_cli_tool_documentation.md`.\n2. **Configure environment**\n   - Copy `.env.example` → `.env`.\n   - Fill in OCR creds, `API_BASE_URL`, CfT label/build, screenshot style hash overrides, webhook secret, etc.\n   - Settings are loaded exclusively via `python-decouple` (`app/settings.py`), so keep `.env` private.\n3. **Run the API/UI**\n   - `scripts/dev_run.sh` (defaults to uvicorn with reload). Open `http://localhost:8000` for the HTMX/Alpine interface.\n   - For production-style smoke, flip to Granian: `SERVER_IMPL=granian UVICORN_RELOAD=false HOST=0.0.0.0 PORT=8000 scripts/dev_run.sh --workers 4 --granian-runtime-threads 2`. This wraps `scripts/run_server.py`, so the same flags work in CI or systemd units.\n4. **Trigger a capture**\n   - UI Run button posts `/jobs`.\n   - CLI example: `uv run python scripts/mdwb_cli.py fetch https://example.com --watch`\n\n### Persistent Chromium profiles\n- The UI profile dropdown and CLI `--profile \u003cid\u003e` flag reuse login/storage state under `CACHE_ROOT/profiles/\u003cid\u003e/storage_state.json`. 
Pick distinct IDs for red/blue teams or authenticated personas.\n- Profiles are recorded in `manifest.profile_id`, surfaced via `/jobs/{id}`/SSE/CLI diagnostics, and stored in `runs.db` so ops can audit which captures used which credentials.\n- Storage directories are slugged automatically (`[A-Za-z0-9._-]`), so feel free to pass human-friendly names (e.g., `agent.alpha`).\n\n### Links tab (domain grouping + actions)\n- Links now stream into domain-grouped sections so it is easy to scan anchors/forms per host (relative URLs and fragments fall into `(relative)` / `(fragment)` buckets).\n- Coverage badges highlight whether a link came from the DOM, OCR, or both, and raise warnings for text mismatches; attribute badges summarize `target`/`rel` metadata, which is useful when triaging overlays or sandbox issues.\n- Each row exposes inline actions:\n  - **Open in new job** populates the toolbar URL field and immediately triggers a capture run.\n  - **Copy Markdown** copies the Markdown anchor (or best-effort fallback) to the clipboard.\n  - **Mark crawled** toggles a local badge + dimmed state so agents can keep track of which URLs they have already followed; the selection persists in `localStorage`.\n\n### OCR concurrency autotune\n- The OCR client now starts at `OCR_MIN_CONCURRENCY` and automatically scales up toward `OCR_MAX_CONCURRENCY` when latency is healthy, or throttles when responses turn slow/errored. The live Events tab and Manifest view both stream these adjustments so you can see when the controller steps in.\n- Manifests (`ocr_autotune`) and CLI commands (`mdwb diag`, `mdwb jobs ocr-metrics`) include the initial/peak/final limits plus a short history of adjustments. Use `MDWB_SERVER_IMPL=granian` + higher worker counts when you want the autotune headroom to matter.\n\n### Cache reuse\n- `POST /jobs` now deduplicates captures using a content-address (`url + CfT + viewport + DSF + OCR model + profile`). 
By default the CLI enables this, so identical requests return immediately with `cache_hit=true` and reuse existing artifacts.\n- Disable reuse with `mdwb fetch --no-cache` (or `reuse_cache=false` in the API payload) when you need a fresh capture even if nothing changed.\n- Manifests, `/jobs/{id}` snapshots, SSE logs, and `mdwb diag` all expose `cache_hit` so downstream tooling can tell whether a job ran or reused cached output.\n\n## CLI cheatsheet (`scripts/mdwb_cli.py`)\n- `fetch \u003curl\u003e [--watch]` — enqueue + optionally stream Markdown as tiles finish (percent/ETA shown unless `--no-progress`; add `--reuse-session` to keep one HTTP/2 client alive across submit + stream).\n- `fetch \u003curl\u003e --no-cache` — force a fresh capture even if an identical cache entry exists.\n- `fetch \u003curl\u003e --resume [--resume-root path]` — skip URLs already recorded in `done_flags/` (optionally `work_index_list.csv.zst`) under the chosen root; the CLI auto-enables `--watch` so completed jobs write their flag/index entries. Override locations via `--resume-index/--resume-done-dir`.\n- `fetch \u003curl\u003e --webhook-url https://... [--webhook-event DONE --webhook-event FAILED]` — register callbacks right after the job is created.\n- `show \u003cjob-id\u003e [--ocr-metrics]` — dump the latest job snapshot, optionally with OCR batch/quota telemetry.\n- `stream \u003cjob-id\u003e` — follow the SSE feed.\n- `watch \u003cjob-id\u003e` / `events \u003cjob-id\u003e --follow --since \u003cISO\u003e` — tail the `/jobs/{id}/events` NDJSON log (use `--on EVENT=COMMAND` for hooks; add `--no-progress` to suppress the percent/ETA overlay, `--reuse-session` to keep a single HTTP client). 
DOM-assist events now print counts/reasons so you immediately see when hybrid recovery patched a tile.\n- `diag \u003cjob-id\u003e` — print CfT/Playwright metadata, capture/OCR timings, warnings, and blocklist hits for incident triage.\n- `jobs replay manifest \u003cmanifest.json\u003e` — resubmit a stored manifest via `/replay` with validation/JSON output support.\n- `jobs embeddings search \u003cjob-id\u003e --vector-file vector.json [--top-k 5]` — search sqlite-vec section embeddings for a run (supports inline `--vector` strings and `--json` output).\n- `jobs agents bead-summary \u003cplan.md\u003e` — convert a markdown checklist into bead-ready summaries (mirrors the intra-agent tracker described in PLAN §21).\n- `warnings --count 50` — tail `ops/warnings.jsonl` for capture/blocklist incidents.\n- `dom links --job-id \u003cid\u003e` — render the stored `links.json` (anchors/forms/headings/meta).\n- `jobs ocr-metrics \u003cjob-id\u003e [--json]` — summarize OCR batch latency, request IDs, and quota usage from the manifest.\n- `resume status --root path [--limit 10 --pending --json]` — inspect the resume state; `--pending` shows outstanding URLs, `--json` emits `completed_entries` + `pending_entries` for automation.\n- `demo snapshot|stream|events` — exercise the demo endpoints without hitting a live pipeline.\n\nThe CLI reads `API_BASE_URL` + `MDWB_API_KEY` from `.env`; override with `--api-base` when targeting staging. 
For CUDA/vLLM workflows, see `docs/olmocr_cli_tool_documentation.md` and `docs/olmocr_cli_integration.md` for detailed setup + merge notes.\n\n## Agent starter scripts (`scripts/agents/`)\n- `uv run python -m scripts.agents.summarize_article summarize --url https://example.com [--out summary.txt]` — submit (or reuse via `--job-id`) and print/save a short summary (defaults to `--reuse-session`).\n- `uv run python -m scripts.agents.generate_todos todos --job-id \u003cid\u003e [--json] [--out todos.json]` — extract TODO-style bullets (JSON when `--json`, newline text otherwise); accepts `--url` to run a fresh capture and also defaults to `--reuse-session`.\n\nBoth helpers reuse the CLI’s auth + HTTP plumbing, accept the same `--api-base/--http2` flags, fall back to existing jobs when you only need post-processing, and now support `--out` so automations can ingest the results directly.\n\n## Prerequisites \u0026 environment\n- **Chrome for Testing pin:** Set `CFT_VERSION` + `CFT_LABEL` in `.env` so manifests and ops dashboards stay consistent. Re-run `playwright install` whenever the label/build changes.\n- **Transport + viewport:** Defaults (`PLAYWRIGHT_TRANSPORT=cdp`, viewport 1280×2000, DPR 2) live in `app/settings.py` and must align with PLAN §§3, 19.\n- **OCR credentials:** `OLMOCR_SERVER`, `OLMOCR_API_KEY`, and `OLMOCR_MODEL` are required unless you point at `OCR_LOCAL_URL`.\n- **Warning log + blocklist:** Keep `WARNING_LOG_PATH` and `BLOCKLIST_PATH` writable so scroll/overlay incidents are persisted (`docs/config.md` documents every field).\n- **System packages:** Install libvips 8.15+ so the pyvips-based tiler works (`sudo apt-get install libvips` on Debian/Ubuntu, `brew install vips` on macOS). 
`scripts/run_checks.sh` checks for `pyvips` and fails fast with install instructions unless you explicitly set `SKIP_LIBVIPS_CHECK=1` (for targeted CLI/unit runs on machines without libvips).\n\n## Testing \u0026 quality gates\nRun these before pushing or shipping capture-facing changes:\n\n```bash\nuv run ruff check --fix --unsafe-fixes\nuvx ty check\nnpx playwright test --config=playwright.config.mjs  # or PLAYWRIGHT_BIN=/path/to/playwright-test …\n```\n\n`./scripts/run_checks.sh` wraps the same sequence for CI. Set `PLAYWRIGHT_BIN=/path/to/playwright-test`\nif you need to invoke the Node-based runner; otherwise the script prefers `npx playwright test --config=playwright.config.mjs`\n(which inherits the defaults from PLAN/AGENTS: viewport 1280×2000, DPR 2, reduced motion, light scheme, mask selectors, CDP/BiDi transport via `PLAYWRIGHT_TRANSPORT`). When Node Playwright isn’t installed it falls back to `uv run playwright test` and prints a warning if the Python CLI lacks `test`.\nWhen you already know libvips isn’t available in a minimal container, export `SKIP_LIBVIPS_CHECK=1` to bypass the preflight warning. Optional toggles inside `scripts/run_checks.sh`:\n\n- `MDWB_CHECK_METRICS=1` (optionally `CHECK_METRICS_TIMEOUT=\u003cseconds\u003e`) appends the Prometheus health check after pytest/Playwright.\n- `MDWB_RUN_E2E=1` runs the lightweight placeholder suite in `tests/test_e2e_small.py` so CI can keep a fast E2E sentinel without invoking FlowLogger.\n- `MDWB_RUN_E2E_RICH=1` runs the full FlowLogger scenarios in `tests/test_e2e_cli.py`; transcript artifacts are copied to `tmp/rich_e2e_cli/` (override via `RICH_E2E_ARTIFACT_DIR=/path/to/dir`) so operators can review the panels/tables/progress output without hunting through pytest temp dirs.\n- `MDWB_RUN_E2E_GENERATED=1` runs the generative guardrail suite (`tests/test_e2e_generated.py`). 
Point `MDWB_GENERATED_E2E_CASES=/path/to/cases.json` at a bespoke cases file when you need to refresh or extend the Markdown baselines.\n\nGrab the resulting `tmp/rich_e2e_cli/*.log|*.html` files in CI for postmortems.\n\n- The bundled pytest targets now include the store/manifest persistence suite (`tests/test_store_manifest.py`, `tests/test_manifest_contract.py`),\n  the Prometheus CLI health checks (`tests/test_check_metrics.py`), and the ops regressions for `show_latest_smoke`/`update_smoke_pointers` in addition\n  to the CLI coverage. This keeps RunRecord fields, smoke pointer tooling, and metrics hooks under CI without needing a live API server.\n\n- Playwright defaults to the Chrome for Testing build. Leave `PLAYWRIGHT_CHANNEL` unset (or set it to `cft`) so local smoke runs match the capture pipeline; if you have to fall back to stock Chromium, set `PLAYWRIGHT_CHANNEL=chromium` or use a comma-separated preference such as `PLAYWRIGHT_CHANNEL=\"chromium,cft\"`. Likewise, keep `PLAYWRIGHT_TRANSPORT=cdp` unless you are explicitly exercising WebDriver BiDi—when you do, a value like `PLAYWRIGHT_TRANSPORT=\"bidi,cdp\"` makes the preferred/fallback order obvious to anyone reading CI metadata.\n\n- Every `run_checks` invocation now emits `tmp/pytest_report.xml` and `tmp/pytest_summary.json`\n  (override with `PYTEST_JUNIT_PATH`/`PYTEST_SUMMARY_PATH`). 
The JSON digest lists totals and the first few\n  failing test names, so CI/Agent Mail can quote failures without re-running pytest.\n\nAlso run `uv run python scripts/check_env.py` whenever `.env` changes—CI and nightly smokes depend on it to confirm CfT pins + OCR secrets.\n\nAdditional expectations (per PLAN §§14, 19.10, 22):\n- Keep nightly smokes green via `uv run python scripts/run_smoke.py --date $(date -u +%Y-%m-%d)`.\n- Refresh `benchmarks/production/weekly_summary.json` (generated automatically by the smoke script) for Monday ops reports.\n- Run `uv run python scripts/check_metrics.py --check-weekly` (with the default `benchmarks/production/weekly_summary.json`) before handoff so we fail fast when capture/OCR SLO p99 values exceed their 2×p95 budgets.\n- Tail `ops/warnings.jsonl` or `mdwb warnings` for canvas/video/overlay spikes.\n\n## Day-to-day workflow\n- **Reserve + communicate:** Before editing, reserve files and announce the pickup via Agent Mail (cite the bead id). Keep PLAN sections annotated with `_Status — \u003cagent\u003e` entries so the written record matches reality.\n- **Track via beads:** Use `bd list/show` to pick the next unblocked issue, add comments for status updates, and close with findings/tests noted.\n- **Run the required checks:** `ruff`, `ty`, Playwright smoke, `scripts/check_env.py`, plus any bead-specific tests (e.g., sqlite-vec search or CLI watch). 
Never skip the capture smoke after touching Playwright/OCR code.\n- **Sync docs:** README, PLAN, `docs/config.md`, and `docs/ops.md` must stay consistent; update them alongside code changes so ops can trust the written guidance.\n- **Ops handoff:** For capture/OCR fixes, capture job ids + manifest paths in your bead comment and Mail thread so others can reproduce issues quickly.\n\n## Operations \u0026 automation\n- `scripts/run_smoke.py` — nightly URL set capture + manifest/latency aggregation.\n- `scripts/show_latest_smoke.py` — quick pointers to the latest smoke outputs; manifest rows now include overlap ratios, validation failure counts, and seam marker/hash counts so regressions stand out. The `--weekly` view prints seam marker percentiles plus capture/OCR SLO status (p99 vs 2×p95) using the data generated by the nightly smoke script, and `--slo` renders the aggregated `latest_slo_summary.json` table (counts, p50/p95 capture/OCR/total, budget breaches). It now fails fast when `latest.txt` exists but is empty, so rerun `scripts/update_smoke_pointers.py \u003crun-dir\u003e` whenever the pointer guard triggers.\n- `scripts/olmocr_cli.py` + `docs/olmocr_cli.md` — hosted olmOCR orchestration/diagnostics.\n- `scripts/analyze_stitch.py` — lightweight research helper that reads a manifest index and reports seam counts/hash diversity plus hyphen-break dom-assist incidents per run (optionally `--json`). Handy for bd-0jc overlap/SSIM and hyphen guard experiments.\n- Weekly seam telemetry — run `uv run python scripts/show_latest_smoke.py --weekly --json` (or parse `benchmarks/production/weekly_summary.json`) to pull `seam_markers.count`/`hashes`/`events` p50/p95 for every category. 
Feed those numbers straight into Grafana/Prometheus so seam regressions (or fallback spikes) page operators alongside capture/OCR SLO breaches, and archive `weekly_slo.json`/`.prom` for the rolling capture/OCR SLO window.\n- `mdwb jobs replay manifest \u003cmanifest.json\u003e` — re-run a job with a stored manifest via `POST /replay` (accepts `--api-base`, `--http2`, `--json`); keep `scripts/replay_job.sh` around for legacy automation until everything points at the CLI.\n- `mdwb jobs show \u003cjob-id\u003e` — inspect the latest snapshot plus sweep stats/validation issues in one table. When manifests are missing (cached jobs, trimmed SSE payloads), the CLI still prints stored seam counts (`Seam markers: X (unique hashes: Y)`) so you can spot duplicate sweeps without spelunking manifests. `mdwb diag --ocr-metrics` shows the detailed seam marker table when manifests are available.\n- `scripts/update_smoke_pointers.py \u003crun-dir\u003e [--root path]` — refresh `latest_summary.md`, `latest_manifest_index.json`, and `latest_metrics.json` after ad-hoc smoke runs so dashboards point at the right data (defaults to `MDWB_SMOKE_ROOT` unless `--root` is provided; add `--weekly-source` when overriding the rolling summary). The command now computes `latest_slo_summary.json` by default using the manifest index + PLAN §22 budget file; pass `--no-compute-slo` to skip or `--budget-file` to point at an alternate budget definition.\n- `scripts/check_metrics.py` — ping `/metrics` plus the exporter; supports `--api-base`, `--exporter-url`, `--json`, and now `--check-weekly` (validates `benchmarks/production/weekly_summary.json` so release builds fail fast if the rolling SLOs are blown). When you pass `--check-weekly --json`, the CLI always emits a `weekly` block (status, `summary_path`, `failures`) even if the summary file is missing/unreadable, which makes automation logs self-explanatory. 
`scripts/prom_scrape_check.py` remains as a compatibility wrapper but simply re-exports the same Typer CLI.\n- `scripts/compute_slo.py` — consumes the latest `latest_manifest_index.json` (or any manifest index) plus the benchmark budget file to produce capture/OCR SLO summaries. The CLI writes a JSON report (`--out benchmarks/production/latest_slo_summary.json`) and optionally emits Prometheus textfile metrics via `--prom-output tmp/mdwb_slo.prom`, enabling dashboards/alerts to track per-category p95 totals, breach ratios, and overall SLO status. `scripts/run_smoke.py` invokes this automatically after each smoke run so `benchmarks/production/latest_slo_summary.json` and `latest_slo.prom` stay fresh; rerun manually when you need ad-hoc SLO snapshots.\n- Prometheus metrics now cover capture/OCR/stitch durations, warning/blocklist counts, job completions, and SSE heartbeats via `prometheus-fastapi-instrumentator`. Scrape `/metrics` on the API port or hit the background exporter on `PROMETHEUS_PORT` (default 9000); docs/ops.md lists the metric names + alert hooks.\n- Set `MDWB_CHECK_METRICS=1` (optionally `CHECK_METRICS_TIMEOUT=\u003cseconds\u003e`) when running `scripts/run_checks.sh` to include the Prometheus smoke (`scripts/check_metrics.py`) alongside the usual lint/type/pytest/Playwright stack.\n\n### Handy commands\n```bash\n# Validate env\nuv run python scripts/check_env.py\n\n# Run cli demo job\nuv run python scripts/mdwb_cli.py demo stream\n\n# Replay an existing manifest\nuv run python scripts/mdwb_cli.py jobs replay manifest cache/example.com/.../manifest.json\n\n# Search embeddings for a run (vector as JSON array)\nuv run python scripts/mdwb_cli.py jobs embeddings search JOB_ID --vector \"[0.12, 0.04, ...]\" --top-k 3\n\n# Tail warning log via CLI\nuv run python scripts/mdwb_cli.py warnings --count 25\n\n# Download a job's tar bundle (tiles + markdown + manifest)\nuv run python scripts/mdwb_cli.py jobs bundle \u003cjob-id\u003e --out 
path/to/bundle.tar.zst\n\n# Run nightly smoke for docs/articles only (dry run)\nuv run python scripts/run_smoke.py --date $(date -u +%Y-%m-%d) --category docs_articles --dry-run\n```\n\n## Artifacts you should expect per job\n- `artifact/tiles/tile_*.png` — viewport-sweep tiles (≤1288 px long side) with overlap + SHA metadata.\n- `/jobs/{id}/artifact/highlight?tile=…\u0026y0=…\u0026y1=…` — quick HTML viewer that overlays the region referenced by each provenance comment (handy for code reviews and incident reports).\n- `out.md` — final Markdown with DOM-guided heading normalization plus provenance comments (`\u003c!-- source: tile_i ... , path=…, highlight=/jobs/... --\u003e`) and Links Appendix.\n- `links.json` — anchors/forms/headings/meta harvested from the DOM snapshot.\n- `manifest.json` — CfT label/build, Playwright version, screenshot style hash, warnings, sweep stats, timings, and the new `seam_marker_events` list whenever seam hints were required to align tiles.\n- `dom_snapshot.html` — raw DOM capture used for link diffs and hybrid recovery (when enabled).\n- `bundle.tar.zst` — optional tarball for incidents/export (`Store.build_bundle`).\n- Markdown output now includes seam markers (`\u003c!-- seam-marker … --\u003e`) and enriched provenance comments (`viewport_y`, `overlap_px`, highlight links) plus detailed `\u003c!-- table-header-trimmed reason=… --\u003e` breadcrumbs so reviewers can jump straight to stitched regions.\n\nUse `mdwb jobs bundle …` or `mdwb jobs artifacts manifest …` (or `/jobs/{id}/artifact/...`) to reproduce a job locally and fetch its artifacts for debugging.\n\n## Communication \u0026 task tracking\n- **Beads** (`bd ...`) track every feature/bug (map bead IDs to Plan sections in Agent Mail threads).\n- **Agent Mail** (MCP) is the coordination channel—reserve files before editing, summarize work in the relevant bead thread, and note Plan updates inline (_see §§10–11 for example status notes_).\n\n## Further reading\n- `AGENTS.md` — 
ground rules (no destructive git cmds, uv usage, capture policies).\n- `PLAN_TO_IMPLEMENT_MARKDOWN_WEB_BROWSER_PROJECT.md` — canonical spec + incremental upgrades.\n- `docs/architecture.md` — best practices + data flow diagrams.\n- `docs/blocklist.md`, `docs/config.md`, `docs/models.yaml`, `docs/ops.md`, `docs/olmocr_cli.md` — supporting specs.\n- `docs/release_checklist.md` — step-by-step release \u0026 regression runbook (CfT/Playwright/model toggles, smoke commands, artifact list).\n\n## ❓ FAQ \u0026 Troubleshooting\n\n### Common Issues\n\n**Q: Why is OCR slow/failing?**\n```bash\n# Check OCR quota and performance\nmdwb jobs ocr-metrics job123\nmdwb warnings --count 20 | grep -i \"ocr\\|quota\"\n\n# Reduce concurrency if hitting rate limits\nexport OCR_MAX_CONCURRENCY=5\n```\n\n**A: Common causes:**\n- OCR API rate limiting (reduce `OCR_MAX_CONCURRENCY`)\n- Network latency to OCR service (consider local olmOCR)\n- Complex images requiring more processing time\n\n**Q: Poor OCR accuracy on my content?**\n```bash\n# Check image quality in tiles\nls cache/your-site.com/job123/artifact/tiles/\n# View highlight links for problematic sections\ncurl \"http://localhost:8000/jobs/job123/artifact/highlight?tile=5\u0026y0=100\u0026y1=200\"\n```\n\n**A: Optimization strategies:**\n- Increase viewport size for better text rendering\n- Use authenticated profiles for login-walled content\n- Check for overlay/popup interference in manifest warnings\n\n**Q: Missing content compared to browser view?**\n```bash\n# Check DOM snapshot vs final markdown\ncurl \"http://localhost:8000/jobs/job123/artifact/dom_snapshot.html\"\nmdwb dom links --job-id job123\n```\n\n**A: Common causes:**\n- JavaScript-heavy SPAs (content loads after initial render)\n- Authentication required (use `--profile` with logged-in state)\n- Overlays/popups blocking content (check blocklist configuration)\n\n### Performance Optimization\n\n**Slow jobs:**\n1. 
**Check tile count**: `mdwb diag job123` - High tile counts increase OCR time\n2. **Review warnings**: `mdwb warnings` - Canvas/scroll issues affect performance\n3. **Monitor concurrency**: OCR auto-tune may be throttling due to latency\n\n**Memory issues:**\n1. **Large pages**: Set `TILE_MAX_SIZE` lower to reduce memory per tile\n2. **Concurrent jobs**: Limit active jobs with `MAX_ACTIVE_JOBS`\n3. **Cache cleanup**: Implement a retention policy for old artifacts\n\n**Network problems:**\n1. **OCR connectivity**: Test with `curl ${OLMOCR_SERVER}/health`\n2. **Firewall issues**: Ensure outbound HTTPS access\n3. **Proxy configuration**: Set `HTTP_PROXY`/`HTTPS_PROXY` if needed\n\n### Error Code Reference\n\n| Error Code | Meaning | Solution |\n|------------|---------|----------|\n| `OCR_QUOTA_EXCEEDED` | Hit API rate limits | Wait or increase quota |\n| `SCREENSHOT_TIMEOUT` | Page load too slow | Increase timeout, check URL |\n| `TILE_PROCESSING_FAILED` | Image processing error | Check libvips installation |\n| `MANIFEST_VALIDATION_FAILED` | Corrupt job state | Restart job, check disk space |\n| `DOM_SNAPSHOT_FAILED` | Can't save DOM | Check write permissions |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDicklesworthstone%2Fmarkdown_web_browser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDicklesworthstone%2Fmarkdown_web_browser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDicklesworthstone%2Fmarkdown_web_browser/lists"}