{"id":51382668,"url":"https://github.com/yingkitw/web2md","last_synced_at":"2026-07-03T17:07:38.790Z","repository":{"id":368872419,"uuid":"1285608792","full_name":"yingkitw/web2md","owner":"yingkitw","description":null,"archived":false,"fork":false,"pushed_at":"2026-07-02T14:40:18.000Z","size":47,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-07-02T16:29:50.330Z","etag":null,"topics":["browser","markdown","md","web"],"latest_commit_sha":null,"homepage":"https://web2md.net","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yingkitw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-07-01T01:29:31.000Z","updated_at":"2026-07-02T14:42:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/yingkitw/web2md","commit_stats":null,"previous_names":["yingkitw/web2md"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/yingkitw/web2md","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yingkitw%2Fweb2md","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yingkitw%2Fweb2md/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yingkitw%2Fweb2md/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yingkitw%2Fweb2md/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yingkitw","download_url":"https://codeload.github.com/yingkitw/web2md/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yingkitw%2Fweb2md/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35094171,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-03T02:00:05.635Z","response_time":110,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["browser","markdown","md","web"],"created_at":"2026-07-03T17:07:38.046Z","updated_at":"2026-07-03T17:07:38.784Z","avatar_url":"https://github.com/yingkitw.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web2MD\n\nA tool that fetches web pages and returns them as Markdown. Designed to minimize token usage when invoked via MCP (Model Context Protocol).\n\n## Why?\n\nRaw HTML is noisy: scripts, styles, ads, and markup bloat consume LLM context window. Web2MD fetches a page and converts it to clean Markdown, preserving content hierarchy while stripping everything non-essential.\n\n## Quick Start\n\n```bash\n# Interactive terminal browser (default — Lynx-like)\ncargo run -- https://example.com\n# [1-20] follow numbered link  [b]ack  [f]orward  [u] enter URL  [q]uit\n\n# Fetch a page as Markdown (one-shot)\ncargo run -- fetch https://example.com\n\n# Limit output length\ncargo run -- fetch https://example.com --max-length 4000\n\n# Render with ANSI colors in the terminal (bold headings, underlined links)\ncargo run -- fetch https://example.com --render\n\n# Add a polite delay between requests (milliseconds)\ncargo run -- fetch https://example.com --delay 500\n\n# Interactive terminal browser (explicit)\ncargo run -- browse https://example.com\n\n# Run as MCP server (stdio JSON-RPC)\ncargo run -- mcp\n```\n\n## Features\n\n- **Interactive terminal browser** (`browse`): Lynx-like navigation with numbered links, back/forward history\n- **ANSI rendering** (`--render`): Bold headings, underlined cyan links, colored code blocks in terminal output\n- **Table rendering**: Markdown tables drawn with box-drawing characters (`┌─┬─┐`)\n- **Iframe inlining**: Fetches `\u003ciframe src=\"...\"\u003e` content and embeds it into the parent page\n- **Noise reduction**: Strips `\u003cscript\u003e`, `\u003cstyle\u003e`, `\u003ciframe\u003e`, `\u003cnav\u003e`, `\u003cfooter\u003e`, `\u003caside\u003e`, `\u003cnoscript\u003e`, `\u003cform\u003e`, `\u003cheader\u003e`, HTML comments, and excessive whitespace (use `--keep-header` to preserve `\u003cheader\u003e`)\n- **Content deduplication**: Removes duplicate paragraph-level blocks to further reduce token output\n- **Main content extraction** (`--main-content`): Extracts `\u003carticle\u003e`, `\u003cmain\u003e`, or `[role=\"main\"]` content; falls back to readability scoring (text-density vs link-density) on `\u003cdiv\u003e`/`\u003csection\u003e` blocks for pages without semantic tags\n- **Code language detection**: Preserves language annotations from `\u003ccode class=\"language-xxx\"\u003e` as fenced block languages (` ```rust `)\n- **Auth support**: Cookies (`--cookie`) and custom headers (`--header`) for authenticated pages\n- **Rate limiting** (`--delay`): Polite delay between consecutive requests to avoid hammering servers\n- **Caching** (`--cache-ttl`): In-memory cache with configurable TTL to avoid re-fetching the same URL\n- **MCP server**: stdio JSON-RPC transport for LLM tool integration\n- **Metadata extraction**: Title, meta description, Open Graph description, author, and publication date returned in MCP response\n\n## Architecture\n\n- **Browser** (`browser.rs`): Minimal HTTP client with iframe inlining. No rendering engine—intentionally lightweight.\n- **PageToMarkdown** (`markdown.rs`): HTML-to-Markdown conversion. Strips scripts, styles, iframes, images (optional).\n- **McpServer** (`mcp.rs`): JSON-RPC server wrapper exposing a `fetch` tool.\n- **CLI** (`main.rs`): `fetch` (one-shot), `browse` (interactive), `mcp` (server). Default mode is `browse`.\n\nSee [ARCHITECTURE.md](ARCHITECTURE.md) for details.\n\n## Project Status\n\nSee [TODO.md](TODO.md) for remaining work and [SPEC.md](SPEC.md) for protocol contracts.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyingkitw%2Fweb2md","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyingkitw%2Fweb2md","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyingkitw%2Fweb2md/lists"}