{"id":45167430,"url":"https://github.com/rocklambros/any2md","last_synced_at":"2026-04-26T19:00:45.649Z","repository":{"id":337163404,"uuid":"1152554925","full_name":"rocklambros/any2md","owner":"rocklambros","description":"Convert PDF, DOCX, HTML, and TXT files — or web pages by URL — to clean, LLM-optimized Markdown with YAML frontmatter.","archived":false,"fork":false,"pushed_at":"2026-04-19T17:04:14.000Z","size":76,"stargazers_count":6,"open_issues_count":2,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-19T19:10:05.787Z","etag":null,"topics":["cli","converter","docx","html","llm","markdown","pdf","python","security","txt"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rocklambros.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"rocklambros"}},"created_at":"2026-02-08T03:41:14.000Z","updated_at":"2026-04-19T17:03:34.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/rocklambros/any2md","commit_stats":null,"previous_names":["rocklambros/pdf-to-markdown","rocklambros/mdconv","rocklambros/any2md"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/rocklambros/any2md","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rocklambros%2Fany2md","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
rocklambros%2Fany2md/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rocklambros%2Fany2md/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rocklambros%2Fany2md/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rocklambros","download_url":"https://codeload.github.com/rocklambros/any2md/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rocklambros%2Fany2md/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32308878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T17:23:19.671Z","status":"ssl_error","status_checked_at":"2026-04-26T17:23:19.195Z","response_time":129,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","converter","docx","html","llm","markdown","pdf","python","security","txt"],"created_at":"2026-02-20T07:04:14.707Z","updated_at":"2026-04-26T19:00:45.642Z","avatar_url":"https://github.com/rocklambros.png","language":"Python","funding_links":["https://github.com/sponsors/rocklambros"],"categories":[],"sub_categories":[],"readme":"# any2md\n\nConvert PDF, DOCX, HTML, and TXT files — or web pages by URL — to clean, LLM-optimized Markdown with YAML frontmatter.\n\nOne command. Any format. 
Consistent, structured output ready for language models.\n\n## Quick Start\n\n```bash\npip install any2md\n\nany2md report.pdf\nany2md https://example.com/article\nany2md --help\n```\n\nOutput lands in `./Text/` by default:\n\n```markdown\n---\ntitle: \"Quarterly Financial Report\"\nsource_file: \"report.pdf\"\npages: 12\ntype: pdf\n---\n\n# Quarterly Financial Report\n\nDocument content here...\n```\n\n## Features\n\n| Feature | Description |\n|---------|-------------|\n| **Multi-format** | PDF, DOCX, HTML (.html, .htm), TXT |\n| **URL fetching** | Pass any http/https URL as input |\n| **YAML frontmatter** | Title, source, page/word count, type |\n| **Batch processing** | Single file, directory scan, or mixed inputs |\n| **Auto-routing** | Dispatches to the correct converter by extension |\n| **Smart skip** | Won't overwrite existing files unless `--force` |\n| **Filename sanitization** | Spaces, special characters, unicode dashes handled |\n| **TXT structure detection** | Infers headings, lists, code blocks from plain text |\n| **Title extraction** | Pulls the first H1–H3 heading automatically |\n| **Link stripping** | `--strip-links` removes hyperlinks, keeps text |\n| **SSRF protection** | Blocks requests to private/reserved/loopback IPs |\n| **File size limits** | Configurable max file size via `--max-file-size` |\n| **Lazy loading** | Converter imports deferred until needed for fast startup |\n\n## Installation\n\nRequires **Python 3.10+**.\n\n```bash\npip install any2md\n```\n\n### From source\n\n```bash\ngit clone https://github.com/rocklambros/any2md.git\ncd any2md\npip install .\n```\n\n### Dependencies\n\n| Library | Purpose |\n|---------|---------|\n| [PyMuPDF](https://pymupdf.readthedocs.io/) + [pymupdf4llm](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) | PDF extraction |\n| [mammoth](https://github.com/mwilliamson/python-mammoth) + [markdownify](https://github.com/matthewwithanm/python-markdownify) | DOCX conversion |\n| 
[trafilatura](https://trafilatura.readthedocs.io/) + [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) | HTML/URL extraction |\n| [lxml](https://lxml.de/) | Fast HTML parsing |\n\n## Usage\n\n### Basic conversion\n\n```bash\n# Single file\nany2md report.pdf\n\n# Multiple files\nany2md report.pdf proposal.docx \"meeting notes.pdf\"\n\n# HTML file\nany2md page.html\n\n# Web page by URL\nany2md https://example.com/article\n\n# Plain text file\nany2md notes.txt\n\n# Mixed batch — PDFs, DOCX, HTML, TXT, and URLs together\nany2md doc.pdf page.html notes.txt https://example.com\n```\n\n### Directory scanning\n\n```bash\n# Scan a specific directory\nany2md --input-dir ./documents\n\n# Convert everything in the current directory (default behavior)\nany2md\n```\n\n### Options\n\n```bash\n# Custom output directory\nany2md -o ./converted report.pdf\n\n# Overwrite existing files\nany2md --force\n\n# Strip hyperlinks from output\nany2md --strip-links doc.pdf\n\n# Combine options\nany2md -f -o ./out --strip-links docs/*.pdf docs/*.docx\n```\n\n### Alternative invocations\n\n```bash\n# Module mode (works without installing via pip)\npython -m any2md report.pdf\n\n# Legacy script (backward compatibility)\npython3 mdconv.py report.pdf\n```\n\n## Output Format\n\nEvery converted file has YAML frontmatter followed by cleaned Markdown. 
The frontmatter fields vary by source format:\n\n**PDF** — includes page count:\n\n```markdown\n---\ntitle: \"Quarterly Financial Report\"\nsource_file: \"Q3 Report 2024.pdf\"\npages: 12\ntype: pdf\n---\n```\n\n**DOCX** — includes word count:\n\n```markdown\n---\ntitle: \"Project Proposal\"\nsource_file: \"proposal.docx\"\nword_count: 3847\ntype: docx\n---\n```\n\n**HTML file** — includes word count:\n\n```markdown\n---\ntitle: \"Page Title\"\nsource_file: \"page.html\"\nword_count: 1234\ntype: html\n---\n```\n\n**TXT** — structure inferred via heuristics, includes word count:\n\n```markdown\n---\ntitle: \"Meeting Notes\"\nsource_file: \"notes.txt\"\nword_count: 892\ntype: txt\n---\n```\n\n**URL** — records source URL instead of filename:\n\n```markdown\n---\ntitle: \"Article Title\"\nsource_url: \"https://example.com/article\"\nword_count: 567\ntype: html\n---\n```\n\n## CLI Reference\n\n```\nusage: any2md [-h] [--input-dir PATH] [--force] [--output-dir PATH] [--strip-links] [--max-file-size BYTES] [files ...]\n\nConvert PDF, DOCX, HTML, and TXT files to LLM-optimized Markdown.\n\npositional arguments:\n  files                 Files or URLs to convert. Supports PDF, DOCX, HTML,\n                        TXT files and http(s) URLs. 
If omitted, converts all\n                        supported files in the current directory.\n\noptions:\n  -h, --help            show this help message and exit\n  --input-dir, -i PATH  Directory to scan for supported files (PDF, DOCX, HTML, TXT)\n  --force, -f           Overwrite existing .md files\n  --output-dir, -o PATH Output directory (default: ./Text)\n  --strip-links         Remove markdown links, keeping only the link text\n  --max-file-size BYTES Maximum file size in bytes (default: 104857600)\n```\n\n## Architecture\n\n```\nUser Input (files, URLs, flags)\n         │\n         ▼\n      cli.py ─── parse args, classify URLs vs file paths\n         │\n         ▼\nconverters/__init__.py ─── dispatch by extension\n         │\n    ┌────┼────┬────┐\n    ▼    ▼    ▼    ▼\n pdf  docx  html  txt ─── format-specific extraction\n    │    │    │    │\n    └────┼────┴────┘\n         ▼\n      utils.py ─── clean, title-extract, sanitize, frontmatter\n         │\n         ▼\n      Output ─── YAML frontmatter + Markdown → output_dir/\n```\n\n### Extraction pipelines\n\n| Format | Pipeline |\n|--------|----------|\n| **PDF** | `pymupdf4llm.to_markdown()` → clean → frontmatter |\n| **DOCX** | `mammoth` (DOCX → HTML) → `markdownify` (HTML → Markdown) → clean → frontmatter |\n| **HTML/URL** | `trafilatura` extract with markdown output (fallback: BS4 pre-clean → `markdownify`) → clean → frontmatter |\n| **TXT** | `structurize()` heuristics (headings, lists, code blocks) → clean → frontmatter |\n\n### Adding a new format\n\n1. Create `any2md/converters/newformat.py` with a `convert_newformat(path, output_dir, force, strip_links_flag) → bool` function\n2. Add the extension and function to `CONVERTERS` in `any2md/converters/__init__.py`\n3. 
Add the extension to `SUPPORTED_EXTENSIONS`\n\n## Security\n\n- **SSRF protection**: URL fetching validates resolved IPs against private, reserved, loopback, and link-local ranges before making requests.\n- **Scheme validation**: Only `http` and `https` URL schemes are accepted.\n- **File size limits**: Local files exceeding `--max-file-size` (default 100 MB) are skipped. HTML files are also checked before reading.\n- **Input sanitization**: Filenames are stripped of control characters, null bytes, and path separators.\n- **Trust model**: This tool processes local files and fetches URLs you provide. It does not execute embedded scripts or macros from any input format.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frocklambros%2Fany2md","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frocklambros%2Fany2md","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frocklambros%2Fany2md/lists"}