https://github.com/rocklambros/any2md
Convert PDF, DOCX, HTML, and TXT files — or web pages by URL — to clean, LLM-optimized Markdown with YAML frontmatter.
https://github.com/rocklambros/any2md
cli converter docx html llm markdown pdf python security txt
Last synced: 2 months ago
JSON representation
Convert PDF, DOCX, HTML, and TXT files — or web pages by URL — to clean, LLM-optimized Markdown with YAML frontmatter.
- Host: GitHub
- URL: https://github.com/rocklambros/any2md
- Owner: rocklambros
- License: mit
- Created: 2026-02-08T03:41:14.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-04-19T17:04:14.000Z (2 months ago)
- Last Synced: 2026-04-19T19:10:05.787Z (2 months ago)
- Topics: cli, converter, docx, html, llm, markdown, pdf, python, security, txt
- Language: Python
- Size: 74.2 KB
- Stars: 6
- Watchers: 0
- Forks: 1
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
README
# any2md
Convert PDF, DOCX, HTML, and TXT files — or web pages by URL — to clean, LLM-optimized Markdown with YAML frontmatter.
One command. Any format. Consistent, structured output ready for language models.
## Quick Start
```bash
pip install any2md
any2md report.pdf
any2md https://example.com/article
any2md --help
```
Output lands in `./Text/` by default:
```markdown
---
title: "Quarterly Financial Report"
source_file: "report.pdf"
pages: 12
type: pdf
---
# Quarterly Financial Report
Document content here...
```
## Features
| Feature | Description |
|---------|-------------|
| **Multi-format** | PDF, DOCX, HTML (.html, .htm), TXT |
| **URL fetching** | Pass any http/https URL as input |
| **YAML frontmatter** | Title, source, page/word count, type |
| **Batch processing** | Single file, directory scan, or mixed inputs |
| **Auto-routing** | Dispatches to the correct converter by extension |
| **Smart skip** | Won't overwrite existing files unless `--force` |
| **Filename sanitization** | Spaces, special characters, unicode dashes handled |
| **TXT structure detection** | Infers headings, lists, code blocks from plain text |
| **Title extraction** | Pulls the first H1–H3 heading automatically |
| **Link stripping** | `--strip-links` removes hyperlinks, keeps text |
| **SSRF protection** | Blocks requests to private/reserved/loopback IPs |
| **File size limits** | Configurable max file size via `--max-file-size` |
| **Lazy loading** | Converter imports deferred until needed for fast startup |
## Installation
Requires **Python 3.10+**.
```bash
pip install any2md
```
### From source
```bash
git clone https://github.com/rocklambros/any2md.git
cd any2md
pip install .
```
### Dependencies
| Library | Purpose |
|---------|---------|
| [PyMuPDF](https://pymupdf.readthedocs.io/) + [pymupdf4llm](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) | PDF extraction |
| [mammoth](https://github.com/mwilliamson/python-mammoth) + [markdownify](https://github.com/matthewwithanm/python-markdownify) | DOCX conversion |
| [trafilatura](https://trafilatura.readthedocs.io/) + [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) | HTML/URL extraction |
| [lxml](https://lxml.de/) | Fast HTML parsing |
## Usage
### Basic conversion
```bash
# Single file
any2md report.pdf
# Multiple files
any2md report.pdf proposal.docx "meeting notes.pdf"
# HTML file
any2md page.html
# Web page by URL
any2md https://example.com/article
# Plain text file
any2md notes.txt
# Mixed batch — PDFs, DOCX, HTML, TXT, and URLs together
any2md doc.pdf page.html notes.txt https://example.com
```
### Directory scanning
```bash
# Scan a specific directory
any2md --input-dir ./documents
# Convert everything in the current directory (default behavior)
any2md
```
### Options
```bash
# Custom output directory
any2md -o ./converted report.pdf
# Overwrite existing files
any2md --force
# Strip hyperlinks from output
any2md --strip-links doc.pdf
# Combine options
any2md -f -o ./out --strip-links docs/*.pdf docs/*.docx
```
### Alternative invocations
```bash
# Module mode (works without installing via pip)
python -m any2md report.pdf
# Legacy script (backward compatibility)
python3 mdconv.py report.pdf
```
## Output Format
Every converted file has YAML frontmatter followed by cleaned Markdown. The frontmatter fields vary by source format:
**PDF** — includes page count:
```markdown
---
title: "Quarterly Financial Report"
source_file: "Q3 Report 2024.pdf"
pages: 12
type: pdf
---
```
**DOCX** — includes word count:
```markdown
---
title: "Project Proposal"
source_file: "proposal.docx"
word_count: 3847
type: docx
---
```
**HTML file** — includes word count:
```markdown
---
title: "Page Title"
source_file: "page.html"
word_count: 1234
type: html
---
```
**TXT** — structure inferred via heuristics, includes word count:
```markdown
---
title: "Meeting Notes"
source_file: "notes.txt"
word_count: 892
type: txt
---
```
**URL** — records source URL instead of filename:
```markdown
---
title: "Article Title"
source_url: "https://example.com/article"
word_count: 567
type: html
---
```
## CLI Reference
```
usage: any2md [-h] [--input-dir PATH] [--force] [--output-dir PATH] [--strip-links] [files ...]
Convert PDF, DOCX, HTML, and TXT files to LLM-optimized Markdown.
positional arguments:
files Files or URLs to convert. Supports PDF, DOCX, HTML,
TXT files and http(s) URLs. If omitted, converts all
supported files in the current directory.
options:
-h, --help show this help message and exit
--input-dir, -i PATH Directory to scan for supported files (PDF, DOCX, HTML, TXT)
--force, -f Overwrite existing .md files
--output-dir, -o PATH Output directory (default: ./Text)
--strip-links Remove markdown links, keeping only the link text
--max-file-size BYTES Maximum file size in bytes (default: 104857600)
```
## Architecture
```
User Input (files, URLs, flags)
│
▼
cli.py ─── parse args, classify URLs vs file paths
│
▼
converters/__init__.py ─── dispatch by extension
│
┌────┼────┬────┐
▼ ▼ ▼ ▼
pdf docx html txt ─── format-specific extraction
│ │ │ │
└────┼────┴────┘
▼
utils.py ─── clean, title-extract, sanitize, frontmatter
│
▼
Output ─── YAML frontmatter + Markdown → output_dir/
```
### Extraction pipelines
| Format | Pipeline |
|--------|----------|
| **PDF** | `pymupdf4llm.to_markdown()` → clean → frontmatter |
| **DOCX** | `mammoth` (DOCX → HTML) → `markdownify` (HTML → Markdown) → clean → frontmatter |
| **HTML/URL** | `trafilatura` extract with markdown output (fallback: BS4 pre-clean → `markdownify`) → clean → frontmatter |
| **TXT** | `structurize()` heuristics (headings, lists, code blocks) → clean → frontmatter |
### Adding a new format
1. Create `any2md/converters/newformat.py` with a `convert_newformat(path, output_dir, force, strip_links_flag) → bool` function
2. Add the extension and function to `CONVERTERS` in `any2md/converters/__init__.py`
3. Add the extension to `SUPPORTED_EXTENSIONS`
## Security
- **SSRF protection**: URL fetching validates resolved IPs against private, reserved, loopback, and link-local ranges before making requests.
- **Scheme validation**: Only `http` and `https` URL schemes are accepted.
- **File size limits**: Local files exceeding `--max-file-size` (default 100 MB) are skipped. HTML files are also checked before reading.
- **Input sanitization**: Filenames are stripped of control characters, null bytes, and path separators.
- **Trust model**: This tool processes local files and fetches URLs you provide. It does not execute embedded scripts or macros from any input format.
## License
MIT