https://github.com/kreuzberg-dev/html-to-markdown
High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.
hocr html html-converter markdown markdown-converter rag text-extraction text-processing
- Host: GitHub
- URL: https://github.com/kreuzberg-dev/html-to-markdown
- Owner: kreuzberg-dev
- License: mit
- Created: 2025-02-03T16:18:12.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2026-01-15T09:54:32.000Z (11 days ago)
- Last Synced: 2026-01-15T15:28:28.024Z (11 days ago)
- Topics: hocr, html, html-converter, markdown, markdown-converter, rag, text-extraction, text-processing
- Language: HTML
- Homepage:
- Size: 12.1 MB
- Stars: 477
- Watchers: 5
- Forks: 46
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# html-to-markdown

High-performance HTML → Markdown conversion powered by Rust. It ships as a Rust crate, Python package, PHP extension, Ruby gem, Elixir Rustler NIF, Node.js bindings, Go module, Java and C# packages, WebAssembly, and a standalone CLI, with identical rendering behavior across all runtimes.
## Key Features
- **Blazing Fast** – Rust-powered core delivers 10-80× faster conversion than pure Python alternatives (150–280 MB/s)
- **Polyglot** – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, PHP, Go, Java, C#, and Elixir
- **Smart Conversion** – Handles complex documents including nested tables, code blocks, task lists, and hOCR OCR output
- **Metadata Extraction** – Extract document metadata (title, description, headers, links, images, structured data) alongside conversion
- **Visitor Pattern** – Custom callbacks for domain-specific dialects, content filtering, URL rewriting, accessibility validation
- **Highly Configurable** – Control heading styles, code block fences, list formatting, whitespace handling, and HTML sanitization
- **Tag Preservation** – Keep specific HTML tags unconverted when markdown isn't expressive enough
- **Secure by Default** – Built-in HTML sanitization prevents malicious content
- **Consistent Output** – Identical markdown rendering across all language bindings
**[Try the Live Demo →](https://kreuzberg-dev.github.io/html-to-markdown/)**
## Installation
Each language binding provides comprehensive documentation with installation instructions, examples, and best practices. Choose your platform to get started:
**Scripting Languages:**
- **[Python](./packages/python/README.md)** – PyPI package, metadata extraction, visitor pattern, CLI included
- **[Ruby](./packages/ruby/README.md)** – RubyGems package, RBS type definitions, Steep checking
- **[PHP](./packages/php/README.md)** – Composer package + PIE extension, PHP 8.2+, PHPStan level 9
- **[Elixir](./packages/elixir/README.md)** – Hex package, Rustler NIF bindings, Elixir 1.19+
**JavaScript/TypeScript:**
- **[Node.js / TypeScript](./packages/typescript/README.md)** – Native NAPI-RS bindings for Node.js/Bun, fastest performance, WebAssembly for browsers/Deno
**Compiled Languages:**
- **[Go](./packages/go/v2/README.md)** – Go module with FFI bindings, automatic library download
- **[Java](./packages/java/README.md)** – Maven Central, Panama Foreign Function & Memory API, Java 24+
- **[C#](./packages/csharp/README.md)** – NuGet package, .NET 8.0+, P/Invoke FFI bindings
**Native:**
- **[Rust](./crates/html-to-markdown/README.md)** – Core library, flexible feature flags, zero-copy APIs
**Command-Line:**
- **[CLI](https://crates.io/crates/html-to-markdown-cli)** – Cross-platform binary via `cargo install html-to-markdown-cli` or [Homebrew](https://formulae.brew.sh/formula/html-to-markdown)
## Metadata Extraction
Extract comprehensive metadata during conversion: title, description, headers, links, images, structured data (JSON-LD, Microdata, RDFa). Use cases: SEO extraction, table-of-contents generation, link validation, accessibility auditing, content migration.
**[Metadata Extraction Guide →](./examples/metadata-extraction/)**
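To make the shape of this metadata concrete, the sketch below collects a title, headers, links, and images using only Python's standard-library `html.parser`. It illustrates the idea and is not html-to-markdown's API: `MetadataCollector`, `extract_metadata`, and the layout of the result dictionary are hypothetical names chosen for this example, and structured data (JSON-LD, Microdata, RDFa) is left out.

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class MetadataCollector(HTMLParser):
    """Hypothetical collector: gathers title, headers, links, and images."""

    def __init__(self):
        super().__init__()
        self.metadata = {"title": None, "headers": [], "links": [], "images": []}
        self._capture = None  # tag whose text is currently being buffered
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in HEADING_TAGS:
            self._capture = tag
            self._buffer = []
        elif tag == "a" and attrs.get("href"):
            self.metadata["links"].append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.metadata["images"].append(attrs["src"])

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._buffer).strip()
        if tag == "title":
            self.metadata["title"] = text
        else:
            # Store headers as (level, text) pairs, e.g. (2, "Usage").
            self.metadata["headers"].append((int(tag[1]), text))
        self._capture = None


def extract_metadata(html):
    """Parse `html` and return the collected metadata dictionary."""
    collector = MetadataCollector()
    collector.feed(html)
    collector.close()
    return collector.metadata
```

The `(level, text)` header pairs are the kind of data a table-of-contents generator needs, and the collected link list is what a link validator would iterate over.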
## Visitor Pattern
Customize HTML→Markdown conversion with callbacks for specific elements. Intercept links, images, headings, lists, and more. Use cases: domain-specific Markdown dialects (Obsidian, Notion), content filtering, URL rewriting, accessibility validation, analytics.
Supported in: Rust, Python (sync & async), TypeScript/Node.js (sync & async), Ruby, and PHP.
**[Visitor Pattern Guide →](./examples/visitor-pattern/)**
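The callback idea can be sketched with a toy converter built on Python's standard-library `html.parser`. Everything here is an assumption made for illustration: `LinkVisitor`, `visit_link`, `MiniConverter`, `AbsoluteUrlVisitor`, and `convert_with_visitor` are hypothetical names, not the library's visitor interface, and the toy handles only plain text and `<a>` elements.

```python
from html.parser import HTMLParser


class LinkVisitor:
    """Hypothetical visitor: return replacement Markdown, or None for the default."""

    def visit_link(self, href, text):
        return None


class MiniConverter(HTMLParser):
    """Toy converter that handles only text and <a> elements."""

    def __init__(self, visitor):
        super().__init__()
        self.visitor = visitor
        self.out = []
        self._href = None  # set while inside an <a> element
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._link_text = []

    def handle_data(self, data):
        if self._href is not None:
            self._link_text.append(data)
        else:
            self.out.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._link_text)
            # Give the visitor the first chance to render this link.
            custom = self.visitor.visit_link(self._href, text)
            default = f"[{text}]({self._href})"
            self.out.append(custom if custom is not None else default)
            self._href = None


class AbsoluteUrlVisitor(LinkVisitor):
    """Example visitor: rewrites site-relative links against a base URL."""

    def __init__(self, base):
        self.base = base.rstrip("/")

    def visit_link(self, href, text):
        if href.startswith("/"):
            return f"[{text}]({self.base}{href})"
        return None  # fall back to the default rendering


def convert_with_visitor(html, visitor):
    parser = MiniConverter(visitor)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Returning `None` to mean "use the default rendering" keeps each visitor small: it overrides only the elements it cares about, which is how URL rewriting or content filtering can be layered on without reimplementing the converter.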
### Visitor Support Matrix
| Binding | Visitor Support | Async Support | Best For |
|---------|-----------------|---------------|----------|
| **Rust** | ✅ Yes | ✅ Tokio | Core library, performance-critical code |
| **Python** | ✅ Yes | ✅ asyncio | Server-side, bulk processing |
| **TypeScript/Node.js** | ✅ Yes | ✅ Promise-based | Server-side Node.js/Bun, best performance |
| **Ruby** | ✅ Yes | ❌ No | Server-side Ruby on Rails, Sinatra |
| **PHP** | ✅ Yes | ❌ No | Server-side PHP, content management |
| **Go** | ❌ No | — | Basic conversion only |
| **Java** | ❌ No | — | Basic conversion only |
| **C#** | ❌ No | — | Basic conversion only |
| **Elixir** | ❌ No | — | Basic conversion only |
| **WebAssembly** | ❌ No | — | Browser, Edge, Deno (FFI limitations) |
For WASM users needing visitor functionality, see [WASM Visitor Alternatives](./crates/html-to-markdown-wasm/README.md#visitor-pattern-support) for recommended approaches.
## Performance & Benchmarking
The Rust-powered core delivers 150–280 MB/s throughput (10–80× faster than pure Python alternatives). The performance guide covers benchmarking tools, memory profiling, streaming strategies, and optimization tips.
**[Performance Guide →](./examples/performance/)**
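A throughput figure in MB/s is the number of input bytes processed divided by elapsed time. The sketch below shows one way to measure it for any converter callable. `measure_throughput` and `placeholder_convert` are hypothetical helpers written for this example, not part of the project's benchmarking tools, and 1 MB is taken to mean 1,000,000 bytes.

```python
import time


def measure_throughput(convert, html, repeats=20):
    """Return throughput in MB/s (1 MB = 1,000,000 bytes) for `convert`."""
    size_bytes = len(html.encode("utf-8"))  # input size in bytes, not characters
    start = time.perf_counter()
    for _ in range(repeats):
        convert(html)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:  # guard against a timer too coarse to register the run
        return float("inf")
    return size_bytes * repeats / elapsed / 1_000_000


def placeholder_convert(html):
    """Stand-in converter (an assumption); swap in the real convert function."""
    return html.replace("<p>", "").replace("</p>", "\n\n")


sample = "<p>Hello, world.</p>" * 10_000
print(f"{measure_throughput(placeholder_convert, sample):.1f} MB/s")
```

Results depend heavily on document size and structure, so measure with inputs that resemble the real workload instead of a single synthetic sample.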
## Tag Preservation
Keep specific HTML tags unconverted when Markdown isn't expressive enough. Useful for tables, SVG, custom elements, or when you need mixed HTML/Markdown output.
See language-specific documentation for `preserveTags` configuration.
## Skipping Images
Skip all images during conversion using the `skip_images` option. Useful for text-only extraction or when you want to filter out visual content.
**Rust:**
```rust
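use html_to_markdown_rs::{convert, ConversionOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Drop every image; all other options keep their defaults.
    let options = ConversionOptions {
        skip_images: true,
        ..Default::default()
    };
    // Sample input; the image URL is a placeholder.
    let html = r#"
<p>Text with
<img src="image.jpg" alt="example">
image</p>
"#;
    let markdown = convert(html, Some(options))?;
    // Output: "Text with image" (image tags are removed)
    println!("{markdown}");
    Ok(())
}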
```
**Python:**
```python
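from html_to_markdown import convert, ConversionOptions

# Sample input; the image URL is a placeholder.
html = '<p>Text with <img src="image.jpg" alt="example"> image</p>'
options = ConversionOptions(skip_images=True)
markdown = convert(html, options)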
```
**TypeScript/Node.js:**
```typescript
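import { convert, ConversionOptions } from '@kreuzberg/html-to-markdown-node';

// Sample input; the image URL is a placeholder.
const html = '<p>Text with <img src="image.jpg" alt="example"> image</p>';
const options: ConversionOptions = {
  skipImages: true,
};
const markdown = convert(html, options);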
```
**Ruby:**
```ruby
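require 'html_to_markdown'

# Sample input; the image URL is a placeholder.
html = '<p>Text with <img src="image.jpg" alt="example"> image</p>'
options = HtmlToMarkdown::ConversionOptions.new(skip_images: true)
markdown = HtmlToMarkdown.convert(html, options)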
```
**PHP:**
```php
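use Goldziher\HtmlToMarkdown\HtmlToMarkdown;
use Goldziher\HtmlToMarkdown\Options;

// Sample input; the image URL is a placeholder.
$html = '<p>Text with <img src="image.jpg" alt="example"> image</p>';
$options = new Options(['skip_images' => true]);
$markdown = HtmlToMarkdown::convert($html, $options);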
```
This option is available across all language bindings. When enabled, all `<img>` tags and their associated Markdown image syntax are removed from the output.
## Secure by Default
Built-in HTML sanitization prevents XSS attacks and malicious content. Powered by ammonia with safe defaults. Configurable via `sanitize` options.
## Contributing
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:
- Setting up the development environment
- Running tests locally (Rust 95%+ coverage, language bindings 80%+)
- Submitting pull requests
- Reporting issues
All contributions must follow code quality standards enforced via pre-commit hooks (prek).
## License
MIT License – see [LICENSE](LICENSE) for details. You can use html-to-markdown freely in both commercial and closed-source products. The only condition is that the copyright and license notice be retained; there are no copyleft (viral) requirements.