An open API service indexing awesome lists of open source software.

https://github.com/kreuzberg-dev/html-to-markdown

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.
https://github.com/kreuzberg-dev/html-to-markdown

hocr html html-converter markdown markdown-converter rag text-extraction text-processing

Last synced: 2 days ago
JSON representation

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.

Awesome Lists containing this project

README

          

# html-to-markdown




Rust


Python


Node.js


WASM


Java


Go


C#


PHP


Ruby


Elixir



License

Linkedin- Banner



Discord

High-performance HTML → Markdown conversion powered by Rust. Shipping as a Rust crate, Python package, PHP extension, Ruby gem, Elixir Rustler NIF, Node.js bindings, WebAssembly, and standalone CLI with identical rendering behavior across all runtimes.

## Key Features

- **Blazing Fast** – Rust-powered core delivers 10-80× faster conversion than pure Python alternatives (150–280 MB/s)
- **Polyglot** – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, PHP, Go, Java, C#, and Elixir
- **Smart Conversion** – Handles complex documents including nested tables, code blocks, task lists, and hOCR OCR output
- **Metadata Extraction** – Extract document metadata (title, description, headers, links, images, structured data) alongside conversion
- **Visitor Pattern** – Custom callbacks for domain-specific dialects, content filtering, URL rewriting, accessibility validation
- **Highly Configurable** – Control heading styles, code block fences, list formatting, whitespace handling, and HTML sanitization
- **Tag Preservation** – Keep specific HTML tags unconverted when markdown isn't expressive enough
- **Secure by Default** – Built-in HTML sanitization prevents malicious content
- **Consistent Output** – Identical markdown rendering across all language bindings

**[Try the Live Demo →](https://kreuzberg-dev.github.io/html-to-markdown/)**

## Installation

Each language binding provides comprehensive documentation with installation instructions, examples, and best practices. Choose your platform to get started:

**Scripting Languages:**
- **[Python](./packages/python/README.md)** – PyPI package, metadata extraction, visitor pattern, CLI included
- **[Ruby](./packages/ruby/README.md)** – RubyGems package, RBS type definitions, Steep checking
- **[PHP](./packages/php/README.md)** – Composer package + PIE extension, PHP 8.2+, PHPStan level 9
- **[Elixir](./packages/elixir/README.md)** – Hex package, Rustler NIF bindings, Elixir 1.19+

**JavaScript/TypeScript:**
- **[Node.js / TypeScript](./packages/typescript/README.md)** – Native NAPI-RS bindings for Node.js/Bun, fastest performance, WebAssembly for browsers/Deno

**Compiled Languages:**
- **[Go](./packages/go/v2/README.md)** – Go module with FFI bindings, automatic library download
- **[Java](./packages/java/README.md)** – Maven Central, Panama Foreign Function & Memory API, Java 24+
- **[C#](./packages/csharp/README.md)** – NuGet package, .NET 8.0+, P/Invoke FFI bindings

**Native:**
- **[Rust](./crates/html-to-markdown/README.md)** – Core library, flexible feature flags, zero-copy APIs

**Command-Line:**
- **[CLI](https://crates.io/crates/html-to-markdown-cli)** – Cross-platform binary via `cargo install html-to-markdown-cli` or [Homebrew](https://formulae.brew.sh/formula/html-to-markdown)

Metadata Extraction

Extract comprehensive metadata during conversion: title, description, headers, links, images, structured data (JSON-LD, Microdata, RDFa). Use cases: SEO extraction, table-of-contents generation, link validation, accessibility auditing, content migration.

**[Metadata Extraction Guide →](./examples/metadata-extraction/)**

Visitor Pattern

Customize HTML→Markdown conversion with callbacks for specific elements. Intercept links, images, headings, lists, and more. Use cases: domain-specific Markdown dialects (Obsidian, Notion), content filtering, URL rewriting, accessibility validation, analytics.

Supported in: Rust, Python (sync & async), TypeScript/Node.js (sync & async), Ruby, and PHP.

**[Visitor Pattern Guide →](./examples/visitor-pattern/)**

### Visitor Support Matrix

| Binding | Visitor Support | Async Support | Best For |
|---------|-----------------|---------------|----------|
| **Rust** | ✅ Yes | ✅ Tokio | Core library, performance-critical code |
| **Python** | ✅ Yes | ✅ asyncio | Server-side, bulk processing |
| **TypeScript/Node.js** | ✅ Yes | ✅ Promise-based | Server-side Node.js/Bun, best performance |
| **Ruby** | ✅ Yes | ❌ No | Server-side Ruby on Rails, Sinatra |
| **PHP** | ✅ Yes | ❌ No | Server-side PHP, content management |
| **Go** | ❌ No | — | Basic conversion only |
| **Java** | ❌ No | — | Basic conversion only |
| **C#** | ❌ No | — | Basic conversion only |
| **Elixir** | ❌ No | — | Basic conversion only |
| **WebAssembly** | ❌ No | — | Browser, Edge, Deno (FFI limitations) |

For WASM users needing visitor functionality, see [WASM Visitor Alternatives](./crates/html-to-markdown-wasm/README.md#visitor-pattern-support) for recommended approaches.

Performance & Benchmarking

Rust-powered core delivers 150–280 MB/s throughput (10-80× faster than pure Python alternatives). Includes benchmarking tools, memory profiling, streaming strategies, and optimization tips.

**[Performance Guide →](./examples/performance/)**

Tag Preservation

Keep specific HTML tags unconverted when Markdown isn't expressive enough. Useful for tables, SVG, custom elements, or when you need mixed HTML/Markdown output.

See language-specific documentation for `preserveTags` configuration.

Skipping Images

Skip all images during conversion using the `skip_images` option. Useful for text-only extraction or when you want to filter out visual content.

**Rust:**
```rust
use html_to_markdown_rs::{convert, ConversionOptions};

let options = ConversionOptions {
skip_images: true,
..Default::default()
};

let html = r#"

Text with pic image

"#;
let markdown = convert(html, Some(options))?;
// Output: "Text with image" (image tags are removed)
```

**Python:**
```python
from html_to_markdown import convert, ConversionOptions

options = ConversionOptions(skip_images=True)
markdown = convert(html, options)
```

**TypeScript/Node.js:**
```typescript
import { convert, ConversionOptions } from '@kreuzberg/html-to-markdown-node';

const options: ConversionOptions = {
skipImages: true,
};

const markdown = convert(html, options);
```

**Ruby:**
```ruby
require 'html_to_markdown'

options = HtmlToMarkdown::ConversionOptions.new(skip_images: true)
markdown = HtmlToMarkdown.convert(html, options)
```

**PHP:**
```php
use Goldziher\HtmlToMarkdown\HtmlToMarkdown;
use Goldziher\HtmlToMarkdown\Options;

$options = new Options(['skip_images' => true]);
$markdown = HtmlToMarkdown::convert($html, $options);
```

This option is available across all language bindings. When enabled, all `` tags and their associated markdown image syntax are removed from the output.

Secure by Default

Built-in HTML sanitization prevents XSS attacks and malicious content. Powered by ammonia with safe defaults. Configurable via `sanitize` options.

## Contributing

Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:

- Setting up the development environment
- Running tests locally (Rust 95%+ coverage, language bindings 80%+)
- Submitting pull requests
- Reporting issues

All contributions must follow code quality standards enforced via pre-commit hooks (prek).

## License

MIT License – see [LICENSE](LICENSE) for details. You can use html-to-markdown freely in both commercial and closed-source products with no obligations, no viral effects, and no licensing restrictions.