https://github.com/kreuzberg-dev/html-to-markdown
High-performance, CommonMark-compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.
hocr html html-converter markdown markdown-converter rag text-extraction text-processing
- Host: GitHub
- URL: https://github.com/kreuzberg-dev/html-to-markdown
- Owner: kreuzberg-dev
- License: mit
- Created: 2025-02-03T16:18:12.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-05-23T21:11:57.000Z (10 days ago)
- Last Synced: 2026-05-23T22:14:45.246Z (10 days ago)
- Topics: hocr, html, html-converter, markdown, markdown-converter, rag, text-extraction, text-processing
- Language: HTML
- Homepage:
- Size: 113 MB
- Stars: 732
- Watchers: 7
- Forks: 58
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
- Codeowners: .github/CODEOWNERS
# html-to-markdown
High-performance HTML to Markdown conversion powered by Rust. Ships as native bindings for **Rust, Python, TypeScript/Node.js, Ruby, PHP, Go, Java, C#, Elixir, R, C (FFI), and WebAssembly** with identical rendering across all runtimes.
**[Documentation](https://docs.html-to-markdown.kreuzberg.dev)** | **[API Reference](https://docs.rs/html-to-markdown-rs/)**
## Highlights
- **Rust-native throughput** with html5ever parsing
- **12 language bindings** with consistent output across all runtimes
- **Structured result** — `convert()` returns `ConversionResult` with `content`, `metadata`, `tables`, `images`, and `warnings`
- **Metadata extraction** — title, headers, links, images, structured data (JSON-LD, Microdata, RDFa)
- **Visitor pattern** — custom callbacks for content filtering, URL rewriting, domain-specific dialects
- **Table extraction** — extract structured table data (cells, headers, rendered markdown) during conversion
- **Secure by default** — built-in HTML sanitization via ammonia
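To make the visitor-pattern bullet concrete, here is a minimal sketch of the idea using only Python's standard library. It is not this library's visitor API: `TinyLinkConverter`, `to_markdown`, `absolutize`, and the `https://example.com` base URL are all hypothetical names chosen for illustration, and the toy converter handles nothing but text and `<a href>` elements.

```python
from html.parser import HTMLParser
from typing import Callable, Optional

# Hypothetical callback type: receives an href and returns a rewritten href,
# or None to drop the link and keep only its text.
LinkVisitor = Callable[[str], Optional[str]]


class TinyLinkConverter(HTMLParser):
    """Toy converter (illustration only): handles text and <a href> elements."""

    def __init__(self, visit_link: LinkVisitor) -> None:
        super().__init__(convert_charrefs=True)
        self.visit_link = visit_link
        self.parts: list[str] = []
        self._href_stack: list[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # The visitor decides what happens to this link.
            new_href = self.visit_link(href) if href else None
            self._href_stack.append(new_href)
            if new_href is not None:
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            href = self._href_stack.pop()
            if href is not None:
                self.parts.append(f"]({href})")

    def handle_data(self, data):
        self.parts.append(data)


def to_markdown(html: str, visit_link: LinkVisitor) -> str:
    parser = TinyLinkConverter(visit_link)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


def absolutize(href: str) -> Optional[str]:
    """Example visitor: drop javascript: links, make root-relative links absolute."""
    if href.startswith("javascript:"):
        return None
    if href.startswith("/"):
        return "https://example.com" + href  # assumed base URL
    return href


html = 'See <a href="/docs">the docs</a> or <a href="javascript:void(0)">click</a>.'
print(to_markdown(html, absolutize))
# See [the docs](https://example.com/docs) or click.
```

The point of the design is that filtering and rewriting live in the callback, so the conversion logic itself does not change between use cases. See the documentation for the callbacks the library actually exposes.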
## Quick Start
```bash
# Rust
cargo add html-to-markdown-rs
# Python
pip install html-to-markdown
# TypeScript / Node.js
npm install @kreuzberg/html-to-markdown-node
# Ruby
gem install html-to-markdown
# CLI
cargo install html-to-markdown-cli
# or
brew install kreuzberg-dev/tap/html-to-markdown
```
See the package READMEs for all languages including PHP, Go, Java, C#, Elixir, R, and WASM.
### Usage
`convert()` is the single entry point. It returns a structured `ConversionResult`:
```python
# Python
from html_to_markdown import convert
result = convert("<h1>Hello</h1><p>World</p>")
print(result.content) # # Hello\n\nWorld
print(result.metadata) # title, links, headings, …
```
```typescript
// TypeScript / Node.js
import { convert } from "@kreuzberg/html-to-markdown-node";
const result = convert("<h1>Hello</h1><p>World</p>");
console.log(result.content); // # Hello\n\nWorld
console.log(result.metadata); // title, links, headings, …
```
```rust
// Rust
use html_to_markdown_rs::convert;
let result = convert("<h1>Hello</h1><p>World</p>", None)?;
println!("{}", result.content.unwrap_or_default());
```
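Downstream code usually wants more than `content`. The sketch below shows one way to summarize a result defensively. It runs without the package installed because it uses a stand-in object: the attribute names come from the Highlights list, but their types (a string or `None` for `content`, lists for `tables` and `warnings`) are assumptions, and `summarize` is a hypothetical helper, not part of the library.

```python
from types import SimpleNamespace


def summarize(result) -> str:
    """Build a one-line report from a ConversionResult-like object."""
    # The Rust example calls unwrap_or_default() on content,
    # so treat it as possibly absent here too.
    content = result.content or ""
    tables = list(getattr(result, "tables", None) or [])
    warnings = list(getattr(result, "warnings", None) or [])
    first_line = content.splitlines()[0] if content else "(empty)"
    return f"{first_line} | {len(tables)} table(s) | {len(warnings)} warning(s)"


# Stand-in for what convert() might return; NOT produced by the library.
fake = SimpleNamespace(
    content="# Hello\n\nWorld",
    metadata={"title": None},
    tables=[],
    images=[],
    warnings=[],
)

print(summarize(fake))
# # Hello | 0 table(s) | 0 warning(s)
```

With the real package, the same helper should accept the object returned by `convert()` as long as those attributes exist with compatible types.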
## Language Bindings
| Language | Package | Install |
| -------------------- | ------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
| Rust | [html-to-markdown-rs](https://crates.io/crates/html-to-markdown-rs) | `cargo add html-to-markdown-rs` |
| Python | [html-to-markdown](https://pypi.org/project/html-to-markdown/) | `pip install html-to-markdown` |
| TypeScript / Node.js | [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) | `npm install @kreuzberg/html-to-markdown-node` |
| WebAssembly | [@kreuzberg/html-to-markdown-wasm](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-wasm) | `npm install @kreuzberg/html-to-markdown-wasm` |
| Ruby | [html-to-markdown](https://rubygems.org/gems/html-to-markdown) | `gem install html-to-markdown` |
| PHP | [kreuzberg-dev/html-to-markdown](https://packagist.org/packages/kreuzberg-dev/html-to-markdown) | `composer require kreuzberg-dev/html-to-markdown` |
| Go | [htmltomarkdown](https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v3/htmltomarkdown) | `go get github.com/kreuzberg-dev/html-to-markdown/packages/go/v3` |
| Java | [dev.kreuzberg:html-to-markdown](https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown) | Maven / Gradle |
| C# | [KreuzbergDev.HtmlToMarkdown](https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/) | `dotnet add package KreuzbergDev.HtmlToMarkdown` |
| Elixir               | [html_to_markdown](https://hex.pm/packages/html_to_markdown)                                                 | Add `:html_to_markdown` to `deps` in `mix.exs`, then `mix deps.get` |
| R | [htmltomarkdown](https://kreuzberg-dev.r-universe.dev/htmltomarkdown) | `install.packages("htmltomarkdown")` |
| C (FFI) | [releases](https://github.com/kreuzberg-dev/html-to-markdown/releases) | Pre-built `.so` / `.dll` / `.dylib` |
## Part of Kreuzberg.dev
- [Kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
## Contributing
Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions and guidelines.
## License
MIT License — see [LICENSE](LICENSE) for details.