https://github.com/agoodway/html2markdown
Convert HTML to Markdown with Elixir
elixir html html2markdown markdown rag
- Host: GitHub
- URL: https://github.com/agoodway/html2markdown
- Owner: agoodway
- License: mit
- Created: 2024-06-20T21:06:30.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-09-12T09:34:50.000Z (4 months ago)
- Last Synced: 2025-10-12T14:26:43.010Z (3 months ago)
- Topics: elixir, html, html2markdown, markdown, rag
- Language: Elixir
- Homepage:
- Size: 121 KB
- Stars: 33
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Html2Markdown
[Hex.pm](https://hex.pm/packages/html2markdown) | [Documentation](https://hexdocs.pm/html2markdown) | [License](https://github.com/cpursley/html2markdown/blob/main/LICENSE) | [CI](https://github.com/agoodway/html2markdown/actions/workflows/ci.yml)
Convert HTML to clean, readable Markdown. Designed for content extraction, this library handles common HTML patterns while filtering out non-content elements such as navigation and scripts.
## Installation
Add `html2markdown` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:html2markdown, "~> 0.3.1"}
  ]
end
```
## Quick Start
```elixir
# Basic conversion
Html2Markdown.convert("<h1>Hello World</h1><p>Welcome to <strong>Elixir</strong>!</p>")
# => "\n# Hello World\n\n\n\nWelcome to **Elixir**!\n"
# With custom options
Html2Markdown.convert(html, %{
  navigation_classes: ["nav", "menu", "custom-nav"],
  normalize_whitespace: true
})
```
## Features
- **Smart Content Extraction**: Automatically removes navigation, ads, and other non-content elements
- **HTML5 Support**: Handles modern semantic HTML5 elements
## Configuration Options
```elixir
Html2Markdown.convert(html, %{
  # CSS classes that identify navigation elements to remove
  navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
  # HTML tags to filter out during conversion
  non_content_tags: ["script", "style", "form", "nav", ...],
  # Markdown flavor (currently :basic, future: :gfm, :commonmark)
  markdown_flavor: :basic,
  # Normalize whitespace (collapses multiple spaces, trims)
  normalize_whitespace: true
})
```
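When several call sites share the same settings, it can help to keep them in one map and merge per-call overrides on top. The sketch below assumes the options are passed as an ordinary Elixir map, as in the example above; `MyApp.MarkdownOptions`, `base/0`, and `with_overrides/1` are hypothetical names and not part of the library.

```elixir
defmodule MyApp.MarkdownOptions do
  # Hypothetical helper module; not part of html2markdown.

  # Shared settings, using option keys from the configuration example.
  def base do
    %{
      navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
      normalize_whitespace: true
    }
  end

  # Per-call overrides win: Map.merge/2 keeps the right-hand value on key conflicts.
  def with_overrides(overrides) when is_map(overrides) do
    Map.merge(base(), overrides)
  end
end

opts = MyApp.MarkdownOptions.with_overrides(%{navigation_classes: ["nav", "breadcrumbs"]})
# opts can then be passed as the second argument: Html2Markdown.convert(html, opts)
```

Note that an override replaces the whole list for a key rather than appending to it, so pass the full set of classes you want filtered.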
## Common Use Cases
### Web Scraping
Extract readable content from web pages:
```elixir
# Req.get/1 returns {:ok, response}; Req.get!/1 returns the response directly
{:ok, %{body: html}} = Req.get(url)
markdown = Html2Markdown.convert(html)
```
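When scraping many pages, failed requests and non-200 responses are usually worth handling explicitly instead of letting a pattern match fail. The following is a minimal sketch of one way to do that; `MyApp.PageFetcher` and `to_markdown/3` are hypothetical names, and the fetch and convert functions are passed in as arguments so the control flow can be exercised without network access. In a real project they would typically be `&Req.get/1` and `&Html2Markdown.convert/1`.

```elixir
defmodule MyApp.PageFetcher do
  # Hypothetical wrapper; not part of html2markdown or Req.
  # `fetch` is expected to return {:ok, %{status: ..., body: ...}} or {:error, reason},
  # which matches the shape of Req.get/1. `convert` turns an HTML string into Markdown.
  def to_markdown(url, fetch, convert) do
    case fetch.(url) do
      {:ok, %{status: 200, body: html}} when is_binary(html) ->
        {:ok, convert.(html)}

      # Req decodes some bodies (for example JSON) into maps, so guard against that.
      {:ok, %{status: 200}} ->
        {:error, :non_text_body}

      {:ok, %{status: status}} ->
        {:error, {:unexpected_status, status}}

      {:error, reason} ->
        {:error, reason}
    end
  end
end

# Stand-ins for the real functions, used only to exercise the control flow.
fake_fetch = fn _url -> {:ok, %{status: 404, body: "Not found"}} end
fake_convert = fn html -> html end

result = MyApp.PageFetcher.to_markdown("https://example.com/missing", fake_fetch, fake_convert)
```

Returning tagged tuples lets the caller decide whether to retry, skip, or log a page rather than crashing the whole scrape.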
### Content Migration
Convert existing HTML content to Markdown:
```elixir
# Convert blog posts from HTML to Markdown
html_content
|> Html2Markdown.convert(%{normalize_whitespace: true})
|> save_as_markdown()
```
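The `save_as_markdown` step above is left to the caller. One possible shape for it, offered as a sketch only, writes the Markdown next to the original HTML file with the extension swapped. `MyApp.Migration` is a hypothetical module, and this version takes the source path as a second argument, unlike the single-argument call shown above.

```elixir
defmodule MyApp.Migration do
  # Hypothetical helper; not part of html2markdown.
  # Writes `markdown` beside the source file, replacing its extension with .md,
  # and returns the path that was written.
  def save_as_markdown(markdown, html_path) do
    md_path = Path.rootname(html_path) <> ".md"
    File.write!(md_path, markdown)
    md_path
  end
end

# Usage in a pipeline (html_content and the path are placeholders):
#
#   html_content
#   |> Html2Markdown.convert(%{normalize_whitespace: true})
#   |> MyApp.Migration.save_as_markdown("posts/hello.html")
```

Taking the Markdown as the first argument keeps the helper pipe-friendly.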
### Email Processing
Clean up HTML emails for plain text storage:
```elixir
email_html
|> Html2Markdown.convert(%{
  non_content_tags: ["style", "script", "meta"],
  navigation_classes: ["unsubscribe", "footer"]
})
```
## Supported Elements
- **Headings**: `<h1>` through `<h6>`
- **Text**: Paragraphs, emphasis (`<em>`, `<i>`), strong (`<strong>`, `<b>`)
- **Lists**: Ordered and unordered lists with nesting
- **Links**: `<a>` tags with proper URL handling
- **Images**: `<img>` and related image elements
- **Code**: Both inline `<code>` and block `<pre>` elements
- **Tables**: Full table support with headers
- **Quotes**: `<blockquote>` and `<q>` elements
- **HTML5**: Modern semantic elements