https://github.com/agoodway/html2markdown

Convert HTML to Markdown with Elixir
elixir html html2markdown markdown rag

Host: GitHub
URL: https://github.com/agoodway/html2markdown
Owner: agoodway
License: mit
Created: 2024-06-20T21:06:30.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-09-12T09:34:50.000Z (4 months ago)
Last Synced: 2025-10-12T14:26:43.010Z (3 months ago)
Topics: elixir, html, html2markdown, markdown, rag
Language: Elixir
Homepage:
Size: 121 KB
Stars: 33
Watchers: 1
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Html2Markdown

[![Hex.pm](https://img.shields.io/hexpm/v/html2markdown.svg)](https://hex.pm/packages/html2markdown)

[![Hex Docs](https://img.shields.io/badge/hex-docs-purple.svg)](https://hexdocs.pm/html2markdown)

[![License](https://img.shields.io/hexpm/l/html2markdown.svg)](https://github.com/cpursley/html2markdown/blob/main/LICENSE)

[![CI](https://github.com/agoodway/html2markdown/workflows/CI/badge.svg)](https://github.com/agoodway/html2markdown/actions/workflows/ci.yml)

Convert HTML to clean, readable Markdown. Designed for content extraction, this library handles common HTML patterns while filtering out non-content elements like navigation and scripts.

## Installation

Add `html2markdown` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:html2markdown, "~> 0.3.1"}
  ]
end
```

## Quick Start

```elixir
# Basic conversion
Html2Markdown.convert("<h1>Hello World</h1><p>Welcome to <strong>Elixir</strong>!</p>")
# => "\n# Hello World\n\n\n\nWelcome to **Elixir**!\n"

# With custom options
Html2Markdown.convert(html, %{
  navigation_classes: ["nav", "menu", "custom-nav"],
  normalize_whitespace: true
})
```

## Features

- **Smart Content Extraction**: Automatically removes navigation, ads, and other non-content elements

- **HTML5 Support**: Handles modern HTML5 semantic elements

- **Table Conversion**: Converts HTML tables to clean Markdown tables

- **Entity Handling**: Properly decodes HTML entities (`&amp;`, `&lt;`, `&nbsp;`, etc.)

- **Configurable**: Customize filtering and processing behavior

## Configuration Options

```elixir
Html2Markdown.convert(html, %{
  # CSS classes that identify navigation elements to remove
  navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],

  # HTML tags to filter out during conversion
  non_content_tags: ["script", "style", "form", "nav", ...],

  # Markdown flavor (currently :basic, future: :gfm, :commonmark)
  markdown_flavor: :basic,

  # Normalize whitespace (collapses multiple spaces, trims)
  normalize_whitespace: true
})
```
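
When several call sites share the same settings, one option is to keep them in a small wrapper module. The sketch below is illustrative only: `MyApp.Markdown`, `options/1`, and `to_markdown/2` are hypothetical names, and the default values are examples rather than the library's own defaults. The only library call used is `Html2Markdown.convert/2`, as shown above.

```elixir
defmodule MyApp.Markdown do
  # Hypothetical project-wide defaults (example values, not the library's defaults)
  @defaults %{
    navigation_classes: ["nav", "menu", "footer"],
    normalize_whitespace: true
  }

  # Merge per-call overrides on top of the project defaults
  def options(overrides \\ %{}), do: Map.merge(@defaults, overrides)

  # Convert HTML using the merged options
  def to_markdown(html, overrides \\ %{}) do
    Html2Markdown.convert(html, options(overrides))
  end
end
```

A caller can then write `MyApp.Markdown.to_markdown(html, %{normalize_whitespace: false})` to override a single setting.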

## Common Use Cases

### Web Scraping

Extract readable content from web pages:

```elixir
# Req.get!/1 returns the response struct directly (it raises on failure)
%{body: html} = Req.get!(url)
markdown = Html2Markdown.convert(html)
```
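
For a scraper that should keep going when a request fails, the non-raising `Req.get/1` may be a better fit. This is a minimal sketch, assuming the `req` package is also listed in your dependencies; `MyApp.Scraper`, `fetch_markdown/1`, `to_markdown/1`, and the error tuple shapes are hypothetical names chosen for illustration.

```elixir
defmodule MyApp.Scraper do
  # Fetch a page and convert its body to Markdown without raising
  def fetch_markdown(url) do
    url
    |> Req.get()
    |> to_markdown()
  end

  # Successful response with a textual body: convert it
  def to_markdown({:ok, %{status: 200, body: html}}) when is_binary(html) do
    {:ok, Html2Markdown.convert(html)}
  end

  # Any other response (non-200 status or a non-binary body)
  def to_markdown({:ok, %{status: status}}) do
    {:error, {:unexpected_response, status}}
  end

  # Transport-level failure (DNS, timeout, and so on)
  def to_markdown({:error, reason}), do: {:error, reason}
end
```

Splitting the response handling into `to_markdown/1` keeps the pattern matching easy to exercise without making network calls.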

### Content Migration

Convert existing HTML content to Markdown:

```elixir
# Convert blog posts from HTML to Markdown
html_content
|> Html2Markdown.convert(%{normalize_whitespace: true})
|> save_as_markdown()
```
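
For a batch migration, the same call can be applied to every `.html` file in a directory. The following is a sketch under assumptions: `MyApp.Migration` and its functions are hypothetical, and writing each `.md` file next to its source is just one possible layout.

```elixir
defmodule MyApp.Migration do
  # "posts/hello.html" -> "posts/hello.md"
  def markdown_path(html_path), do: Path.rootname(html_path, ".html") <> ".md"

  # Convert one file and write the Markdown next to the source
  def convert_file(html_path) do
    markdown =
      html_path
      |> File.read!()
      |> Html2Markdown.convert(%{normalize_whitespace: true})

    target = markdown_path(html_path)
    File.write!(target, markdown)
    target
  end

  # Convert every .html file under dir (recursively); returns the written paths
  def convert_dir(dir) do
    dir
    |> Path.join("**/*.html")
    |> Path.wildcard()
    |> Enum.map(&convert_file/1)
  end
end
```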

### Email Processing

Clean up HTML emails for plain text storage:

```elixir
email_html
|> Html2Markdown.convert(%{
  non_content_tags: ["style", "script", "meta"],
  navigation_classes: ["unsubscribe", "footer"]
})
```

## Supported Elements

- **Headings**: `<h1>` through `<h6>`

- **Text**: Paragraphs, emphasis (`<em>`, `<i>`), strong (`<strong>`, `<b>`)

- **Lists**: Ordered and unordered lists with nesting

- **Links**: `<a>` tags with proper URL handling

- **Images**: `<img>` and related image elements

- **Code**: Both inline `<code>` and block `<pre>` elements

- **Tables**: Full table support with headers

- **Quotes**: `<blockquote>` and `<q>` elements

- **HTML5**: Modern semantic elements

## Documentation

Full documentation is available at [https://hexdocs.pm/html2markdown](https://hexdocs.pm/html2markdown).

## Development

This project includes comprehensive testing and quality assurance tools:

### Running Tests

```bash
# Run all tests
mix test

# Run tests with coverage
mix coveralls.html
```

### Code Quality

```bash
# Run all quality checks (formatting, security, linting)
mix quality

# Individual checks
mix format --check-formatted  # Code formatting
mix credo --only warning      # Code linting
mix sobelow --config          # Security analysis
```

### CI/CD

This project uses GitHub Actions for continuous integration with:

- Multi-version testing (Elixir 1.15-1.17, OTP 25-27)

- Code quality enforcement

- Security scanning

- Test coverage reporting

## License

MIT License - see [LICENSE](LICENSE) file for details.
