https://github.com/goldziher/html-to-markdown

HTML to markdown converter
html-converter markdown-converter rag text-extraction text-processing


Host: GitHub
URL: https://github.com/goldziher/html-to-markdown
Owner: Goldziher
License: mit
Created: 2025-02-03T16:18:12.000Z (5 months ago)
Default Branch: main
Last Pushed: 2025-04-19T08:50:07.000Z (2 months ago)
Last Synced: 2025-04-19T15:03:49.415Z (2 months ago)
Topics: html-converter, markdown-converter, rag, text-extraction, text-processing
Language: Python
Homepage:
Size: 383 KB
Stars: 30
Watchers: 1
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# html-to-markdown

A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for Python 3.9+.

## Features

- Full type safety with strict MyPy adherence

- Functional API design

- Extensive test coverage

- Configurable conversion options

- CLI tool for easy conversions

- Support for pre-configured BeautifulSoup instances

- Strict semver versioning

## Installation

```shell
pip install html-to-markdown
```

## Quick Start

Convert HTML to Markdown with a single function call:

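```python
from html_to_markdown import convert_to_markdown

# Sample HTML: a heading, a paragraph with bold text and a link, and a list
html = """
<article>
    <h1>Welcome</h1>
    <p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</article>
"""

markdown = convert_to_markdown(html)
print(markdown)
```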

Output:

```markdown
# Welcome

This is a **sample** with a [link](https://example.com).

* Item 1
* Item 2
```

### Working with BeautifulSoup

If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:

```python
from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown

# `html` is an HTML string, such as the one from the Quick Start example.
# Configure BeautifulSoup with your preferred parser
soup = BeautifulSoup(html, "lxml")  # Note: lxml requires additional installation
markdown = convert_to_markdown(soup)
```

## Advanced Usage

### Customizing Conversion Options

The library offers extensive customization through various options:

```python
from html_to_markdown import convert_to_markdown

html = "<div>Your content here...</div>"

markdown = convert_to_markdown(
    html,
    heading_style="atx",  # Use # style headers
    strong_em_symbol="*",  # Use * for bold/italic
    bullets="*+-",  # Define bullet point characters
    wrap=True,  # Enable text wrapping
    wrap_width=100,  # Set wrap width
    escape_asterisks=True,  # Escape * characters
    code_language="python",  # Default code block language
)
```

### Custom Converters

You can provide your own conversion functions for specific HTML tags:

```python
from bs4.element import Tag
from html_to_markdown import convert_to_markdown


# Define a custom converter for the <b> tag
def custom_bold_converter(*, tag: Tag, text: str, **kwargs) -> str:
    return f"IMPORTANT: {text}"


html = "<p>This is a <b>bold</b> statement.</p>"

markdown = convert_to_markdown(html, custom_converters={"b": custom_bold_converter})
print(markdown)
# Output: This is a IMPORTANT: bold statement.
```

Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
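As a further illustration, here is a minimal sketch of a converter for `<a>` tags that follows the keyword-only signature shown above. The function name `plain_link_converter` and the `text <url>` output format are hypothetical choices made for this example, not part of the library. The sketch assumes only that the `tag` argument exposes a `.get()` method for attributes, as BeautifulSoup tags do.

```python
from typing import Any


def plain_link_converter(*, tag: Any, text: str, **kwargs: Any) -> str:
    """Render a link as 'label <url>' instead of '[label](url)'.

    Hypothetical example converter. `tag` is expected to offer a
    dict-like .get() for attributes, as bs4 Tag objects do.
    """
    label = text.strip()
    href = tag.get("href") if tag is not None else None
    if not href:
        # No target URL: fall back to the bare link text.
        return label
    return f"{label} <{href}>"


if __name__ == "__main__":
    # A plain dict stands in for a Tag here, since both support .get().
    print(plain_link_converter(tag={"href": "https://example.com"}, text="link"))
```

Such a function would then be registered with `custom_converters={"a": plain_link_converter}`. Because a converter is an ordinary function, it can be unit tested on its own before being handed to `convert_to_markdown`.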

### Configuration Options

| Option               | Type | Default        | Description                                            |
| -------------------- | ---- | -------------- | ------------------------------------------------------ |
| `autolinks`          | bool | `True`         | Auto-convert URLs to Markdown links                    |
| `bullets`            | str  | `'*+-'`        | Characters to use for bullet points                    |
| `code_language`      | str  | `''`           | Default language for code blocks                       |
| `heading_style`      | str  | `'underlined'` | Header style (`'underlined'`, `'atx'`, `'atx_closed'`) |
| `escape_asterisks`   | bool | `True`         | Escape * characters                                    |
| `escape_underscores` | bool | `True`         | Escape _ characters                                    |
| `wrap`               | bool | `False`        | Enable text wrapping                                   |
| `wrap_width`         | int  | `80`           | Text wrap width                                        |

For a complete list of options, see the [Configuration](#configuration) section below.
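When the same options are reused across many calls, one possible approach is to keep them in a dictionary and unpack it as keyword arguments. The sketch below uses only option names from the table above. `HOUSE_STYLE`, `merged_options`, and `convert_with_house_style` are hypothetical names introduced for illustration.

```python
from typing import Any

# Shared defaults for a project; the keys are option names from the table above.
HOUSE_STYLE: dict[str, Any] = {
    "heading_style": "atx",
    "bullets": "*+-",
    "wrap": True,
    "wrap_width": 100,
    "escape_underscores": True,
}


def merged_options(**overrides: Any) -> dict[str, Any]:
    """Return a copy of HOUSE_STYLE with per-call overrides applied."""
    return {**HOUSE_STYLE, **overrides}


def convert_with_house_style(html: str, **overrides: Any) -> str:
    """Convert HTML using the shared defaults plus any overrides."""
    # Imported here so the helpers above stay usable without the package.
    from html_to_markdown import convert_to_markdown

    return convert_to_markdown(html, **merged_options(**overrides))
```

A call such as `convert_with_house_style(html, wrap_width=120)` would then change the wrap width for that call only and leave the shared defaults untouched.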

## CLI Usage

Convert HTML files directly from the command line:

```shell
# Convert a file
html_to_markdown input.html > output.md

# Process stdin
cat input.html | html_to_markdown > output.md

# Use custom options
html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md
```

View all available options:

```shell
html_to_markdown --help
```
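The CLI can also be driven from a script. The following is a minimal sketch that assumes the `html_to_markdown` executable is on `PATH` and uses only the flags shown in the examples above. The helper names `build_command` and `convert_file` are hypothetical.

```python
import subprocess
from typing import Optional


def build_command(
    input_path: str,
    heading_style: Optional[str] = None,
    wrap_width: Optional[int] = None,
) -> list[str]:
    """Assemble the argument list for the html_to_markdown CLI."""
    cmd = ["html_to_markdown"]
    if heading_style is not None:
        cmd += ["--heading-style", heading_style]
    if wrap_width is not None:
        # Wrapping is enabled together with an explicit width.
        cmd += ["--wrap", "--wrap-width", str(wrap_width)]
    cmd.append(input_path)
    return cmd


def convert_file(
    input_path: str,
    heading_style: Optional[str] = None,
    wrap_width: Optional[int] = None,
) -> str:
    """Run the CLI on one file and return the Markdown written to stdout."""
    result = subprocess.run(
        build_command(input_path, heading_style, wrap_width),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.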

## Migration from Markdownify

For existing projects using Markdownify, a compatibility layer is provided:

```python
# Old code
from markdownify import markdownify as md

# New code - works the same way
from html_to_markdown import markdownify as md
```

The `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.

## Configuration

Full list of configuration options:

- `autolinks`: Convert valid URLs to Markdown links automatically

- `bullets`: Characters to use for bullet points in lists

- `code_language`: Default language for fenced code blocks

- `code_language_callback`: Function to determine code block language

- `convert`: List of HTML tags to convert (None = all supported tags)

- `default_title`: Use default titles for elements like links

- `escape_asterisks`: Escape * characters

- `escape_misc`: Escape miscellaneous Markdown characters

- `escape_underscores`: Escape _ characters

- `heading_style`: Header style (underlined/atx/atx_closed)

- `keep_inline_images_in`: Tags where inline images should be kept

- `newline_style`: Style for handling newlines (spaces/backslash)

- `strip`: Tags to remove from output

- `strong_em_symbol`: Symbol for strong/emphasized text (\* or \_)

- `sub_symbol`: Symbol for subscript text

- `sup_symbol`: Symbol for superscript text

- `wrap`: Enable text wrapping

- `wrap_width`: Width for text wrapping

- `convert_as_inline`: Treat content as inline elements

- `custom_converters`: A mapping of HTML tag names to custom converter functions
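To illustrate `code_language_callback`, here is a sketch of a function that derives a language name from a `language-*` CSS class. It assumes the callback receives the code block's element as its only argument and returns a string, which is how the option behaves in markdownify. That call shape is an assumption here and should be checked against the library version in use. The name `language_from_class` is hypothetical.

```python
from typing import Any


def language_from_class(element: Any) -> str:
    """Return 'python' for class="language-python", or '' if no hint is found.

    Assumption: `element` offers a dict-like .get() for attributes,
    as bs4 Tag objects do.
    """
    classes = element.get("class") or []
    if isinstance(classes, str):
        # A single attribute string may hold several space-separated classes.
        classes = classes.split()
    prefix = "language-"
    for name in classes:
        if name.startswith(prefix):
            return name[len(prefix):]
    return ""


if __name__ == "__main__":
    # A plain dict stands in for a parsed element in this demonstration.
    print(language_from_class({"class": ["highlight", "language-python"]}))
```

Under that assumption, it would be passed as `code_language_callback=language_from_class`.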

## Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.

### Local Development

1. Clone the repo

1. Install the system dependencies

1. Install the full dependencies with `uv sync`

1. Install the pre-commit hooks with:

    ```shell
    pre-commit install && pre-commit install --hook-type commit-msg
    ```

1. Make your changes and submit a PR

## License

This library uses the MIT license.

## Acknowledgments

Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.
