https://github.com/mikewolfd/html5ever_normalizer

A limited Python binding for Rust's html5ever library

# html5ever_normalizer

A proof-of-concept Python binding for the Rust html5ever library. It parses HTML input and normalizes it into a complete, well-structured document.

> This package was developed using [Cursor](https://cursor.sh/) and Claude 3.5 Sonnet.

## Features

- Normalizes any HTML input into a complete, valid HTML5 document
- Automatically adds required structure (html, head, body tags)
- Fixes malformed markup and unclosed tags
- Preserves and normalizes DOCTYPE declarations
- Fast HTML5 parsing using Rust's html5ever
- Supports three quirks modes: limited (the default), full quirks, and no-quirks

## Goals
- Fully implement html5ever's interface
- Integrate with lxml

## Installation

### From PyPI (Recommended)
```bash
pip install html5ever-normalizer
```

### From GitHub Releases (Pre-built wheels)
You can download pre-built wheels for your platform from the [GitHub Releases page](https://github.com/mikewolfd/html5ever_normalizer/releases). Wheels are available for:
- Linux (x86_64, aarch64)
- macOS (x86_64 and arm64, compatible with macOS 10.14+)

Pre-built wheels are provided for Python 3.10, 3.11, and 3.12.

### From GitHub Source
```bash
pip install git+https://github.com/mikewolfd/html5ever_normalizer.git
```

#### System Requirements for Source Installation
When installing from source, you'll need:
- Rust toolchain (install from https://rustup.rs)
- Python 3.8 or later
- A C compiler:
  - Linux: GCC (usually pre-installed)
  - macOS: Xcode Command Line Tools
  - Windows: Microsoft Visual Studio Build Tools
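
Before attempting a source install, it can help to check whether these prerequisites are visible from your shell. The following is a minimal sketch, not part of this package: `check_source_build_prereqs` is a hypothetical helper, and the list of compiler executable names is an assumption. It only looks for executables on `PATH`, so treat the result as a heuristic (for example, MSVC's `cl` is often only on `PATH` inside a developer command prompt).

```python
import shutil
import sys


def check_source_build_prereqs(min_python=(3, 8)):
    """Report which source-build prerequisites appear to be available."""
    # Candidate compiler executables; this list is an assumption, not exhaustive.
    compilers = ("cc", "gcc", "clang", "cl")
    return {
        "python": sys.version_info >= min_python,
        "rustc": shutil.which("rustc") is not None,
        "cargo": shutil.which("cargo") is not None,
        "c_compiler": any(shutil.which(name) is not None for name in compilers),
    }


if __name__ == "__main__":
    for name, ok in check_source_build_prereqs().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A `missing` entry does not necessarily mean the build will fail, only that the tool was not found on the current `PATH`.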

### For Development
```bash
# Clone the repository
git clone https://github.com/mikewolfd/html5ever_normalizer.git
cd html5ever_normalizer

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate

# Install development dependencies
pip install -r requirements-dev.txt

# Install the package in editable mode
maturin develop
```
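
Once `maturin develop` finishes, one quick way to confirm that the extension module landed in the active virtual environment is to ask Python whether it can locate it. This is a small sketch using only the standard library; `is_importable` is a hypothetical helper name, and the module name is taken from the import shown in the Usage section.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the named top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "html5ever_normalizer"
    print(f"{name} importable: {is_importable(name)}")
```

If this reports `False`, check that the virtual environment is activated in the same shell where `maturin develop` was run.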

## Usage

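```python
from html5ever_normalizer import parse_html

# Note: the sample inputs below are illustrative, and the exact serialization
# (for example, whitespace) may differ from the outputs sketched in the comments.

# Any input is normalized into a complete HTML document
html = '<div><p>Hello World</p></div>'
result = parse_html(html)
print(result)
# Output:
# <!DOCTYPE html>
# <html><head></head><body><div><p>Hello World</p></div></body></html>

# Malformed HTML is automatically fixed
html = '<div><p>Unclosed div'
result = parse_html(html)
print(result)
# Output:
# <!DOCTYPE html>
# <html><head></head><body><div><p>Unclosed div</p></div></body></html>

# Fragment inputs are properly structured
html = 'Just some text'
result = parse_html(html)
print(result)
# Output:
# <!DOCTYPE html>
# <html><head></head><body>Just some text</body></html>

# DOCTYPE is preserved but normalized
html = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
result = parse_html(html)
print(result)
# Output:
# <!DOCTYPE html>
# <html><head></head><body></body></html>

# Quirks mode can be specified: 'limited' (default), 'quirks', or 'no-quirks'
result = parse_html(html, quirks_mode='quirks')
```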

### HTML Normalization

The library always produces a complete, valid HTML5 document. This means:

1. A normalized DOCTYPE declaration (`<!DOCTYPE html>`)
2. Required structural elements:
   - `<html>` root element
   - `<head>` section (even if empty)
   - `<body>` section
3. Proper nesting and closing of all tags
4. Handling of HTML fragments by placing them in the appropriate context
5. Consistent output structure regardless of input format
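
The structural guarantees listed above can be checked independently of this package with the standard library's `html.parser`. The sketch below is an illustration only: `StructureCollector` and `has_required_structure` are hypothetical names, and the sample document is a hand-written string standing in for real `parse_html` output.

```python
from html.parser import HTMLParser


class StructureCollector(HTMLParser):
    """Record whether a DOCTYPE was seen and which start tags appear, in order."""

    def __init__(self):
        super().__init__()
        self.has_doctype = False
        self.start_tags = []

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>.
        if decl.lower().startswith("doctype"):
            self.has_doctype = True

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.start_tags.append(tag)


def has_required_structure(document):
    """Check for a DOCTYPE plus html, head, and body start tags in that order."""
    collector = StructureCollector()
    collector.feed(document)
    collector.close()
    tags = collector.start_tags
    if not collector.has_doctype:
        return False
    if not all(name in tags for name in ("html", "head", "body")):
        return False
    return tags.index("html") < tags.index("head") < tags.index("body")


if __name__ == "__main__":
    # Hand-written sample standing in for normalized output.
    sample = "<!DOCTYPE html><html><head></head><body><p>Hello World</p></body></html>"
    print(has_required_structure(sample))
```

In a test suite, the same function could be applied to `parse_html(...)` results for a range of fragment and malformed inputs.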

### Quirks Mode

The `parse_html` function accepts a `quirks_mode` parameter that can be one of:
- `'limited'` (default): Limited quirks mode for modern compatibility
- `'quirks'`: Full quirks mode for legacy compatibility
- `'no-quirks'`: Standard HTML5 parsing
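
How `parse_html` reacts to a value outside these three options is not documented here, so callers who accept the mode from configuration or user input may want to validate it first. The wrapper below is a sketch under that assumption: `normalize`, `QUIRKS_MODES`, and the `parser` injection parameter are hypothetical names introduced for illustration, not part of the package's API.

```python
QUIRKS_MODES = ("limited", "quirks", "no-quirks")  # the values listed in this README


def normalize(html, quirks_mode="limited", parser=None):
    """Validate quirks_mode, then delegate to parse_html (or an injected parser)."""
    if quirks_mode not in QUIRKS_MODES:
        raise ValueError(
            f"quirks_mode must be one of {QUIRKS_MODES}, got {quirks_mode!r}"
        )
    if parser is None:
        # Imported lazily so the validation above works without the package installed.
        from html5ever_normalizer import parse_html as parser
    return parser(html, quirks_mode=quirks_mode)
```

Passing a stand-in callable as `parser` makes the wrapper easy to exercise in tests without building the Rust extension.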

## Requirements

- Python 3.8 or later
- Rust toolchain (for building from source)

## License

MIT License. See [LICENSE](LICENSE) for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.