{"id":25227700,"url":"https://github.com/mikewolfd/html5ever_normalizer","last_synced_at":"2025-04-05T13:15:46.114Z","repository":{"id":275821378,"uuid":"927291195","full_name":"mikewolfd/html5ever_normalizer","owner":"mikewolfd","description":"A limited python binding for Rust's html5ever library","archived":false,"fork":false,"pushed_at":"2025-02-04T19:06:20.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-04T19:34:19.953Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mikewolfd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-04T18:07:17.000Z","updated_at":"2025-02-04T19:06:24.000Z","dependencies_parsed_at":"2025-02-04T19:34:53.225Z","dependency_job_id":"54b44f5c-fa84-4dfb-8710-fbb8e8ce43eb","html_url":"https://github.com/mikewolfd/html5ever_normalizer","commit_stats":null,"previous_names":["mikewolfd/html5ever_normalizer"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikewolfd%2Fhtml5ever_normalizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikewolfd%2Fhtml5ever_normalizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikewolfd%2Fhtml5ever_normalizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikewolfd%2Fhtml5ever_normalizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mikewolfd","download_url":"https://codeload.github.com/mikewolfd/html5ever_normalizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247339157,"owners_count":20923014,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-11T09:08:46.898Z","updated_at":"2025-04-05T13:15:46.089Z","avatar_url":"https://github.com/mikewolfd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# html5ever_normalizer\n\nA proof of concept Python binding for the Rust html5ever library that normalizes and validates HTML into a complete, well-structured document.\n\n\u003e This package was developed using [Cursor](https://cursor.sh/) and Claude 3.5 Sonnet.\n\n## Features\n\n- Normalizes any HTML input into a complete, valid HTML5 document\n- Automatically adds required structure (html, head, body tags)\n- Fixes malformed markup and unclosed tags\n- Preserves and normalizes DOCTYPE declarations\n- Fast HTML5 parsing using Rust's html5ever\n- Support for different quirks modes (limited by default, full, or no-quirks)\n\n## Goals\n- Fully implement html5ever's interface\n- Integrate with lxml\n  \n## Installation\n\n### From PyPI (Recommended)\n```bash\npip install html5ever-normalizer\n```\n\n### From GitHub Releases (Pre-built wheels)\nYou can download pre-built wheels for your platform from the [GitHub Releases page](https://github.com/yourusername/html5ever_normalizer/releases). These wheels are available for:\n- Linux (x86_64, aarch64)\n- macOS (x86_64 and arm64, compatible with macOS 10.14+)\n\nPython versions 3.10, 3.11, and 3.12 are supported.\n\n### From GitHub Source\n```bash\npip install git+https://github.com/yourusername/html5ever_normalizer.git\n```\n\n#### System Requirements for Source Installation\nWhen installing from source, you'll need:\n- Rust toolchain (install from https://rustup.rs)\n- Python 3.8 or later\n- A C compiler:\n  - Linux: GCC (usually pre-installed)\n  - macOS: Xcode Command Line Tools\n  - Windows: Microsoft Visual Studio Build Tools\n\n### For Development\n```bash\n# Clone the repository\ngit clone https://github.com/yourusername/html5ever_normalizer.git\ncd html5ever_normalizer\n\n# Create and activate a virtual environment\npython -m venv .venv\nsource .venv/bin/activate  # On Windows: .venv\\Scripts\\activate\n\n# Install development dependencies\npip install -r requirements-dev.txt\n\n# Install the package in editable mode\nmaturin develop\n```\n\n## Usage\n\n```python\nfrom html5ever_normalizer import parse_html\n\n# Any input is normalized into a complete HTML document\nhtml = '\u003cp\u003eHello World\u003c/p\u003e'\nresult = parse_html(html)\nprint(result)\n# Output:\n# \u003c!DOCTYPE html\u003e\n# \u003chtml\u003e\u003chead\u003e\u003c/head\u003e\u003cbody\u003e\u003cp\u003eHello World\u003c/p\u003e\u003c/body\u003e\u003c/html\u003e\n\n# Malformed HTML is automatically fixed\nhtml = '\u003cdiv\u003eUnclosed div'\nresult = parse_html(html)\nprint(result)\n# Output:\n# \u003c!DOCTYPE html\u003e\n# \u003chtml\u003e\u003chead\u003e\u003c/head\u003e\u003cbody\u003e\u003cdiv\u003eUnclosed div\u003c/div\u003e\u003c/body\u003e\u003c/html\u003e\n\n# Fragment inputs are properly structured\nhtml = 'Just some text'\nresult = parse_html(html)\nprint(result)\n# Output:\n# \u003c!DOCTYPE html\u003e\n# \u003chtml\u003e\u003chead\u003e\u003c/head\u003e\u003cbody\u003eJust some text\u003c/body\u003e\u003c/html\u003e\n\n# DOCTYPE is preserved but normalized\nhtml = '\u003c!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\"\u003e'\nresult = parse_html(html)\nprint(result)\n# Output:\n# \u003c!DOCTYPE html\u003e\n# \u003chtml\u003e\u003chead\u003e\u003c/head\u003e\u003cbody\u003e\u003c/body\u003e\u003c/html\u003e\n\n# Quirks mode can be specified\nresult = parse_html(html, quirks_mode='quirks')  # 'limited' (default), 'quirks', or 'no-quirks'\n```\n\n### HTML Normalization\n\nThe library always produces a complete, valid HTML5 document. This means:\n\n1. A normalized DOCTYPE declaration (`\u003c!DOCTYPE html\u003e`)\n2. Required structural elements:\n   - `\u003chtml\u003e` root element\n   - `\u003chead\u003e` section (even if empty)\n   - `\u003cbody\u003e` section\n3. Proper nesting and closing of all tags\n4. Handling of HTML fragments by placing them in the appropriate context\n5. Consistent output structure regardless of input format\n\n### Quirks Mode\n\nThe `parse_html` function accepts a `quirks_mode` parameter that can be one of:\n- `'limited'` (default): Limited quirks mode for modern compatibility\n- `'quirks'`: Full quirks mode for legacy compatibility\n- `'no-quirks'`: Standard HTML5 parsing\n\n## Requirements\n\n- Python 3.8 or later\n- Rust toolchain (for building from source)\n\n## License\n\nMIT License. See [LICENSE](LICENSE) for details.\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmikewolfd%2Fhtml5ever_normalizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmikewolfd%2Fhtml5ever_normalizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmikewolfd%2Fhtml5ever_normalizer/lists"}