https://github.com/html5lib/html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python

Host: GitHub
URL: https://github.com/html5lib/html5lib-python
Owner: html5lib
License: mit
Created: 2013-04-09T14:07:42.000Z (over 13 years ago)
Default Branch: master
Last Pushed: 2024-02-27T19:49:36.000Z (over 2 years ago)
Last Synced: 2025-04-28T11:52:30.025Z (about 1 year ago)
Language: Python
Homepage:
Size: 6.54 MB
Stars: 1,189
Watchers: 50
Forks: 294
Open Issues: 90
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- Contributing: CONTRIBUTING.rst
- License: LICENSE
- Authors: AUTHORS.rst

Awesome Lists containing this project

fucking-awesome-python-cn - html5lib
awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python-zh - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
best-of-web-python - GitHub - 31% open · ⏱️ 21.02.2024): (HTML Processing)
python-awesome - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python-resources - GitHub - 34% open · ⏱️ 17.09.2021): (HTML Processing)
awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python - html5lib - Standards-compliant library for parsing and serializing HTML documents and fragments in Python ` 📝 a month ago ` (HTML Manipulation [🔝](#readme))
awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
fucking-awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
fucking-awesome-python - :octocat: html5lib - :star: 1041 :fork_and_knife: 279 - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python-cn - html5lib
Python-Awesome - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
awesome-python - html5lib-python - Standards-compliant library for parsing and serializing HTML documents and fragments in Python (Awesome Python / HTML Manipulation)
awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
git-github.com-vinta-awesome-python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)
fucking_awesome_python - html5lib - A standards-compliant library for parsing and serializing HTML documents and fragments. (HTML Manipulation)

README

html5lib
========

.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
    :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml

html5lib is a pure-python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as is implemented by all major
web browsers.

Usage
-----

Simple usage follows this pattern:

.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib

  document = html5lib.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
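Because the default tree is an ``xml.etree`` element, the standard ElementTree API can be used to query it. The sketch below assumes html5lib's default behaviour of placing HTML elements in the XHTML namespace, so tag names have to be namespace-qualified. The names ``XHTML_NS``, ``qualified`` and ``paragraph_texts`` are illustrative helpers, not part of html5lib.

```python
# Assumption: html5lib puts HTML elements in the XHTML namespace by default.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def qualified(tag):
    """Return the namespace-qualified ElementTree name for an HTML tag."""
    return "{%s}%s" % (XHTML_NS, tag)


def paragraph_texts(html):
    """Parse ``html`` and return the text content of every <p> element."""
    import html5lib  # imported lazily; requires html5lib to be installed

    document = html5lib.parse(html)  # default "etree" treebuilder
    return ["".join(p.itertext()) for p in document.iter(qualified("p"))]
```

If unqualified tag names are preferred, passing ``namespaceHTMLElements=False`` to ``html5lib.parse`` is an alternative to qualifying each name.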

When using with ``urllib2`` (Python 2), the charset from HTTP should be
passed into html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen

  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using with ``urllib.request`` (Python 3), the charset from HTTP
should be passed into html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen

  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
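When the response comes from an HTTP client other than ``urllib``, only the raw ``Content-Type`` header value may be at hand. A minimal sketch, assuming the header is available as a string: it uses the standard library's ``email.message.Message`` to extract the charset and then passes it on as ``transport_encoding``. ``charset_from_content_type`` and ``parse_response`` are hypothetical helper names, not html5lib APIs.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()


def parse_response(body, content_type):
    """Parse response bytes, passing the HTTP charset to html5lib as a hint."""
    import html5lib  # imported lazily; requires html5lib to be installed

    encoding = charset_from_content_type(content_type)
    # encoding may be None, in which case html5lib detects the encoding itself.
    return html5lib.parse(body, transport_encoding=encoding)
```

Keeping the header parsing separate from the call to ``html5lib.parse`` makes the charset logic easy to check on its own.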

To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)
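In strict mode the first parse error raises an exception. Without ``strict=True`` the parser records errors and keeps going. The sketch below rests on two assumptions taken from the html5lib source rather than from this README: that the exception raised is ``html5lib.html5parser.ParseError``, and that ``parser.errors`` holds ``((line, column), code, data)`` tuples. The helper names are illustrative.

```python
def describe_errors(errors):
    """Format ((line, column), code, data) tuples as 'line:column code'."""
    return ["%d:%d %s" % (pos[0], pos[1], code) for pos, code, _data in errors]


def collect_parse_errors(html):
    """Parse leniently and return a description of each recorded error."""
    import html5lib  # imported lazily; requires html5lib to be installed

    parser = html5lib.HTMLParser()  # strict is off unless requested
    parser.parse(html)
    # Assumption: parser.errors is a list of ((line, column), code, data).
    return describe_errors(parser.errors)


def is_strictly_valid(html):
    """Return True if strict parsing finishes without raising ParseError."""
    import html5lib
    from html5lib.html5parser import ParseError  # assumed exception type

    try:
        html5lib.HTMLParser(strict=True).parse(html)
    except ParseError:
        return False
    return True
```

Lenient collection suits reporting every problem in a document, while the strict check suits a simple pass/fail gate.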

When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

  import html5lib

  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at https://html5lib.readthedocs.io/.

Installation
------------

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

.. code-block:: bash

    $ pip install html5lib

The goal is to support a (non-strict) superset of the versions that pip
supports.

Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not builder); and

- ``chardet`` can be used as a fallback when character encoding cannot
  be determined.
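Since ``lxml`` is optional and is not supported under PyPy, code that prefers it may want to fall back to the default tree type. A minimal sketch using only the standard library to make that choice; ``pick_treebuilder`` is a hypothetical helper, not part of html5lib.

```python
import importlib.util
import platform


def pick_treebuilder():
    """Return "lxml" when it is importable on CPython, otherwise "etree"."""
    on_cpython = platform.python_implementation() == "CPython"
    if on_cpython and importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "etree"
```

The result can be passed straight through, for example ``html5lib.parse(f, treebuilder=pick_treebuilder())``. Callers should keep in mind that the returned document type differs between the two treebuilders.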

Bugs
----

Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.

Tests
-----

Unit tests require the ``pytest`` and ``mock`` libraries and can be
run using the ``pytest`` command in the root directory.

Test data are contained in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository and included
as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.

Questions?
----------

Check out `the docs <https://html5lib.readthedocs.io/>`_. Still
need help? Go to our `GitHub Discussions
<https://github.com/html5lib/html5lib-python/discussions>`_.

You can also browse the archives of the html5lib-discuss mailing list.
