https://github.com/html5lib/html5lib-python
Standards-compliant library for parsing and serializing HTML documents and fragments in Python
- Host: GitHub
- URL: https://github.com/html5lib/html5lib-python
- Owner: html5lib
- License: mit
- Created: 2013-04-09T14:07:42.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2024-02-27T19:49:36.000Z (11 months ago)
- Last Synced: 2025-01-07T21:02:59.212Z (7 days ago)
- Language: Python
- Homepage:
- Size: 6.54 MB
- Stars: 1,146
- Watchers: 50
- Forks: 286
- Open Issues: 88
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- Contributing: CONTRIBUTING.rst
- License: LICENSE
- Authors: AUTHORS.rst
Awesome Lists containing this project
- awesome-python-resources - GitHub - 34% open · ⏱️ 17.09.2021 (HTML Processing)
- best-of-web-python - GitHub - 31% open · ⏱️ 21.02.2024 (HTML Processing)
README

html5lib
========

.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
   :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml

html5lib is a pure-python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as is implemented by all major
web browsers.

Usage
-----

Simple usage follows this pattern:


.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("Hello World!")

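
As a rough sketch of what the returned tree looks like, the snippet below
parses a short string and inspects the result with the standard
``xml.etree.ElementTree`` API. It assumes the default treebuilder and the
default ``namespaceHTMLElements=True`` setting, under which element tags
carry the XHTML namespace. The sample markup and the ``XHTML`` helper name
are illustrative only.

.. code-block:: python

  import html5lib

  # Namespace prefix carried by element tags (assumes default settings).
  XHTML = "{http://www.w3.org/1999/xhtml}"

  document = html5lib.parse("<!DOCTYPE html><title>Hi</title><p>Hello World!")

  # The root is the <html> element; <head> and <body> are created even
  # though the input never spelled them out.
  root_tag = document.tag
  child_tags = [child.tag for child in document]

  # Locate the paragraph with a namespaced ElementTree path.
  paragraph = document.find(".//" + XHTML + "p")
  paragraph_text = paragraph.text

  print(root_tag)
  print(child_tags)
  print(paragraph_text)

Forgetting the namespace prefix in ``find`` is a common reason for getting
``None`` back from an otherwise correct path.
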

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using with ``urllib2`` (Python 2), the charset from HTTP should be
passed into html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using with ``urllib.request`` (Python 3), the charset from HTTP
should be passed into html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())

To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)

When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("Hello World!")

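
The ``transport_encoding`` examples in this section need a network
connection. As an offline sketch of the same idea, the snippet below builds
a header object by hand with the standard library's
``email.message.Message`` (the base class of the object that
``urlopen(...).info()`` returns on Python 3), reads the charset from it, and
passes that charset to ``html5lib.parse`` together with a byte string. The
header value and the sample bytes are made up for illustration.

.. code-block:: python

  from email.message import Message

  import html5lib

  # Stand-in for the headers of an HTTP response (illustrative values).
  headers = Message()
  headers["Content-Type"] = "text/html; charset=UTF-8"
  charset = headers.get_content_charset()  # lower-cased charset parameter

  # UTF-8 encoded bytes, as they might arrive from the network.
  body = "<!DOCTYPE html><p>caf\u00e9</p>".encode("utf-8")

  # Tell the parser which encoding the transport layer declared.
  document = html5lib.parse(body, transport_encoding=charset)
  paragraph = document.find(".//{http://www.w3.org/1999/xhtml}p")
  text = paragraph.text

  print(charset, repr(text))
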
More documentation is available at https://html5lib.readthedocs.io/.
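
To illustrate the ``strict=True`` option described earlier in this section,
here is a small sketch contrasting the two modes on input that lacks a
doctype, which the parser reports as a parse error. It assumes the exception
class is importable as ``html5lib.html5parser.ParseError`` and that a
non-strict parser records problems in its ``errors`` list. The sample input
is illustrative.

.. code-block:: python

  import html5lib
  from html5lib.html5parser import ParseError

  SAMPLE = "<p>Hello World!"  # no doctype, so a parse error is reported

  # Non-strict (default): the document is still built, errors are collected.
  lenient = html5lib.HTMLParser()
  lenient.parse(SAMPLE)
  error_count = len(lenient.errors)

  # Strict: the first parse error is raised as an exception instead.
  strict = html5lib.HTMLParser(strict=True)
  try:
      strict.parse(SAMPLE)
      raised = False
  except ParseError:
      raised = True

  print(error_count, raised)
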

Installation
------------

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

.. code-block:: bash

  $ pip install html5lib

The goal is to support a (non-strict) superset of the versions that pip
supports.

Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not builder); and

- ``chardet`` can be used as a fallback when character encoding cannot
  be determined.

Bugs
----

Please report any bugs on the issue tracker.

Tests
-----

Unit tests require the ``pytest`` and ``mock`` libraries and can be
run using the ``pytest`` command in the root directory.

Test data are contained in a separate html5lib-tests repository and included
as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.

Questions?
----------

Check out `the docs <https://html5lib.readthedocs.io/>`_. Still
need help? Go to our GitHub Discussions.

You can also browse the archives of the html5lib-discuss mailing list.