https://github.com/rushter/selectolax
Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
- Host: GitHub
- URL: https://github.com/rushter/selectolax
- Owner: rushter
- License: mit
- Created: 2017-11-26T19:37:37.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2025-02-22T11:40:17.000Z (8 months ago)
- Last Synced: 2025-04-27T20:07:28.633Z (6 months ago)
- Topics: css, html5, modest-engine, parser, python, web-scraping
- Language: Cython
- Homepage:
- Size: 387 KB
- Stars: 1,253
- Watchers: 15
- Forks: 73
- Open Issues: 23
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- License: LICENSE
README
.. image:: docs/logo.png
  :alt: selectolax logo

-------------------------

.. image:: https://img.shields.io/pypi/v/selectolax.svg
  :target: https://pypi.python.org/pypi/selectolax

A fast HTML5 parser with CSS selectors using the Modest and Lexbor engines.

Installation
------------
From PyPI using pip:
.. code-block:: bash

    pip install selectolax

If installation fails due to compilation errors, you may need to install Cython:

.. code-block:: bash

    pip install selectolax[cython]

This usually happens when you try to install an outdated version of selectolax on a newer version of Python.
Development version from GitHub:
.. code-block:: bash

    git clone --recursive https://github.com/rushter/selectolax
    cd selectolax
    pip install -r requirements_dev.txt
    python setup.py install

How to compile selectolax while developing:
.. code-block:: bash

    make clean
    make dev

Basic examples
--------------
Here are some basic examples to get you started with selectolax:
Parsing HTML and extracting text:
.. code:: python

    In [1]: from selectolax.parser import HTMLParser
       ...:
       ...: html = """
       ...: <h1 id="title" data-updated="20201101">Hi there</h1>
       ...: <div class="post">Lorem Ipsum is simply dummy text of the printing and typesetting industry. </div>
       ...: <div class="post">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</div>
       ...: """
       ...: tree = HTMLParser(html)

    In [2]: tree.css_first('h1#title').text()
    Out[2]: 'Hi there'

    In [3]: tree.css_first('h1#title').attributes
    Out[3]: {'id': 'title', 'data-updated': '20201101'}

    In [4]: [node.text() for node in tree.css('.post')]
    Out[4]:
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. ',
     'Lorem ipsum dolor sit amet, consectetur adipiscing elit.']

Using advanced CSS selectors:
.. code:: python

    In [1]: html = "<div><p id=p1><p id=p2><p id=p3><a>link</a><p id=p4><p id=p5>text<p id=p6></div>"
       ...: selector = "div > :nth-child(2n+1):not(:has(a))"

    In [2]: for node in HTMLParser(html).css(selector):
       ...:     print(node.attributes, node.text(), node.tag)
       ...:     print(node.parent.tag)
       ...:     print(node.html)
       ...:
    {'id': 'p1'}  p
    div
    <p id="p1"></p>
    {'id': 'p5'} text p
    div
    <p id="p5">text</p>

* Detailed overview

Available backends
------------------
Selectolax supports two backends: ``Modest`` and ``Lexbor``. By default, all examples use the Modest backend.
The two backends offer nearly identical features, with some differences.

As of 2024, the preferred backend is ``Lexbor``. The ``Modest`` backend remains available for compatibility,
but the underlying C library that selectolax uses is no longer maintained.

To use ``Lexbor``, import ``LexborHTMLParser`` and use it the same way as ``HTMLParser``.

.. code:: python

    In [1]: from selectolax.lexbor import LexborHTMLParser

    In [2]: html = """
       ...: <title>Hi there</title>
       ...: <div id="updated">2021-08-15</div>
       ...: """

    In [3]: parser = LexborHTMLParser(html)

    In [4]: parser.root.css_first("#updated").text()
    Out[4]: '2021-08-15'

Simple Benchmark
----------------
* Extract title, links, scripts and a meta tag from main pages of top 754 domains. See ``examples/benchmark.py`` for more information.
============================ ===========
Package                      Time
============================ ===========
Beautiful Soup (html.parser) 61.02 sec.
lxml / Beautiful Soup (lxml) 9.09 sec.
html5_parser                 16.10 sec.
selectolax (Modest)          2.94 sec.
selectolax (Lexbor)          2.39 sec.
============================ ===========

Links
-----
* selectolax API reference
* Video introduction to web scraping using selectolax
* How to Scrape 7k Products with Python using selectolax and httpx
* Detailed overview
* Modest introduction
* Modest benchmark
* Python benchmark
* Another Python benchmark

License
-------
* Modest engine: LGPL 2.1
* selectolax: MIT