https://github.com/rushter/selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
css html5 modest-engine parser python web-scraping

Last synced: about 1 year ago
Host: GitHub
URL: https://github.com/rushter/selectolax
Owner: rushter
License: mit
Created: 2017-11-26T19:37:37.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2025-02-22T11:40:17.000Z (over 1 year ago)
Last Synced: 2025-04-27T20:07:28.633Z (over 1 year ago)
Topics: css, html5, modest-engine, parser, python, web-scraping
Language: Cython
Homepage:
Size: 387 KB
Stars: 1,253
Watchers: 15
Forks: 73
Open Issues: 23
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- License: LICENSE

Awesome Lists containing this project

awesome-scrapers - selectolax - 30x faster than Beautiful Soup using Lexbor engine. | (🧩 HTML & XML Parsing / Ruby)
best-of-web-python - GitHub - 4% open · ⏱️ 11.10.2025): (HTML Processing)
awesome-python-rs - selectolax - Fast HTML5 parser with CSS selectors, using Rust's html5ever engine. (Web Scraping & HTML)
awesome-web-scraping - Selectolax - fast HTML5 parser using Modest engine | (⛏️ Data Extraction)
awesome - rushter/selectolax - Python binding to Modest and Lexbor engines. Fast HTML5 parser with CSS selectors for Python. (Cython)

README

.. image:: docs/logo.png
  :alt: selectolax logo

-------------------------

.. image:: https://img.shields.io/pypi/v/selectolax.svg
        :target: https://pypi.python.org/pypi/selectolax

A fast HTML5 parser with CSS selectors using the Modest and Lexbor engines.

Installation
------------

From PyPI using pip:

.. code-block:: bash

        pip install selectolax

If installation fails due to compilation errors, you may need to install Cython:

.. code-block:: bash

        pip install selectolax[cython]

This usually happens when you try to install an outdated version of selectolax on a newer version of Python.
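
To confirm which version ended up installed, the standard library can query the package metadata. This is a minimal sketch; the helper name ``installed_version`` is an assumption made for illustration and is not part of selectolax.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    # `installed_version` is a hypothetical helper, not a selectolax API.
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("selectolax")
    print(found if found is not None else "selectolax is not installed")
```

If the call reports that the package is missing, the ``pip install`` commands above are the place to start.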

Development version from GitHub:

.. code-block:: bash

        git clone --recursive https://github.com/rushter/selectolax
        cd selectolax
        pip install -r requirements_dev.txt
        python setup.py install

How to compile selectolax while developing:

.. code-block:: bash

    make clean
    make dev

Basic examples
--------------

Here are some basic examples to get you started with selectolax:

Parsing HTML and extracting text:

.. code:: python

    In [1]: from selectolax.parser import HTMLParser
       ...:
       ...: html = """
       ...: <h1 id="title" data-updated="20201101">Hi there</h1>
       ...: <div class="post">Lorem Ipsum is simply dummy text of the printing and typesetting industry. </div>
       ...: <div class="post">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</div>
       ...: """
       ...: tree = HTMLParser(html)

    In [2]: tree.css_first('h1#title').text()
    Out[2]: 'Hi there'

    In [3]: tree.css_first('h1#title').attributes
    Out[3]: {'id': 'title', 'data-updated': '20201101'}

    In [4]: [node.text() for node in tree.css('.post')]
    Out[4]:
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. ',
     'Lorem ipsum dolor sit amet, consectetur adipiscing elit.']

Using advanced CSS selectors:

.. code:: python

    In [1]: html = "<div><p id=p1><p id=p2><p id=p3><a>link</a><p id=p4><p id=p5>text<p id=p6></div>"
       ...: selector = "div > :nth-child(2n+1):not(:has(a))"

    In [2]: for node in HTMLParser(html).css(selector):
       ...:     print(node.attributes, node.text(), node.tag)
       ...:     print(node.parent.tag)
       ...:     print(node.html)
       ...:
    {'id': 'p1'}  p
    div
    <p id="p1"></p>
    {'id': 'p5'} text p
    div
    <p id="p5">text</p>


* Detailed overview

Available backends
------------------

Selectolax supports two backends: ``Modest`` and ``Lexbor``. By default, all examples use the Modest backend.

The two backends offer almost identical features, but some differences remain.

As of 2024, the preferred backend is ``Lexbor``. The ``Modest`` backend is still available for compatibility reasons, but the underlying C library that selectolax uses is no longer maintained.

To use ``Lexbor``, import ``LexborHTMLParser`` and use it in much the same way as ``HTMLParser``.

.. code:: python

    In [1]: from selectolax.lexbor import LexborHTMLParser

    In [2]: html = """
       ...: <title>Hi there</title>
       ...: <div id="updated">2021-08-15</div>
       ...: """

    In [3]: parser = LexborHTMLParser(html)

    In [4]: parser.root.css_first("#updated").text()
    Out[4]: '2021-08-15'

Simple Benchmark
----------------

* Extract the title, links, scripts and a meta tag from the main pages of the top 754 domains. See ``examples/benchmark.py`` for more information.

============================ ===========
Package                       Time
============================ ===========
Beautiful Soup (html.parser)  61.02 sec.
lxml / Beautiful Soup (lxml)  9.09 sec.
html5_parser                  16.10 sec.
selectolax (Modest)           2.94 sec.
selectolax (Lexbor)           2.39 sec.
============================ ===========
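
The timings are easier to compare as ratios against the fastest entry. The sketch below only does arithmetic on the numbers in the table; it does not rerun the benchmark, and the function name ``slowdown_relative_to`` is an assumption made for illustration.

```python
# Timings in seconds, copied from the benchmark table.
BENCHMARK_SECONDS = {
    "Beautiful Soup (html.parser)": 61.02,
    "lxml / Beautiful Soup (lxml)": 9.09,
    "html5_parser": 16.10,
    "selectolax (Modest)": 2.94,
    "selectolax (Lexbor)": 2.39,
}


def slowdown_relative_to(baseline, timings):
    """Return each package's time divided by the baseline package's time."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


ratios = slowdown_relative_to("selectolax (Lexbor)", BENCHMARK_SECONDS)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<30} {ratio:6.2f}x")
```

On these figures, Beautiful Soup with ``html.parser`` takes roughly 25.5 times as long as selectolax with Lexbor, and the Modest backend about 1.2 times as long.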

Links
-----

*  selectolax API reference
*  Video introduction to web scraping using selectolax
*  How to Scrape 7k Products with Python using selectolax and httpx
*  Detailed overview
*  Modest introduction
*  Modest benchmark
*  Python benchmark
*  Another Python benchmark

License
-------

* Modest engine - LGPL2.1
* selectolax - MIT
