{"id":13490837,"url":"https://github.com/rushter/selectolax","last_synced_at":"2025-05-13T19:16:08.220Z","repository":{"id":37502692,"uuid":"112110982","full_name":"rushter/selectolax","owner":"rushter","description":"Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).","archived":false,"fork":false,"pushed_at":"2025-02-22T11:40:17.000Z","size":396,"stargazers_count":1253,"open_issues_count":23,"forks_count":73,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-04-27T20:07:28.633Z","etag":null,"topics":["css","html5","modest-engine","parser","python","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Cython","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rushter.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGES.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-11-26T19:37:37.000Z","updated_at":"2025-04-26T05:27:39.000Z","dependencies_parsed_at":"2023-11-07T11:23:34.693Z","dependency_job_id":"433b5668-9910-4c2b-8504-7b575fc1cc60","html_url":"https://github.com/rushter/selectolax","commit_stats":{"total_commits":408,"total_committers":19,"mean_commits":"21.473684210526315","dds":0.07107843137254899,"last_synced_commit":"469ba6ed5ec6d564c11f9157ee2f6e08b2c22533"},"previous_names":[],"tags_count":45,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rushter%2Fselectolax","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rushter%2Fselectolax/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/rushter%2Fselectolax/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rushter%2Fselectolax/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rushter","download_url":"https://codeload.github.com/rushter/selectolax/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254010823,"owners_count":21999003,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["css","html5","modest-engine","parser","python","web-scraping"],"created_at":"2024-07-31T19:00:51.392Z","updated_at":"2025-05-13T19:16:08.166Z","avatar_url":"https://github.com/rushter.png","language":"Cython","readme":".. image:: docs/logo.png\n  :alt: selectolax logo\n\n-------------------------\n\n.. image:: https://img.shields.io/pypi/v/selectolax.svg\n        :target: https://pypi.python.org/pypi/selectolax\n\nA fast HTML5 parser with CSS selectors using `Modest \u003chttps://github.com/lexborisov/Modest/\u003e`_ and\n`Lexbor \u003chttps://github.com/lexbor/lexbor\u003e`_ engines.\n\n\nInstallation\n------------\nFrom PyPI using pip:\n\n.. code-block:: bash\n\n        pip install selectolax\n\nIf installation fails due to compilation errors, you may need to install `Cython \u003chttps://github.com/cython/cython\u003e`_:\n\n.. code-block:: bash\n\n        pip install selectolax[cython]\n\nThis usually happens when you try to install an outdated version of selectolax on a newer version of Python.\n\n\nDevelopment version from GitHub:\n\n.. 
code-block:: bash\n\n        git clone --recursive  https://github.com/rushter/selectolax\n        cd selectolax\n        pip install -r requirements_dev.txt\n        python setup.py install\n\nHow to compile selectolax while developing:\n\n.. code-block:: bash\n\n    make clean\n    make dev\n\nBasic examples\n--------------\n\nHere are some basic examples to get you started with selectolax:\n\nParsing HTML and extracting text:\n\n.. code:: python\n\n    In [1]: from selectolax.parser import HTMLParser\n       ...:\n       ...: html = \"\"\"\n       ...: \u003ch1 id=\"title\" data-updated=\"20201101\"\u003eHi there\u003c/h1\u003e\n       ...: \u003cdiv class=\"post\"\u003eLorem Ipsum is simply dummy text of the printing and typesetting industry. \u003c/div\u003e\n       ...: \u003cdiv class=\"post\"\u003eLorem ipsum dolor sit amet, consectetur adipiscing elit.\u003c/div\u003e\n       ...: \"\"\"\n       ...: tree = HTMLParser(html)\n\n    In [2]: tree.css_first('h1#title').text()\n    Out[2]: 'Hi there'\n\n    In [3]: tree.css_first('h1#title').attributes\n    Out[3]: {'id': 'title', 'data-updated': '20201101'}\n\n    In [4]: [node.text() for node in tree.css('.post')]\n    Out[4]:\n    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. ',\n     'Lorem ipsum dolor sit amet, consectetur adipiscing elit.']\n\nUsing advanced CSS selectors:\n\n.. 
code:: python\n\n    In [1]: html = \"\u003cdiv\u003e\u003cp id=p1\u003e\u003cp id=p2\u003e\u003cp id=p3\u003e\u003ca\u003elink\u003c/a\u003e\u003cp id=p4\u003e\u003cp id=p5\u003etext\u003cp id=p6\u003e\u003c/div\u003e\"\n       ...: selector = \"div \u003e :nth-child(2n+1):not(:has(a))\"\n\n    In [2]: for node in HTMLParser(html).css(selector):\n       ...:     print(node.attributes, node.text(), node.tag)\n       ...:     print(node.parent.tag)\n       ...:     print(node.html)\n       ...:\n    {'id': 'p1'}  p\n    div\n    \u003cp id=\"p1\"\u003e\u003c/p\u003e\n    {'id': 'p5'} text p\n    div\n    \u003cp id=\"p5\"\u003etext\u003c/p\u003e\n\n\n* `Detailed overview \u003chttps://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb\u003e`_\n\nAvailable backends\n------------------\n\nSelectolax supports two backends: ``Modest`` and ``Lexbor``. By default, all examples use the Modest backend.\nThe two backends offer nearly identical features, with a few differences.\n\nAs of 2024, the preferred backend is ``Lexbor``. The ``Modest`` backend is still available for compatibility reasons,\nbut its underlying C library is no longer maintained.\n\n\nTo use ``Lexbor``, import its parser and use it the same way as ``HTMLParser``.\n\n.. code:: python\n\n    In [1]: from selectolax.lexbor import LexborHTMLParser\n\n    In [2]: html = \"\"\"\n       ...: \u003ctitle\u003eHi there\u003c/title\u003e\n       ...: \u003cdiv id=\"updated\"\u003e2021-08-15\u003c/div\u003e\n       ...: \"\"\"\n\n    In [3]: parser = LexborHTMLParser(html)\n    In [4]: parser.root.css_first(\"#updated\").text()\n    Out[4]: '2021-08-15'\n\n\nSimple Benchmark\n----------------\n\n* Extract title, links, scripts and a meta tag from main pages of top 754 domains. 
See ``examples/benchmark.py`` for more information.\n\n============================ ===========\nPackage                       Time\n============================ ===========\nBeautiful Soup (html.parser)  61.02 sec.\nlxml / Beautiful Soup (lxml)  9.09 sec.\nhtml5_parser                  16.10 sec.\nselectolax (Modest)           2.94 sec.\nselectolax (Lexbor)           2.39 sec.\n============================ ===========\n\nLinks\n-----\n\n*  `selectolax API reference \u003chttp://selectolax.readthedocs.io/en/latest/parser.html\u003e`_\n*  `Video introduction to web scraping using selectolax \u003chttps://youtu.be/HpRsfpPuUzE\u003e`_\n*  `How to Scrape 7k Products with Python using selectolax and httpx \u003chttps://www.youtube.com/watch?v=XpGvq755J2U\u003e`_\n*  `Detailed overview \u003chttps://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb\u003e`_\n*  `Modest introduction \u003chttps://lexborisov.github.io/Modest/\u003e`_\n*  `Modest benchmark \u003chttp://lexborisov.github.io/benchmark-html-persers/\u003e`_\n*  `Python benchmark \u003chttps://rushter.com/blog/python-fast-html-parser/\u003e`_\n*  `Another Python benchmark \u003chttps://www.peterbe.com/plog/selectolax-or-pyquery\u003e`_\n\nLicense\n-------\n\n* Modest engine — `LGPL2.1 \u003chttps://github.com/lexborisov/Modest/blob/master/LICENSE\u003e`_\n* selectolax - `MIT \u003chttps://github.com/rushter/selectolax/blob/master/LICENSE\u003e`_\n","funding_links":[],"categories":["Cython","🧩 HTML \u0026 XML Parsing","HTML Processing","\u003ca name=\"data-extraction\"\u003e\u003c/a\u003e⛏️ Data Extraction"],"sub_categories":["Ruby"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frushter%2Fselectolax","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frushter%2Fselectolax","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frushter%2Fselectolax/lists"}