{"id":20710083,"url":"https://github.com/oxylabs/lxml-tutorial","last_synced_at":"2026-03-06T12:03:04.468Z","repository":{"id":52110485,"uuid":"469649060","full_name":"oxylabs/lxml-tutorial","owner":"oxylabs","description":"A tutorial on parsing webpages with lxml ","archived":false,"fork":false,"pushed_at":"2025-09-25T07:52:52.000Z","size":28,"stargazers_count":5,"open_issues_count":0,"forks_count":4,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-25T09:28:26.251Z","etag":null,"topics":["lxml","parser","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-03-14T08:47:41.000Z","updated_at":"2025-09-25T07:52:56.000Z","dependencies_parsed_at":"2025-04-23T04:48:05.514Z","dependency_job_id":"6b3e13d4-ca81-4d36-8ba4-ade74f7201aa","html_url":"https://github.com/oxylabs/lxml-tutorial","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/oxylabs/lxml-tutorial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Flxml-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Flxml-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Flxml-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Flxml-tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/lxml-tutorial/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Flxml-tutorial/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30175907,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T11:48:51.886Z","status":"ssl_error","status_checked_at":"2026-03-06T11:48:51.460Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lxml","parser","python"],"created_at":"2024-11-17T02:09:45.117Z","updated_at":"2026-03-06T12:03:04.456Z","avatar_url":"https://github.com/oxylabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# lxml Tutorial: XML Processing and Web Scraping With lxml\n\n[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877\u0026utm_medium=affiliate\u0026groupid=877\u0026utm_content=lxml-tutorial-github\u0026transaction_id=102f49063ab94276ae8f116d224b67)\n\n[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge\u0026theme=discord)](https://discord.gg/Pds3gBmKMH) 
[![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge\u0026logo=youtube\u0026logoColor=white)](https://www.youtube.com/@oxylabs)\n\n[\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026message=lxml\u0026color=brightgreen\" /\u003e](https://github.com/topics/lxml) [\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026message=Web%20Scraping\u0026color=important\" /\u003e](https://github.com/topics/web-scraping)\n\n- [Installation](#installation)\n- [Creating a simple XML document](#creating-a-simple-xml-document)\n- [The Element class](#the-element-class)\n- [The SubElement class](#the-subelement-class)\n- [Setting text and attributes](#setting-text-and-attributes)\n- [Parse an XML file using LXML in Python](#parse-an-xml-file-using-lxml-in-python)\n- [Finding elements in XML](#finding-elements-in-xml)\n- [Handling HTML with lxml.html](#handling-html-with-lxmlhtml)\n- [lxml web scraping tutorial](#lxml-web-scraping-tutorial)\n- [Conclusion](#conclusion)\n\nIn this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then move on to processing XML and HTML documents. Finally, we will put all the pieces together and see how to extract data using lxml.\n\nFor a detailed explanation, see our [blog post](https://oxy.yt/BrAk).\n\n## Installation\n\nThe easiest way to download and install the lxml library is to use the pip package manager. 
This works on Windows, macOS, and Linux:\n\n```shell\npip3 install lxml\n```\n\n## Creating a simple XML document\n\nA very simple XML document would look like this:\n\n```xml\n\u003croot\u003e\n    \u003cbranch\u003e\n        \u003cbranch_one\u003e\n        \u003c/branch_one\u003e\n        \u003cbranch_one\u003e\n        \u003c/branch_one\u003e\n    \u003c/branch\u003e\n\u003c/root\u003e\n```\n\n## The Element class\n\nTo create an XML document with lxml in Python, the first step is to import the `etree` module of lxml:\n\n```python\n\u003e\u003e\u003e from lxml import etree\n```\n\nIn this example, we will create an HTML document that is XML-compliant. This means that the root element will be named `html`:\n\n```python\n\u003e\u003e\u003e root = etree.Element(\"html\")\n```\n\nSimilarly, every HTML document has a `head` and a `body`:\n\n```python\n\u003e\u003e\u003e head = etree.Element(\"head\")\n\u003e\u003e\u003e body = etree.Element(\"body\")\n```\n\nTo create parent-child relationships, we can use the `append()` method:\n\n```python\n\u003e\u003e\u003e root.append(head)\n\u003e\u003e\u003e root.append(body)\n```\n\nThis document can be serialized and printed to the terminal with the `tostring()` function:\n\n```python\n\u003e\u003e\u003e print(etree.tostring(root, pretty_print=True).decode())\n```\n\n## The SubElement class\n\nCreating an `Element` object and calling the `append()` method for every node can make the code verbose and hard to read. 
An easier way is to use `SubElement`, a factory function that creates an element and appends it to its parent in one call:\n\n```python\nbody = etree.Element(\"body\")\nroot.append(body)\n\n# is the same as\n\nbody = etree.SubElement(root, \"body\")\n```\n\n## Setting text and attributes\n\nThe text of an element is set through its `text` attribute:\n\n```python\npara = etree.SubElement(body, \"p\")\npara.text = \"Hello World!\"\n```\n\nSimilarly, attributes can be set as key-value pairs with the `set()` method:\n\n```python\npara.set(\"style\", \"font-size:20pt\")\n```\n\nNote that attributes can also be passed as keyword arguments to `SubElement`:\n\n```python\npara = etree.SubElement(body, \"p\", style=\"font-size:20pt\", id=\"firstPara\")\npara.text = \"Hello World!\"\n```\n\nHere is the complete code:\n\n```python\nfrom lxml import etree\n \nroot = etree.Element(\"html\")\nhead = etree.SubElement(root, \"head\")\ntitle = etree.SubElement(head, \"title\")\ntitle.text = \"This is Page Title\"\nbody = etree.SubElement(root, \"body\")\nheading = etree.SubElement(body, \"h1\", style=\"font-size:20pt\", id=\"head\")\nheading.text = \"Hello World!\"\npara = etree.SubElement(body, \"p\", id=\"firstPara\")\npara.text = \"This HTML is XML Compliant!\"\npara = etree.SubElement(body, \"p\", id=\"secondPara\")\npara.text = \"This is the second paragraph.\"\n \netree.dump(root)  # prints everything to console. 
Use for debugging only\n```\n\nTo save the document to a file, add the following lines at the bottom of the snippet and run it again:\n\n```python\nwith open('input.html', 'wb') as f:\n    f.write(etree.tostring(root, pretty_print=True))\n```\n\n## Parse an XML file using LXML in Python\n\nSave the following snippet as `input.html`.\n\n```html\n\u003chtml\u003e\n  \u003chead\u003e\n    \u003ctitle\u003eThis is Page Title\u003c/title\u003e\n  \u003c/head\u003e\n  \u003cbody\u003e\n    \u003ch1 style=\"font-size:20pt\" id=\"head\"\u003eHello World!\u003c/h1\u003e\n    \u003cp id=\"firstPara\"\u003eThis HTML is XML Compliant!\u003c/p\u003e\n    \u003cp id=\"secondPara\"\u003eThis is the second paragraph.\u003c/p\u003e\n  \u003c/body\u003e\n\u003c/html\u003e\n```\n\nParse the file with `etree.parse()`, which returns an element tree. To get the root element, call its `getroot()` method:\n\n```python\nfrom lxml import etree\n \ntree = etree.parse('input.html')\nelem = tree.getroot()\netree.dump(elem)  # prints file contents to console\n```\n\nThe `lxml.etree` module also exposes `fromstring()`, a function that parses the contents of a valid XML string:\n\n```python\nxml = '\u003chtml\u003e\u003cbody\u003eHello\u003c/body\u003e\u003c/html\u003e'\nroot = etree.fromstring(xml)\netree.dump(root)\n```\n\nIf you want to dig deeper into parsing, we have already written a tutorial on [BeautifulSoup](https://oxylabs.io/blog/beautiful-soup-parsing-tutorial), a Python package used for parsing HTML and XML documents.\n\n## Finding elements in XML\n\nBroadly, there are two ways of finding elements using the Python lxml library. 
The first is to use the query languages that lxml supports: XPath and ElementPath. For example, `find()` takes an ElementPath expression and returns the first matching element:\n\n```python\nfrom lxml import etree\n\ntree = etree.parse('input.html')\nelem = tree.getroot()\npara = elem.find('body/p')\netree.dump(para)\n \n# Output\n# \u003cp id=\"firstPara\"\u003eThis HTML is XML Compliant!\u003c/p\u003e\n```\n\nSimilarly, `findall()` returns a list of all the elements matching the selector:\n\n```python\nelem = tree.getroot()\npara = elem.findall('body/p')\nfor e in para:\n    etree.dump(e)\n \n# Outputs\n# \u003cp id=\"firstPara\"\u003eThis HTML is XML Compliant!\u003c/p\u003e\n# \u003cp id=\"secondPara\"\u003eThis is the second paragraph.\u003c/p\u003e\n```\n\nAnother way of selecting elements is to use XPath directly with the `xpath()` method:\n\n```python\npara = elem.xpath('//p/text()')\nfor e in para:\n    print(e)\n \n# Output\n# This HTML is XML Compliant!\n# This is the second paragraph.\n```\n\n## Handling HTML with lxml.html\n\nThe `lxml.html` module parses HTML, including markup that is not XML-compliant. Here is the code to print the text of all paragraphs from the same HTML file:\n\n```python\nfrom lxml import html\nwith open('input.html') as f:\n    html_string = f.read()\ntree = html.fromstring(html_string)\npara = tree.xpath('//p/text()')\nfor e in para:\n    print(e)\n \n# Output\n# This HTML is XML Compliant!\n# This is the second paragraph.\n```\n\n## lxml web scraping tutorial\n\nNow that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.\n\nFor this, the Requests library is a great choice:\n\n```shell\npip install requests\n```\n\nOnce the Requests library is installed, the HTML of any web page can be retrieved using the `get()` method. 
Here is an example.\n\n```python\nimport requests\n \nresponse = requests.get('http://books.toscrape.com/')\nprint(response.text)\n# prints source HTML\n```\n\nHere is a quick example that prints a list of countries from Wikipedia:\n\n```python\nimport requests\nfrom lxml import html\n \nresponse = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')\n \ntree = html.fromstring(response.text)\ncountries = tree.xpath('//span[@class=\"flagicon\"]')\nfor country in countries:\n    print(country.xpath('./following-sibling::a/text()')[0])\n```\n\nThe following modified code prints the country name and image URL of the flag.\n\n```python\nfor country in countries:\n    flag = country.xpath('./img/@src')[0]\n    country = country.xpath('./following-sibling::a/text()')[0]\n    print(country, flag)\n```\n\n## Conclusion\n\nIf you wish to find out more about XML Processing and Web Scraping With lxml, see our [blog post](https://oxy.yt/BrAk).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Flxml-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Flxml-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Flxml-tutorial/lists"}