An open API service indexing awesome lists of open source software.

https://github.com/oxylabs/lxml-tutorial

A tutorial on parsing webpages with lxml
https://github.com/oxylabs/lxml-tutorial

lxml parser python

Last synced: 6 months ago
JSON representation

A tutorial on parsing webpages with lxml

Awesome Lists containing this project

README

          

# lxml Tutorial: XML Processing and Web Scraping With lxml

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)

[](https://github.com/topics/lxml) [](https://github.com/topics/web-scraping)

- [Installation](#installation)
- [Creating a simple XML document](#creating-a-simple-xml-document)
- [The Element class](#the-element-class)
- [The SubElement class](#the-subelement-class)
- [Setting text and attributes](#setting-text-and-attributes)
- [Parse an XML file using LXML in Python](#parse-an-xml-file-using-lxml-in-python)
- [Finding elements in XML](#finding-elements-in-xml)
- [Handling HTML with lxml.html](#handling-html-with-lxmlhtml)
- [lxml web scraping tutorial](#lxml-web-scraping-tutorial)
- [Conclusion](#conclusion)

In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then jump on processing XML and HTML documents. Finally, we will put together all the pieces and see how to extract data using lxml.

For a detailed explanation, see our [blog post](https://oxy.yt/BrAk).

## Installation

The best way to download and install the lxml library is to use the pip package manager. This works on Windows, Mac, and Linux:

```shell
pip3 install lxml
```

## Creating a simple XML document

A very simple XML document would look like this:

```xml






```

## The Element class

To create an XML document using python lxml, the first step is to import the `etree` module of lxml:

```python
>>> from lxml import etree
```

In this example, we will create an HTML, which is XML compliant. It means that the root element will have its name as html:

```python
>>> root = etree.Element("html")
```

Similarly, every html will have a head and a body:

```python
>>> head = etree.Element("head")
>>> body = etree.Element("body")
```

To create parent-child relationships, we can simply use the append() method.

```python
>>> root.append(head)
>>> root.append(body)
```

This document can be serialized and printed to the terminal with the help of `tostring()` function:

```python
>>> print(etree.tostring(root, pretty_print=True).decode())
```

## The SubElement class

Creating an `Element` object and calling the `append()` function can make the code messy and unreadable. The easiest way is to use the `SubElement` type:

```python
body = etree.Element("body")
root.append(body)

# is same as

body = etree.SubElement(root,"body")
```

## Setting text and attributes

Here are the examples:

```python
para = etree.SubElement(body, "p")
para.text="Hello World!"
```

Similarly, attributes can be set using key-value convention:

```python
para.set("style", "font-size:20pt")
```

One thing to note here is that the attribute can be passed in the constructor of SubElement:

```python
para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"
```

Here is the complete code:

```python
from lxml import etree

root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "This is Page Title"
body = etree.SubElement(root, "body")
heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")
heading.text = "Hello World!"
para = etree.SubElement(body, "p", id="firstPara")
para.text = "This HTML is XML Compliant!"
para = etree.SubElement(body, "p", id="secondPara")
para.text = "This is the second paragraph."

etree.dump(root) # prints everything to console. Use for debug only
```

Add the following lines at the bottom of the snippet and run it again:

```python
with open(‘input.html’, ‘wb’) as f:
f.write(etree.tostring(root, pretty_print=True)
```

## Parse an XML file using LXML in Python

Save the following snippet as input.html.

```html


This is Page Title


Hello World!


This HTML is XML Compliant!


This is the second paragraph.


```

To get the root element, simply call the `getroot()` method.

```python
from lxml import etree

tree = etree.parse('input.html')
elem = tree.getroot()
etree.dump(elem) #prints file contents to console
```

The lxml.etree module exposes another method that can be used to parse contents from a valid xml string — `fromstring()`

```python
xml = 'Hello'
root = etree.fromstring(xml)
etree.dump(root)
```

If you want to dig deeper into parsing, we have already written a tutorial on [BeautifulSoup](https://oxylabs.io/blog/beautiful-soup-parsing-tutorial), a Python package used for parsing HTML and XML documents.

## Finding elements in XML

Broadly, there are two ways of finding elements using the Python lxml library. The first is by using the Python lxml querying languages: XPath and ElementPath.

```python
tree = etree.parse('input.html')
elem = tree.getroot()
para = elem.find('body/p')
etree.dump(para)

# Output
#

This HTML is XML Compliant!


```

Similarly, `findall()` will return a list of all the elements matching the selector.

```python
elem = tree.getroot()
para = elem.findall('body/p')
for e in para:
etree.dump(e)

# Outputs
#

This HTML is XML Compliant!


#

This is the second paragraph.


```

Another way of selecting the elements is by using XPath directly

```python
para = elem.xpath('//p/text()')
for e in para:
print(e)

# Output
# This HTML is XML Compliant!
# This is the second paragraph.
```

## Handling HTML with lxml.html

Here is the code to print all paragraphs from the same HTML file.

```python
from lxml import html
with open('input.html') as f:
html_string = f.read()
tree = html.fromstring(html_string)
para = tree.xpath('//p/text()')
for e in para:
print(e)

# Output
# This HTML is XML Compliant!
# This is the second paragraph
```

## lxml web scraping tutorial

Now that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.

For this, the Requests library is a great choice:

```
pip install requests
```

Once the requests library is installed, HTML of any web page can be retrieved using `get()` method. Here is an example.

```python
import requests

response = requests.get('http://books.toscrape.com/')
print(response.text)
# prints source HTML
```

Here is a quick example that prints a list of countries from Wikipedia:

```python
import requests
from lxml import html

response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')

tree = html.fromstring(response.text)
countries = tree.xpath('//span[@class="flagicon"]')
for country in countries:
print(country.xpath('./following-sibling::a/text()')[0])
```

The following modified code prints the country name and image URL of the flag.

```python
for country in countries:
flag = country.xpath('./img/@src')[0]
country = country.xpath('./following-sibling::a/text()')[0]
print(country, flag)
```

## Conclusion

If you wish to find out more about XML Processing and Web Scraping With lxml, see our [blog post](https://oxy.yt/BrAk).