https://github.com/oxylabs/lxml-tutorial

A tutorial on parsing webpages with lxml
https://github.com/oxylabs/lxml-tutorial
lxml parser python
Last synced: 6 months ago
JSON representation
A tutorial on parsing webpages with lxml
Host: GitHub
URL: https://github.com/oxylabs/lxml-tutorial
Owner: oxylabs
Created: 2022-03-14T08:47:41.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2025-02-11T12:48:16.000Z (8 months ago)
Last Synced: 2025-03-29T22:12:04.371Z (6 months ago)
Topics: lxml, parser, python
Language: Python
Homepage:
Size: 20.5 KB
Stars: 5
Watchers: 0
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # lxml Tutorial: XML Processing and Web Scraping With lxml

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)

[](https://github.com/topics/lxml) [](https://github.com/topics/web-scraping)

- [Installation](#installation)

- [Creating a simple XML document](#creating-a-simple-xml-document)

- [The Element class](#the-element-class)

- [The SubElement class](#the-subelement-class)

- [Setting text and attributes](#setting-text-and-attributes)

- [Parse an XML file using LXML in Python](#parse-an-xml-file-using-lxml-in-python)

- [Finding elements in XML](#finding-elements-in-xml)

- [Handling HTML with lxml.html](#handling-html-with-lxmlhtml)

- [lxml web scraping tutorial](#lxml-web-scraping-tutorial)

- [Conclusion](#conclusion)

In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then jump on processing XML and HTML documents. Finally, we will put together all the pieces and see how to extract data using lxml. 

For a detailed explanation, see our [blog post](https://oxy.yt/BrAk).

## Installation

The best way to download and install the lxml library is to use the pip package manager. This works on Windows, Mac, and Linux:

```shell

pip3 install lxml

```

## Creating a simple XML document

A very simple XML document would look like this:

```xml

    

        

        

        

        

    

```

## The Element class

To create an XML document using python lxml, the first step is to import the `etree` module of lxml:

```python

>>> from lxml import etree

```

In this example, we will create an HTML, which is XML compliant. It means that the root element will have its name as html:

```python

>>> root = etree.Element("html")

```

Similarly, every html will have a head and a body:

```python

>>> head = etree.Element("head")

>>> body = etree.Element("body")

```

To create parent-child relationships, we can simply use the append() method.

```python

>>> root.append(head)

>>> root.append(body)

```

This document can be serialized and printed to the terminal with the help of `tostring()` function:

```python

>>> print(etree.tostring(root, pretty_print=True).decode())

```

## The SubElement class

Creating an `Element` object and calling the `append()` function can make the code messy and unreadable. The easiest way is to use the `SubElement` type:

```python

body = etree.Element("body")

root.append(body)

# is same as 

body = etree.SubElement(root,"body")

```

## Setting text and attributes

Here are the examples:

```python

para = etree.SubElement(body, "p")

para.text="Hello World!"

```

Similarly, attributes can be set using key-value convention:

```python

para.set("style", "font-size:20pt")

```

One thing to note here is that the attribute can be passed in the constructor of SubElement:

```python

para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")

para.text = "Hello World!"

```

Here is the complete code:

```python

from lxml import etree

 

root = etree.Element("html")

head = etree.SubElement(root, "head")

title = etree.SubElement(head, "title")

title.text = "This is Page Title"

body = etree.SubElement(root, "body")

heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")

heading.text = "Hello World!"

para = etree.SubElement(body, "p",  id="firstPara")

para.text = "This HTML is XML Compliant!"

para = etree.SubElement(body, "p",  id="secondPara")

para.text = "This is the second paragraph."

 

etree.dump(root)  # prints everything to console. Use for debug only

```

Add the following lines at the bottom of the snippet and run it again:

```python

with open(‘input.html’, ‘wb’) as f:

    f.write(etree.tostring(root, pretty_print=True)

```

## Parse an XML file using LXML in Python

Save the following snippet as input.html.

```html

  

    This is Page Title

  

  

    
Hello World!

    This HTML is XML Compliant!

    This is the second paragraph.

  

```

To get the root element, simply call the `getroot()` method.

```python

from lxml import etree

 

tree = etree.parse('input.html')

elem = tree.getroot()

etree.dump(elem) #prints file contents to console

```

The lxml.etree module exposes another method that can be used to parse contents from a valid xml string — `fromstring()`

```python

xml = 'Hello'

root = etree.fromstring(xml)

etree.dump(root)

```

If you want to dig deeper into parsing, we have already written a tutorial on [BeautifulSoup](https://oxylabs.io/blog/beautiful-soup-parsing-tutorial), a Python package used for parsing HTML and XML documents. 

## Finding elements in XML

Broadly, there are two ways of finding elements using the Python lxml library. The first is by using the Python lxml querying languages: XPath and ElementPath.

```python

tree = etree.parse('input.html')

elem = tree.getroot()

para = elem.find('body/p')

etree.dump(para)

 

# Output 

# 
This HTML is XML Compliant!

```

Similarly, `findall()` will return a list of all the elements matching the selector.

```python

elem = tree.getroot()

para = elem.findall('body/p')

for e in para:

    etree.dump(e)

 

# Outputs

# 
This HTML is XML Compliant!

# This is the second paragraph.

```

Another way of selecting the elements is by using XPath directly

```python

para = elem.xpath('//p/text()')

for e in para:

    print(e)

 

# Output

# This HTML is XML Compliant!

# This is the second paragraph.

```

## Handling HTML with lxml.html

Here is the code to print all paragraphs from the same HTML file.

```python

from lxml import html

with open('input.html') as f:

    html_string = f.read()

tree = html.fromstring(html_string)

para = tree.xpath('//p/text()')

for e in para:

    print(e)

 

# Output

# This HTML is XML Compliant!

# This is the second paragraph

```

## lxml web scraping tutorial 

Now that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.

For this, the Requests library is a great choice:

```

pip install requests

```

Once the requests library is installed, HTML of any web page can be retrieved using  `get()` method. Here is an example.

```python

import requests

 

response = requests.get('http://books.toscrape.com/')

print(response.text)

# prints source HTML

```

Here is a quick example that prints a list of countries from Wikipedia:

```python

import requests

from lxml import html

 

response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')

 

tree = html.fromstring(response.text)

countries = tree.xpath('//span[@class="flagicon"]')

for country in countries:

    print(country.xpath('./following-sibling::a/text()')[0])

```

The following modified code prints the country name and image URL of the flag.

```python

for country in countries:

    flag = country.xpath('./img/@src')[0]

    country = country.xpath('./following-sibling::a/text()')[0]

    print(country, flag)

```

## Conclusion

If you wish to find out more about XML Processing and Web Scraping With lxml, see our [blog post](https://oxy.yt/BrAk).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/oxylabs/lxml-tutorial

Awesome Lists containing this project

README

Hello World!