https://github.com/oxylabs/custom-parser-instructions

Learn the fundamentals of writing parsing instructions with Oxylabs' Custom Parser.

parser parsing python scraping scraping-websites tutorial web-scraping

Host: GitHub
URL: https://github.com/oxylabs/custom-parser-instructions
Owner: oxylabs
Created: 2023-05-29T06:20:14.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-02-11T13:04:42.000Z (5 months ago)
Last Synced: 2025-02-11T14:23:12.731Z (5 months ago)
Topics: parser, parsing, python, scraping, scraping-websites, tutorial, web-scraping
Language: Python
Homepage:
Size: 1.77 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md

# Custom Parser Instruction

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)

# How to Write Parsing Instructions with Custom Parser?

- [The structure of parsing instructions](#the-structure-of-parsing-instructions)
- [How to write parsing instructions](#how-to-write-parsing-instructions)
  * [Configuring the payload](#configuring-the-payload)
  * [Parsing a single field using XPath](#parsing-a-single-field-using-xpath)
  * [Parsing a single field using CSS selectors](#parsing-a-single-field-using-css-selectors)
  * [Parsing multiple fields with separated results](#parsing-multiple-fields-with-separated-results)
  * [Parsing multiple fields with categorized results](#parsing-multiple-fields-with-categorized-results)
- [Parsing example of a real target](#parsing-example-of-a-real-target)
  * [Product listings](#product-listings)
  * [Product page](#product-page)

Custom Parser is a free feature of Oxylabs [Scraper APIs](https://oxylabs.io/products/scraper-api), which allows you to write your own parsing instructions for a chosen target when needed. The Custom Parser feature expands your options and flexibility throughout the entire scraping process on any website.

With it, you can:

- Extract all text from an HTML document;

- Parse data using XPath and CSS expressions;

- Manipulate strings with pre-defined functions and regex expressions;

- Perform common string actions like conversion, indexing, and retrieving the length;

- Do mathematical calculations, such as calculating the average, finding the maximum and minimum values, and multiplying values.

This guide will teach you the fundamentals of writing custom parsing
instructions in Python and will showcase Custom Parser in action.

## The structure of parsing instructions

To start off, you should already have a basic grasp of Oxylabs Scraper
APIs. If you’re new to our web scraping solutions, you can familiarize
yourself by reading our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/custom-parser/getting-started).
Note that you can use only one parser at a time: either a
Dedicated Parser, Adaptive Parser, or Custom Parser.

In essence, the parsing instructions have to be specified in the payload
of the request, which is composed in a JSON format. Parsing instructions
consist of HTML node selection and value transformation functions.

You’re going to use XPath expressions or CSS selectors to select HTML nodes and extract data from them. We highly recommend reading our [blog post](https://oxylabs.io/blog/xpath-vs-css), where we introduce the
basics of using XPath and CSS selectors.

The two XPath functions of Custom Parser are `xpath`, which returns all
matches, and `xpath_one`, which returns the first match. Similarly, there are also two CSS functions you can use – `css` to get all matches and `css_one` to get only the first match. You can learn more about other functions in our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/custom-parser/list-of-functions).

The structure of parsing instructions can be summed up into four main
steps:

1. Name of a field that will store the results;

2. `_fns` array that holds all the specific parsing instructions for that field;

3. `_fn` function that defines the action;

4. `_args` variables that modify the behavior of the `_fn` associated with it.

The following code sample illustrates these steps:

```python
{
    "parsing_instructions": {
        "Result field name": {  # 1.
            "_fns": [  # 2.
                {
                    "_fn": "What action to perform?",  # 3.
                    "_args": ["How to perform the action?"]  # 4.
                }
            ]
        }
    }
}
```
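
The four parts above map onto a small helper function. The following is a minimal sketch rather than part of Oxylabs' tooling: `make_field` is a hypothetical name, and the `"titles"` field with its XPath expression is borrowed from the examples later in this guide.

```python
def make_field(fn, *args):
    """Build the instructions for one result field (hypothetical helper)."""
    step = {"_fn": fn}
    if args:
        # "_args" is only included when the function takes arguments.
        step["_args"] = list(args)
    # "_fns" is an array, so the single step is wrapped in a list.
    return {"_fns": [step]}


parsing_instructions = {
    "titles": make_field("xpath", "//h3//a/text()"),
}
print(parsing_instructions)
```

Building the dictionary programmatically like this is optional; writing the nested structure by hand, as the rest of this guide does, produces the same payload.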

## How to write parsing instructions

We’ll use a dummy bookstore website,
[books.toscrape.com](https://books.toscrape.com/catalogue/page-1.html),
to showcase several ways you can extract the desired information.

### Configuring the payload

First, define the necessary payload parameters for your specific needs,
then add the `"parse": True` parameter to enable parsing. Next, add the
`"parsing_instructions"` parameter to define the parsing instructions
within the curly brackets. So far, your payload should look similar to
this:
```python
payload = {
    "source": "universal",
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "parse": True,
    "parsing_instructions": {}
}
```
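
Since the payload is sent as JSON, it can help to see how the Python dictionary is serialized. The sketch below uses only the standard `json` module and makes no request; the variable name `body` is chosen for illustration.

```python
import json

payload = {
    "source": "universal",
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "parse": True,
    "parsing_instructions": {},
}

# Python's True becomes JSON's lowercase true when serialized.
body = json.dumps(payload, indent=2)
print(body)
```

When you pass the dictionary through the `json=` argument of `requests`, this serialization happens for you.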
### Parsing a single field using XPath

Let’s start by gathering all the book titles from our [target
page](https://books.toscrape.com/catalogue/page-1.html). Create a
new JSON object and assign a new field, which will hold a list of all
the book titles. This field name will be displayed in the parsed result.
Let’s call it `"titles"`:

> **Note**
>
> When creating custom parameter names, you can’t use the
underscore symbol `_` at the very beginning.

```python
{
    "parsing_instructions": {
        "titles": {}
    }
}
```
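
The naming rule from the note above can be checked locally before a request is sent. This is a minimal sketch that only inspects top-level field names; `invalid_field_names` is a hypothetical helper, not an Oxylabs function.

```python
def invalid_field_names(parsing_instructions):
    """Return custom top-level field names that start with an underscore."""
    return [name for name in parsing_instructions if name.startswith("_")]


# A valid field name produces an empty list.
print(invalid_field_names({"titles": {}}))
# A name with a leading underscore is reported.
print(invalid_field_names({"_titles": {}}))
```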
Next, let’s add the `_fns` array to define a data processing pipeline.
This property will hold all the instructions required to parse the book
titles from our target:
```python
{
    "parsing_instructions": {
        "titles": {
            "_fns": []
        }
    }
}
```
Then, in the square brackets of the `_fns` field, add the `_fn` and
`_args` properties:
```python
{
    "parsing_instructions": {
        "titles": {
            "_fns": [
                {
                    "_fn": "",
                    "_args": [""]
                }
            ]
        }
    }
}
```
In this section we’ll use XPath expressions to parse all the book titles. You can find an example of how to use CSS selectors below.

In order to get all the book titles, set `"_fn"` value to `"xpath"` and
provide one or more XPath expressions in the `"_args"` array. Please note
that the XPath expressions will be executed in the order they’re found
in the array. For instance, if the first XPath expression is valid (i.e.
the node exists), subsequent XPath expressions won’t be executed.
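
The fallback order described above can be emulated locally. The sketch below is an illustration of the behavior, not Custom Parser's implementation: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()`), and both the sample markup and the `first_matching` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup resembling the book listing.
SAMPLE = """
<html><body>
  <h3><a title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <h3><a title="Soumission">Soumission</a></h3>
</body></html>
""".strip()


def first_matching(root, expressions):
    """Return the text of the nodes found by the first expression that matches."""
    for expression in expressions:
        matches = root.findall(expression)
        if matches:
            # Later expressions are not evaluated once one has matched.
            return [node.text for node in matches]
    return []


root = ET.fromstring(SAMPLE)
# ".//h2/a" finds nothing, so ".//h3/a" is used as the fallback.
print(first_matching(root, [".//h2/a", ".//h3/a"]))
```

Listing several expressions is therefore a way to provide fallbacks for pages whose layout varies.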

In this case, all the book titles are in the `<a>` tags, which are
inside the `<h3>` tags, so the XPath expression can be written as
`"//h3//a/text()"`. The `text()` in the XPath expression instructs the
parser to select only the textual values:

```python
import requests
from pprint import pprint

payload = {
    "source": "universal",
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "parse": True,
    "parsing_instructions": {
        "titles": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//h3//a/text()"]
                }
            ]
        }
    }
}

# Replace USERNAME and PASSWORD with your API credentials.
response = requests.request(
    "POST",
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload
)

pprint(response.json())
```
This code produces the following list of book titles:
```bash
{
    "titles": [
        "A Light in the ...",
        "Tipping the Velvet",
        "Soumission",
        "Sharp Objects",
        "Sapiens: A Brief History ...",
        "The Requiem Red",
        "The Dirty Little Secrets ...",
        "The Coming Woman: A ...",
        "The Boys in the ...",
        "The Black Maria",
        "Starving Hearts (Triangular Trade ...",
        "Shakespeare's Sonnets",
        "Set Me Free",
        "Scott Pilgrim's Precious Little ...",
        "Rip it Up and ...",
        "Our Band Could Be ...",
        "Olio",
        "Mesaerion: The Best Science ...",
        "Libertarianism for Beginners",
        "It's Only the Himalayas"
    ]
}
```
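
To work with the parsed fields in code, you need to pull them out of the response JSON. The sketch below assumes the parsed data sits under `results[0]["content"]`; that layout is an assumption to verify against an actual response, and `parsed_content` and `sample_response` are hypothetical names used only for illustration.

```python
# Hypothetical response shape, shortened to two titles for the example.
sample_response = {
    "results": [
        {"content": {"titles": ["A Light in the ...", "Tipping the Velvet"]}}
    ]
}


def parsed_content(response_json):
    """Return the parsed fields of the first result, or an empty dict."""
    results = response_json.get("results", [])
    return results[0].get("content", {}) if results else {}


titles = parsed_content(sample_response).get("titles", [])
print(titles)
```

With a live request, you would pass `response.json()` to the helper instead of the sample dictionary.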
### Parsing a single field using CSS selectors

Alternatively, the same result can be achieved using CSS selectors. To do that, set the `"_fn"` value to `"css"`, and provide one or more CSS expressions in the `"_args"` array.
To parse all the book titles from the target website, you can form the CSS expression as `"h3 > [title]"`, which selects the direct children of `<h3>` elements that have a `title` attribute. Your parsing instructions should look like this:

```python
{
    "parsing_instructions": {
        "titles": {
            "_fns": [
                {
                    "_fn": "css",
                    "_args": ["h3 > [title]"]
                }
            ]
        }
    }
}
```

Note that CSS expressions can only select HTML elements, meaning they can’t directly extract the values. Hence, using the above code, the received response is a JSON array with HTML elements, including the opening and closing tags.

To extract the values, you can add another `"_fn"` function to the `"_fns"` array and use the `"element_text"` function of Custom Parser, which extracts the text and strips leading and trailing whitespace:

```python
{
    "parsing_instructions": {
        "titles": {
            "_fns": [
                {
                    "_fn": "css",
                    "_args": ["h3 > [title]"]
                },
                {
                    "_fn": "element_text"
                }
            ]
        }
    }
}
```

This time, the parsing instructions return only the text of the matched elements, without the surrounding HTML tags:

```bash
{
    "titles": [
        "A Light in the ...",
        "Tipping the Velvet",
        "Soumission",
        "Sharp Objects",
        "Sapiens: A Brief History ...",
        "The Requiem Red",
        "The Dirty Little Secrets ...",
        "The Coming Woman: A ...",
        "The Boys in the ...",
        "The Black Maria",
        "Starving Hearts (Triangular Trade ...",
        "Shakespeare's Sonnets",
        "Set Me Free",
        "Scott Pilgrim's Precious Little ...",
        "Rip it Up and ...",
        "Our Band Could Be ...",
        "Olio",
        "Mesaerion: The Best Science ...",
        "Libertarianism for Beginners",
        "It's Only the Himalayas"
    ]
}
```

### Parsing multiple fields with separated results

Let’s include the book prices, which are in the `<p>` tag with an
attribute `class="price_color"`. You can separate the results by creating
another field that will hold the prices. The process is the same as
explained previously – you have to create another field called `"prices"`,
just like you did with the `"titles"`. The parsing instructions using XPath should be
as follows:
```python
{
    "parsing_instructions": {
        "titles": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//h3//a/text()"]
                }
            ]
        },
        "prices": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//p[@class='price_color']/text()"]
                }
            ]
        }
    }
}
```
The output will give you results separated by fields:

```bash
{
    "prices": [
        "£51.77",
        "£53.74",

        ...

        "£51.33",
        "£45.17"
    ],
    "titles": [
        "A Light in the ...",
        "Tipping the Velvet",

        ...

        "Libertarianism for Beginners",
        "It's Only the Himalayas"
    ]
}
```
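
With separated fields, matching each title to its price is left to you. The following sketch pairs the two lists by position on the client side; it assumes both lists are complete and in the same order, and `pair_fields` is a hypothetical helper. The sample values are the first two entries of the output above.

```python
parsed = {
    "titles": ["A Light in the ...", "Tipping the Velvet"],
    "prices": ["£51.77", "£53.74"],
}


def pair_fields(parsed):
    """Combine titles and prices by position into one record per book."""
    titles, prices = parsed["titles"], parsed["prices"]
    if len(titles) != len(prices):
        # A missing price on one book would silently shift every later pair.
        raise ValueError("titles and prices are misaligned")
    return [{"title": t, "price": p} for t, p in zip(titles, prices)]


books = pair_fields(parsed)
print(books)
```

Positional pairing breaks as soon as one product lacks a field, which is one reason to group the fields per product on the parser side instead.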

The results can also be categorized by product, which we’ll cover
next.

### Parsing multiple fields with categorized results

Say you want to get the **titles**, **prices**, **availability**, and
the **URL** of all the books on page 1. Following the logic of the
previous parsing instructions, the results would be separated into
different fields, which may not be a preferred way to parse product
listings.

Custom Parser allows you to categorize the results by product. To do
that, you can first define the parsing scope of the HTML document and
iterate over it with the `"_items"` function. This function tells our
system that every field inside it, such as `"title"`, is a part of one
item and should be grouped together.

By defining the parsing scope, you’re telling the system to look only at
a specific part of the HTML document. All books are listed within the
`<li>` tags, which are under the `<ol>` tag, so the parsing scope can be
selected with the XPath expression `//ol//li`.

When defining the parsing scope, use the `xpath` function for the `_fn`
property to find everything that matches the XPath expression. At this
moment, the code should look like this:
```python
{
    "parsing_instructions": {
        "products": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//ol//li"]
                }
            ]
        }
    }
}
```
Then, when using the `"_items"` property, use the `xpath_one` function to
find only the first match since the `"_items"` property will iterate over
the defined parsing scope, which finds all the matches. Let’s add the
**title**, **price**, **availability**, and **URL** fields to our code
inside the `"_items"` property:
```python
{
    "parsing_instructions": {
        "products": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [
                        "//ol//li"
                    ]
                }
            ],
            "_items": {
                "title": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [
                                ".//h3//a/text()"
                            ]
                        }
                    ]
                },
                "price": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [
                                ".//p[@class='price_color']/text()"
                            ]
                        }
                    ]
                },
                "availability": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [
                                "normalize-space(.//p[contains(@class, 'availability')]/text()[last()])"
                            ]
                        }
                    ]
                },
                "url": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [
                                ".//a/@href"
                            ]
                        }
                    ]
                }
            }
        }
    }
}
```
With these parsing instructions, the results are categorized by product:
```bash
{
    "products": [
        {
            "availability": "In stock",
            "price": "£51.77",
            "title": "A Light in the ...",
            "url": "a-light-in-the-attic_1000/index.html"
        },
        {
            "availability": "In stock",
            "price": "£53.74",
            "title": "Tipping the Velvet",
            "url": "tipping-the-velvet_999/index.html"
        },

        ...

        {
            "availability": "In stock",
            "price": "£51.33",
            "title": "Libertarianism for Beginners",
            "url": "libertarianism-for-beginners_982/index.html"
        },
        {
            "availability": "In stock",
            "price": "£45.17",
            "title": "It's Only the Himalayas",
            "url": "its-only-the-himalayas_981/index.html"
        }
    ]
}
```
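
The categorized records still hold prices as strings and URLs relative to the listing page. As a minimal client-side sketch, the snippet below converts both with the standard library; `normalize` is a hypothetical helper, and the sample record is the first product from the output above.

```python
from urllib.parse import urljoin

PAGE_URL = "https://books.toscrape.com/catalogue/page-1.html"

products = [
    {
        "availability": "In stock",
        "price": "£51.77",
        "title": "A Light in the ...",
        "url": "a-light-in-the-attic_1000/index.html",
    },
]


def normalize(product, page_url=PAGE_URL):
    """Return a copy with a numeric price and an absolute URL."""
    return {
        **product,
        # Drop the currency symbol before converting to a number.
        "price": float(product["price"].lstrip("£")),
        # Resolve the relative link against the page it was scraped from.
        "url": urljoin(page_url, product["url"]),
    }


cleaned = [normalize(product) for product in products]
print(cleaned[0]["price"], cleaned[0]["url"])
```

Similar conversions may also be possible inside the parsing instructions through Custom Parser's string functions; doing them in Python is simply one option.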

## Parsing example of a real target

### Product listings

In this section, let’s use Custom Parser to parse [this product
listing page](https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=laptop&_sacat=0&LH_TitleDesc=0&_odkw=laptop&_osacat=0) on eBay:

![](images/ebay_product_listings.png)

The goal is to extract the **title**, **price**, **item condition**,
**URL**, and **seller information** from each product listing.

Here, you can again define the parsing scope. All of the products are
inside a tag with the attribute `data-viewport`.
