https://github.com/remotemerge/xpath-parser


# XPath Parser

[![Package](https://img.shields.io/npm/v/@remotemerge/xpath-parser?logo=npm)](https://www.npmjs.com/package/@remotemerge/xpath-parser)
![Build](https://img.shields.io/github/actions/workflow/status/remotemerge/xpath-parser/production.yml?logo=github)
![Downloads](https://img.shields.io/npm/dt/@remotemerge/xpath-parser)
![License](https://img.shields.io/npm/l/@remotemerge/xpath-parser)

XPath Parser is a JavaScript utility for extracting data from HTML and XML documents, built for web scraping in a
JavaScript environment. It is open source, modern, lightweight, and fast, and it integrates easily into new or existing
web crawlers, browser extensions, and similar tools.

## Install

```bash
# using NPM
npm i @remotemerge/xpath-parser
# using Yarn
yarn add @remotemerge/xpath-parser
```

## Usage

Import the XPathParser class in your project.

```javascript
import XPathParser from '@remotemerge/xpath-parser';
```

## Examples

The constructor `XPathParser(html|DOM)` accepts either an HTML string or a DOM object. Initialize it with whichever you
have.

```javascript
const parser = new XPathParser('...');
```

### Scrape First Match

This method evaluates the given expression and captures the first result. It is useful for scraping a single element
value like `title`, `price`, etc. from HTML pages.

```javascript
const result = parser.queryFirst('//span[@id="productTitle"]');
console.log(result);
```

Sample output:

```text
LETSCOM Fitness Tracker HR, Activity Tracker Watch with Heart Rate...
```
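
For orientation, "evaluate an expression and keep the first result" maps naturally onto the browser's native
`document.evaluate` API. The sketch below illustrates that general technique only; it is an assumption, not the
library's actual implementation. `firstMatchText` is a hypothetical helper, and the constant `9` is the value of
`XPathResult.FIRST_ORDERED_NODE_TYPE`.

```javascript
// Hypothetical helper, not part of @remotemerge/xpath-parser.
// 9 is the value of XPathResult.FIRST_ORDERED_NODE_TYPE in the DOM standard.
const FIRST_ORDERED_NODE_TYPE = 9;

function firstMatchText(doc, expression) {
  // Evaluate the expression against the whole document and keep only the first node.
  const result = doc.evaluate(expression, doc, null, FIRST_ORDERED_NODE_TYPE, null);
  const node = result.singleNodeValue;
  // Return the trimmed text, or an empty string when nothing matched.
  return node ? (node.textContent || '').trim() : '';
}

// In a browser: firstMatchText(document, '//span[@id="productTitle"]');
```

Passing the document in as a parameter keeps the helper usable with any object that exposes `evaluate`, such as a
document produced by `DOMParser`.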

### Scrape All Matches

This method evaluates the given expression and captures all results. It is useful for scraping all URLs, all images, all
CSS classes, etc. from HTML pages.

```javascript
// scrape titles
const results = parser.queryList('//span[contains(@class, "zg-item")]/a/div');
console.log(results);
```

Sample output:

```javascript
['Cell Phone Stand,Angle Height Adjusta…', 'Selfie Ring Light with Tripod…', 'HOVAMP MFi Certified Nylon…', '...']
```

### Scrape Multiple Elements

This method loops through the given expressions and captures the first match of each. It is useful for scraping full
product information (`title`, `seller`, `price`, `rating`, etc.) from HTML pages. The keys are preserved, and each value
is returned under the key of its expression.

```javascript
const result = parser.multiQuery({
  title: '//div[@id="ppd"]//span[@id="productTitle"]',
  seller: '//div[@id="ppd"]//a[@id="bylineInfo"]',
  price: '//div[@id="ppd"]//span[@id="priceblock_dealprice"]',
  rating: '//div[@id="ppd"]//span[@id="acrCustomerReviewText"]',
});
```

Sample output:

```text
{
  title: 'LETSCOM Fitness Tracker HR, Activity Tracker Watch with Heart Rate Monitor...',
  seller: 'LETSCOM',
  price: '$20.39',
  rating: '1,489 ratings',
}
```
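
The values in the sample output are plain strings, so numeric fields usually need a cleanup step before they can be
compared or stored. The following is a minimal sketch of such a step. `normalizeProduct` is a hypothetical helper, not
part of the library, and it assumes US-style formatting (`$` prefix, `,` as thousands separator).

```javascript
// Hypothetical post-processing helper; the input mirrors the sample output above.
function normalizeProduct(raw) {
  return {
    title: raw.title.trim(),
    seller: raw.seller.trim(),
    // '$20.39' -> 20.39: drop everything except digits and the decimal point.
    price: Number.parseFloat(raw.price.replace(/[^0-9.]/g, '')),
    // '1,489 ratings' -> 1489: drop everything except digits.
    ratingCount: Number.parseInt(raw.rating.replace(/[^0-9]/g, ''), 10),
  };
}

const product = normalizeProduct({
  title: 'LETSCOM Fitness Tracker HR, Activity Tracker Watch with Heart Rate Monitor...',
  seller: 'LETSCOM',
  price: '$20.39',
  rating: '1,489 ratings',
});
console.log(product.price, product.ratingCount); // 20.39 1489
```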

### Scrape with SubQueries

This method captures each element matching `root` and runs the `queries` relative to it. It is useful for scraping
multiple products along with full information about each one. For example, a page may list 10 products, each with a
`title`, `url`, `image`, `price`, etc. The keys are preserved, and each value is returned under the key of its
expression. The `pagination` parameter is optional.

```javascript
const result = parser.subQuery({
  root: '//span[contains(@class, "zg-item")]',
  pagination: '//ul/li/a[contains(text(), "Next")]/@href',
  queries: {
    title: 'a/div/@title',
    url: 'a/@href',
    image: 'a/span/div/img/@src',
    price: './/span[contains(@class, "a-color-price")]',
  },
});
console.log(result);
```

Sample output:

```text
{
  paginationUrl: 'https://www.example.com/gp/new-releases/wireless/reTF8&pg=2',
  results: [
    {
      title: 'Cell Phone Stand,Angle Height Adjustable Stab/Kindle/Tablet,4-10inch',
      url: '/Adjustable-LISEN-Aluminum-Compatible-4-10&refRID=H1HWDWERK8YCRN76ER1T',
      image: 'https://images-na.ssl-images-example.com/images/I/61UL200_SR200,200_.jpg',
      price: '$16.99'
    },
    {
      title: 'Selfie Ring Light with Tripod Stand and Pheaming Photo Photography Vlogging Video',
      url: '/Selfie-Lighting-Steaming-Photography-Vlogging/dp/B081SV&K8YCRN76ER1T',
      image: 'https://images-na.ssl-images-example.com/images/I/717L200_SR200,200_.jpg',
      price: '$46.99'
    },
    {
      // ...
    }
  ]
}
```
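
In the sample output the `url` values are relative while `paginationUrl` is absolute, so a crawler will typically
resolve the relative links before following them. A minimal sketch using the standard `URL` constructor follows.
`absolutizeResults` is a hypothetical helper, `pageUrl` is assumed to be the address of the scraped page, and the item
data is a shortened placeholder rather than real output.

```javascript
// Hypothetical helper: turns relative `url` values into absolute ones.
// `pageUrl` is an assumption: the address of the page that was scraped.
function absolutizeResults(subQueryResult, pageUrl) {
  return {
    ...subQueryResult,
    results: subQueryResult.results.map((item) => ({
      ...item,
      // new URL(relative, base) resolves the link against the page address.
      url: new URL(item.url, pageUrl).href,
    })),
  };
}

const page = absolutizeResults(
  {
    paginationUrl: 'https://www.example.com/gp/new-releases/wireless/reTF8&pg=2',
    results: [{ title: 'Cell Phone Stand', url: '/stand/dp/EXAMPLE', price: '$16.99' }],
  },
  'https://www.example.com/gp/new-releases/wireless',
);
console.log(page.results[0].url); // https://www.example.com/stand/dp/EXAMPLE
```

The helper returns a new object instead of mutating its input, so the raw scrape result stays available for debugging.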

### Wait for Element

This method waits until an element matching the expression exists on the page. The first parameter, `expression`, is the
XPath expression to match. The second parameter, `maxSeconds`, is the maximum time to wait in seconds (defaults to 10).

```javascript
parser.waitXPath('//span[contains(@class, "a-color-price")]/span')
  .then((response) => {
    // the expression matched and the element exists
  })
  .catch((error) => {
    // nothing matched before the timeout
  });
```
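
Waiting for an element generally comes down to polling a condition until it holds or a deadline passes. The sketch
below shows that pattern in isolation. It is an assumption about the general technique, not the library's
implementation: `waitFor`, `check`, and `intervalMs` are hypothetical names, with `check` standing in for "does the
expression match anything yet?".

```javascript
// Hypothetical sketch of polling with a timeout; not the library's implementation.
// `check` stands in for "does the XPath expression match anything yet?".
function waitFor(check, maxSeconds = 10, intervalMs = 100) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + maxSeconds * 1000;
    const attempt = () => {
      const value = check();
      if (value) {
        resolve(value); // condition met: stop polling
      } else if (Date.now() >= deadline) {
        reject(new Error(`Timed out after ${maxSeconds} seconds`));
      } else {
        setTimeout(attempt, intervalMs); // try again shortly
      }
    };
    attempt();
  });
}
```

The returned promise resolves with whatever truthy value `check` produced and rejects on timeout, which mirrors the
`.then` / `.catch` split shown above.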

## Contribution

Contributions from the community are welcome. Please open a pull request for bug fixes, enhancements, new features, etc.

## Disclaimer

All the XPath expressions above were tested on the Amazon [product listing] and related pages, for educational purposes
only. The icons are from the [flaticon] website.

[product listing]: https://www.amazon.com/gp/new-releases/wireless

[flaticon]: https://www.flaticon.com