https://github.com/remotemerge/xpath-parser
JavaScript utility for extracting data from HTML and XML documents!
- Host: GitHub
- URL: https://github.com/remotemerge/xpath-parser
- Owner: remotemerge
- License: mit
- Created: 2020-04-09T06:43:23.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2024-04-09T11:38:31.000Z (about 1 year ago)
- Last Synced: 2025-04-16T15:37:19.191Z (20 days ago)
- Topics: delay, dom, javascript, query, scraper, scraping, subquery, typescript, xpath, xpath-expression
- Language: TypeScript
- Homepage:
- Size: 1.28 MB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# XPath Parser

[npm package](https://www.npmjs.com/package/@remotemerge/xpath-parser)


XPath Parser is a JavaScript utility for extracting data from HTML and XML documents, built for web scraping in a
JavaScript environment. It is open source, modern, lightweight and fast, and it integrates easily into new or existing
web crawlers, browser extensions, etc.

## Install

```bash
# using NPM
npm i @remotemerge/xpath-parser
# using Yarn
yarn add @remotemerge/xpath-parser
```

## Usage

Import the XPathParser class in your project.
```javascript
import XPathParser from '@remotemerge/xpath-parser'
```

## Examples

The constructor `XPathParser(html|DOM)` accepts either a DOM object or an HTML string; initialize it with whichever you
have.

```javascript
const parser = new XPathParser('...');
```

### Scrape First Match

This method evaluates the given expression and captures the first result. It is useful for scraping a single element
value like `title`, `price`, etc. from HTML pages.

```javascript
const result = parser.queryFirst('//span[@id="productTitle"]');
console.log(result);
```

Sample output:

```text
LETSCOM Fitness Tracker HR, Activity Tracker Watch with Heart Rate...
```

### Scrape All Matches

This method evaluates the given expression and captures all results. It is useful for scraping all URLs, all images, all
CSS classes, etc. from HTML pages.

```javascript
// scrape titles
const results = parser.queryList('//span[contains(@class, "zg-item")]/a/div');
console.log(results);
```

Sample output:

```javascript
['Cell Phone Stand,Angle Height Adjusta…', 'Selfie Ring Light with Tripod…', 'HOVAMP MFi Certified Nylon…', '...']
```

### Scrape Multiple Elements

This method loops through the given expressions and captures the first match of each expression. It is useful for
scraping full product information (`title`, `seller`, `price`, `rating`, etc.) from HTML pages. The keys are preserved
and each value is returned under the key of its expression.

```javascript
const result = parser.multiQuery({
  title: '//div[@id="ppd"]//span[@id="productTitle"]',
  seller: '//div[@id="ppd"]//a[@id="bylineInfo"]',
  price: '//div[@id="ppd"]//span[@id="priceblock_dealprice"]',
  rating: '//div[@id="ppd"]//span[@id="acrCustomerReviewText"]',
});
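
// Optional post-processing (a sketch, not part of the library): the values come
// back as text, so numeric fields such as `price` or `rating` have to be converted
// by hand. `toNumber` is a hypothetical helper name chosen for this example.
function toNumber(text) {
  // keep digits, dots and minus signs; drop currency symbols, commas and words
  const cleaned = String(text).replace(/[^0-9.-]/g, '');
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}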
```

Sample output:

```text
{
  title: 'LETSCOM Fitness Tracker HR, Activity Tracker Watch with Heart Rate Monitor...',
  seller: 'LETSCOM',
  price: '$20.39',
  rating: '1,489 ratings',
}
```

### Scrape with SubQueries

This method captures each `root` element and runs the queries relative to it. It is useful for scraping multiple
products together with the full information about each one. For example, a page may list 10 products, each with a
`title`, `url`, `image`, `price`, etc. The keys are preserved and each value is returned under the key of its
expression. The `pagination` parameter is optional.

```javascript
const result = parser.subQuery({
  root: '//span[contains(@class, "zg-item")]',
  pagination: '//ul/li/a[contains(text(), "Next")]/@href',
  queries: {
    title: 'a/div/@title',
    url: 'a/@href',
    image: 'a/span/div/img/@src',
    price: './/span[contains(@class, "a-color-price")]',
  }
});
console.log(result);
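
// Possible follow-up (a sketch, not part of the library): the sample `url` values
// for this query are relative, so they can be resolved against the page address
// with the standard URL constructor. `toAbsoluteUrls` and `baseUrl` are
// hypothetical names chosen for this example.
function toAbsoluteUrls(items, baseUrl) {
  // copy each item and replace its `url` with the absolute form
  return items.map((item) => ({ ...item, url: new URL(item.url, baseUrl).href }));
}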
```

Sample output:

```text
{
  paginationUrl: 'https://www.example.com/gp/new-releases/wireless/reTF8&pg=2',
  results: [
    {
      title: 'Cell Phone Stand,Angle Height Adjustable Stab/Kindle/Tablet,4-10inch',
      url: '/Adjustable-LISEN-Aluminum-Compatible-4-10&refRID=H1HWDWERK8YCRN76ER1T',
      image: 'https://images-na.ssl-images-example.com/images/I/61UL200_SR200,200_.jpg',
      price: '$16.99'
    },
    {
      title: 'Selfie Ring Light with Tripod Stand and Pheaming Photo Photography Vlogging Video',
      url: '/Selfie-Lighting-Steaming-Photography-Vlogging/dp/B081SV&K8YCRN76ER1T',
      image: 'https://images-na.ssl-images-example.com/images/I/717L200_SR200,200_.jpg',
      price: '$46.99'
    },
    {
      // ...
    }
  ]
}
```

### Wait for Element

This method waits until an element matched by the expression exists on the page. The first parameter, `expression`, is
the XPath expression to match; the second, `maxSeconds`, is the maximum time to wait in seconds (defaults to 10).

```javascript
parser.waitXPath('//span[contains(@class, "a-color-price")]/span')
  .then((response) => {
    // the expression matched and the element exists
  }).catch((error) => {
    // nothing matched before the timeout
  });
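
// The same wait written with async/await (a sketch; it assumes only that
// waitXPath returns a Promise, as the chain above shows). `waitForElement` is a
// hypothetical wrapper name; it resolves to null instead of rejecting.
async function waitForElement(parser, expression, maxSeconds = 10) {
  try {
    return await parser.waitXPath(expression, maxSeconds);
  } catch (error) {
    return null; // nothing matched before the timeout
  }
}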
```

## Contribution

Contributions from the community are welcome. Please open a pull request for bug fixes, enhancements, new features, etc.

## Disclaimer
All the XPath expressions above were tested on Amazon [product listing] and related pages, for educational purposes only.
The icons are from the [flaticon] website.

[product listing]: https://www.amazon.com/gp/new-releases/wireless
[flaticon]: https://www.flaticon.com