Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ryu1kn/procedural-page-crawler
Page Crawler. Tell it where to go and what to look for.
JSON representation
- Host: GitHub
- URL: https://github.com/ryu1kn/procedural-page-crawler
- Owner: ryu1kn
- Created: 2017-07-26T12:32:53.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-08-31T03:22:38.000Z (about 1 year ago)
- Last Synced: 2024-10-09T13:53:34.715Z (about 1 month ago)
- Topics: crawler, npm-package, scraper
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/procedural-page-crawler
- Size: 104 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
![Build](https://github.com/ryu1kn/procedural-page-crawler/workflows/Build/badge.svg?branch=master)
# Procedural Page Crawler
This crawler does:
* Receive instructions: where to go, what to do
* Execute every instruction one by one, making each expression's result available to the following steps

You can use this as a command line tool or as a JS library.
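The sequencing described above can be sketched without a browser, which may help clarify how results flow between steps. This is only an illustration, not the package's actual implementation; `runInstructions` and `fetchResults` are hypothetical names:

```javascript
// A browser-free sketch of the sequencing described above: each instruction's
// result is appended to context.instructionResults before the next one runs.
// `fetchResults` is a hypothetical stand-in for the headless-Chrome evaluation.
function runInstructions(instructions, fetchResults) {
    const context = {instructionResults: []};
    for (const instruction of instructions) {
        // `locations` may be a plain array of URLs or a function of the context
        const locations = typeof instruction.locations === 'function'
            ? instruction.locations(context)
            : instruction.locations;
        // One result per location, in order
        const results = locations.map(location =>
            fetchResults(location, instruction.expression));
        context.instructionResults.push(results);
    }
    return context;
}
```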
## Prerequisite
This crawler uses Headless Chrome, so Chrome needs to be installed on your machine.
## Disclaimer
This tool started off as a one-time JS script that helped another project. Later I found myself using
it in several of my other projects. When I switched the language to TypeScript, I needed to compile it and
publish it to an npm registry instead of installing it directly from its GitHub repo; hence this package.
You're welcome to use it, but I just want to make sure you have the right expectations... 🙂

## Usage
### Use it as a command line tool
```sh
$ node_modules/.bin/crawl --rule ./rule.js --output output.json
```

Here, `rule.js` would look like this. The result will be written to `output.json`.
```js
// rule.js
module.exports = {
    // Instructions to be executed
    instructions: [
        {
            // URLs to visit
            locations: ['https://a.example.com'],

            // Expression to be evaluated in the browser. Its result becomes available
            // to the following instructions as `context.instructionResults[INSTRUCTION_INDEX]`
            expression: "[...document.querySelectorAll('.where-to-go-next')].map(el => el.innerText)"
        },
        {
            // `locations` can also be a function of the context
            locations: context => {
                // Use the result of the 1st location of the 1st instruction
                return context.instructionResults[0][0];
            },
            expression: "[...document.querySelectorAll('.what-to-get')].map(el => el.innerText)"
        }
    ],

    // Here, the final result is the result of the 2nd instruction
    output: context => context.instructionResults[1]
};
```

### Use it as a library
You can do:
```js
import {Crawler} from 'procedural-page-crawler';

// Or, if you're using CommonJS modules rather than ECMAScript modules:
// const {Crawler} = await import('procedural-page-crawler');

const crawler = new Crawler();

const rule = {/* The same structure rule you give when you use the Crawler as a command line tool */};

crawler.crawl({rule}).then(output => {
    // `output` is the result of evaluating `rule.output`
});
```

For more information on how to use it as a library, see `src/bin/crawl.ts`.
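Before handing a rule to `crawler.crawl({rule})`, it can be handy to sanity-check its shape. The following is a minimal sketch based on the rule structure shown above; `validateRule` is a hypothetical helper, not part of this package:

```javascript
// validateRule: a hypothetical helper (not part of this package) that checks
// the rule shape described above before it is passed to crawler.crawl({rule}).
function validateRule(rule) {
    if (!Array.isArray(rule.instructions) || rule.instructions.length === 0) {
        throw new Error('rule.instructions must be a non-empty array');
    }
    for (const {locations, expression} of rule.instructions) {
        // `locations` is an array of URLs or a function of the context
        if (typeof locations !== 'function' && !Array.isArray(locations)) {
            throw new Error('locations must be an array of URLs or a function of context');
        }
        // `expression` is a string evaluated in the browser
        if (typeof expression !== 'string') {
            throw new Error('expression must be a string');
        }
    }
    if (typeof rule.output !== 'function') {
        throw new Error('rule.output must be a function of context');
    }
    return rule;
}
```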
## Test
```sh
$ yarn run test:e2e
```

## Refs
* [Getting Started with Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome)
* [Chrome DevTools Protocol Viewer](https://chromedevtools.github.io/devtools-protocol/tot/)