{"id":15547679,"url":"https://github.com/ryu1kn/procedural-page-crawler","last_synced_at":"2026-04-30T07:38:19.008Z","repository":{"id":33916111,"uuid":"98420678","full_name":"ryu1kn/procedural-page-crawler","owner":"ryu1kn","description":"Page Crawler. Tell it where to go and what to look for.","archived":false,"fork":false,"pushed_at":"2023-08-31T03:22:38.000Z","size":106,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-18T18:57:49.990Z","etag":null,"topics":["crawler","npm-package","scraper"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/procedural-page-crawler","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ryu1kn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-07-26T12:32:53.000Z","updated_at":"2022-04-02T09:36:27.000Z","dependencies_parsed_at":"2022-08-07T23:30:41.256Z","dependency_job_id":null,"html_url":"https://github.com/ryu1kn/procedural-page-crawler","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/ryu1kn/procedural-page-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryu1kn%2Fprocedural-page-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryu1kn%2Fprocedural-page-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryu1kn%2Fprocedural-page-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryu1kn%2Fprocedural-page-crawler/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/ryu1kn","download_url":"https://codeload.github.com/ryu1kn/procedural-page-crawler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryu1kn%2Fprocedural-page-crawler/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260915902,"owners_count":23082040,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","npm-package","scraper"],"created_at":"2024-10-02T13:10:01.380Z","updated_at":"2026-04-30T07:38:13.986Z","avatar_url":"https://github.com/ryu1kn.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Build](https://github.com/ryu1kn/procedural-page-crawler/workflows/Build/badge.svg?branch=master)\n\n# Procedural Page Crawler\n\nThis crawler does the following:\n\n* Receives instructions: where to go and what to do\n* Executes the instructions one by one, making each expression result available to the following steps\n\nYou can use this as a command line tool or a JS library.\n\n## Prerequisite\n\nThis crawler uses Headless Chrome, so Chrome needs to be installed on your machine.\n\n## Disclaimer\n\nThis tool started off as a one-time JS script to help another project. Later I found myself using\nit in several of my other projects. When I changed the language to TypeScript, I needed to compile and\npublish it to an npm registry instead of installing it directly from its GitHub repo, which is why it exists as a package.\nYou're welcome to use it, but I just want to make sure that you have the right expectations... 
🙂\n\n## Usage\n\n### Use it as a command line tool\n\n```sh\n$ node_modules/.bin/crawl --rule ./rule.js --output output.json\n```\n\nHere, `rule.js` would look like this. The result will be written to `output.json`.\n\n```js\n// rule.js\nmodule.exports = {\n\n    // Instructions to be executed\n    instructions: [\n        {\n            // URLs to visit\n            locations: ['https://a.example.com'],\n\n            // Expression to be executed in the browser. Expression result will become available\n            // for the following instructions as `context.instructionResults[INSTRUCTION_INDEX]`\n            expression: \"[...document.querySelectorAll('.where-to-go-next')].map(el =\u003e el.innerText)\"\n        },\n        {\n            // locations can be a function\n            locations: context =\u003e {\n                // Use the result of the 1st location of the 1st instruction\n                return context.instructionResults[0][0];\n            },\n            expression: \"[...document.querySelectorAll('.what-to-get')].map(el =\u003e el.innerText)\"\n        }\n    ],\n\n    // Here, the final result is the result of the 2nd instruction\n    output: context =\u003e context.instructionResults[1]\n}\n```\n\n### Use it as a library\n\nYou can use it like this:\n\n```js\nimport {Crawler} from 'procedural-page-crawler';\n\n// Or, if you're still using CommonJS modules rather than ECMAScript modules:\n// const {Crawler} = await import('procedural-page-crawler');\n\nconst crawler = new Crawler();\nconst rule = {/* The same rule structure you give when you use the Crawler as a command line tool */};\n\ncrawler.crawl({rule}).then(output =\u003e {\n    // `output` is the result of `rule.output` evaluation.\n});\n```\n\nFor more information on how to use it as a library, see `src/bin/crawl.ts`.\n\n## Test\n\n```sh\n$ yarn run test:e2e\n```\n\n## Refs\n\n* [Getting Started with Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome)\n* [Chrome 
DevTools Protocol Viewer](https://chromedevtools.github.io/devtools-protocol/tot/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryu1kn%2Fprocedural-page-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fryu1kn%2Fprocedural-page-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryu1kn%2Fprocedural-page-crawler/lists"}