{"id":25228901,"url":"https://github.com/miroshnikov/scrapyteer","last_synced_at":"2025-10-26T06:31:22.284Z","repository":{"id":92348579,"uuid":"346334045","full_name":"miroshnikov/scrapyteer","owner":"miroshnikov","description":"Web crawling \u0026 scraping framework for Node.js on top of headless Chrome browser","archived":false,"fork":false,"pushed_at":"2024-03-03T12:39:18.000Z","size":393,"stargazers_count":18,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-15T21:58:58.673Z","etag":null,"topics":["crawer","crawling","crawling-framework","crawling-sites","crawling-tool","headless","scrape","scraper","scraping","scraping-websites","scrapy","scrapy-crawler","spider","spider-framework","web-crawler","web-crawling","web-scraping","web-scraping-nodejs"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miroshnikov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-03-10T11:35:05.000Z","updated_at":"2024-06-01T19:29:31.000Z","dependencies_parsed_at":"2024-03-03T13:28:47.920Z","dependency_job_id":"8d7ff8c1-c54b-451c-a7d8-082fe2413318","html_url":"https://github.com/miroshnikov/scrapyteer","commit_stats":{"total_commits":24,"total_committers":2,"mean_commits":12.0,"dds":"0.41666666666666663","last_synced_commit":"a0548f85bfb8944a8252cde80bfa3c1026b758b0"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miroshnikov%2Fscrapyteer","tags_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miroshnikov%2Fscrapyteer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miroshnikov%2Fscrapyteer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miroshnikov%2Fscrapyteer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miroshnikov","download_url":"https://codeload.github.com/miroshnikov/scrapyteer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238281053,"owners_count":19446078,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawer","crawling","crawling-framework","crawling-sites","crawling-tool","headless","scrape","scraper","scraping","scraping-websites","scrapy","scrapy-crawler","spider","spider-framework","web-crawler","web-crawling","web-scraping","web-scraping-nodejs"],"created_at":"2025-02-11T10:46:16.985Z","updated_at":"2025-10-26T06:31:16.770Z","avatar_url":"https://github.com/miroshnikov.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Scrapyteer\n\nScrapyteer is a Node.js **web scraping** framework/tool/library built on top of the headless Chrome browser Puppeteer.        
\nIt allows you to scrape both plain HTML pages and JavaScript-generated content, including SPAs (Single-Page Applications) of any kind.\nScrapyteer offers a small set of functions that form an easy and concise DSL (Domain Specific Language) for web scraping and lets you define a **crawling workflow** and a **shape of output data**. \n\n- [Examples](#examples)\n- [Installation](#installation)\n- [Configuration options](#configuration-options)\n- [API](#api)\n\n## Examples\nScrapyteer uses a configuration file (`scrapyteer.config.js` by default). \nHere are some examples:\n\n### Simple example\nSearch for books on [amazon.com](https://www.amazon.com) and get the titles and ISBNs of the books on the first page of the results.\n\n```js\nconst { pipe, open, select, enter, $$, $, text } = require('scrapyteer');\n\nmodule.exports = {\n    root: 'https://www.amazon.com',\n    parse: pipe(\n        open(),     // open amazon.com\n        select('#searchDropdownBox', 'search-alias=stripbooks-intl-ship'),  // select 'Books' in dropdown\n        enter('#twotabsearchtextbox', 'Web scraping'),   // enter search phrase 'Web scraping'\n        $$('.a-section h2'),    // for every H2 on page\n        {\n            name: text,         // name = inner text of H2 element\n            ISBN: pipe(         // go to link and grab ISBN from there if present\n                $('a'),\n                open(),         // open 'href' attribute of passed A element\n                $('#rpi-attribute-book_details-isbn13 .rpi-attribute-value span'), \n                text            // grab inner text of a previously selected element\n            )\n        }\n    )\n}\n/*\noutput.json\n\n[\n    {\n        \"name\": \"Web Scraping with Python: Collecting More Data from the Modern Web  \",\n        \"ISBN\": \"978-1491985571\"\n    },\n    ...\n]\n*/\n```\n\n### More elaborate example\nSearch for books on [amazon.com](https://www.amazon.com), save a number of attributes to a `JSON lines` file and download the cover 
image of each book to a local directory.\n```js\nconst { pipe, open, select, enter, $$, $, text, save } = require('scrapyteer');\n\nmodule.exports = {\n    root: 'https://www.amazon.com',\n    save: 'books.jsonl',   // saves as jsonl\n    parse: pipe(\n        open(),     // open amazon.com\n        select('#searchDropdownBox', 'search-alias=stripbooks-intl-ship'),  // select 'Books' in dropdown\n        enter('#twotabsearchtextbox', 'Web scraping'),   // enter search phrase\n        $$('.a-section h2 \u003e a'),    // for every H2 link on page\n        open(),         // open 'href' attribute of passed A element\n        {\n                // on book's page grab all the necessary values\n            name: $('#productTitle'),\n            ISBN: $('#rpi-attribute-book_details-isbn13 .rpi-attribute-value span'),\n            stars: pipe($('#acrPopover \u003e span \u003e a \u003e span'), text, parseFloat),  // number of stars as float\n            ratings: pipe($('#acrCustomerReviewLink \u003e span'), text, parseInt),   // convert inner text that looks like 'NNN ratings' into an integer\n            cover: pipe(                // save cover image as a file and set cover = file name\n                $(['#imageBlockContainer img', '#ebooks-main-image-container img']),     // try several selectors\n                save({dir: 'cover-images'})\n            )   \n        }\n    )\n}\n/*\nbooks.jsonl\n\n{\"name\":\"Web Scraping with Python: Collecting More Data from the Modern Web\",\"ISBN\":\"978-1491985571\",\"stars\":4.6,\"ratings\":201,\"cover\":\"sitb-sticker-v3-small._CB485933792_.png\"}\n{\"name\":\"Web Scraping Basics for Recruiters: Learn How to Extract and Scrape Data from the Web\",\"ISBN\":null,\"stars\":4.9,\"ratings\":15,\"cover\":\"41esb-CVhsL.jpg\"}\n...\n*/\n```\n\n## Installation\n### Locally \n```sh\nnpm i -D scrapyteer\nnpm exec -- scrapyteer --config myconf.js  
# OR npx scrapyteer --config myconf.js\n```\n### Locally as dependency\n```sh\nnpm init\nnpm i -D scrapyteer\n```\nIn `package.json`:\n```json\n\"scripts\": {\n  \"scrape\": \"scrapyteer --config myconf.js\"\n}\n```\n```sh\nnpm run scrape\n```\n\n### Globally\n```sh\nnpm install -g scrapyteer\nscrapyteer --config myconf.js\n```\nMake sure `$NODE_PATH` points to where global packages are located. \nIf it doesn't, you may need to set it, e.g. `export NODE_PATH=/path/to/global/node_modules`\n\n\n## Configuration options\n\n### save \nA file name or the `console` object; by default `output.json` in the current directory.     \n`*.json` and `*.jsonl` are currently supported.   \nIf the format is `json`, the data is first collected in memory and then dumped to the file in one go; with `jsonl`, the data is written line by line (good for large datasets).\n\n### root\nThe root URL to scrape\n\n### parse\nThe parsing workflow: a `pipe` function, an object or an array\n\n### log\n`log: true` turns on log output for debugging\n\n### noRevisit\nSet to `true` to avoid revisiting already visited pages\n\n### options\n```typescript\n    options: {\n        browser: {\n            headless: false\n        }\n    }\n```\n\n\n## API\n\n### pipe\n```typescript\npipe(...args: any[])\n```\nReceives a set of functions and invokes them from left to right, supplying the return value of the previous one as input for the next. If an argument is not a function, it is converted to one (by `identity`).    \nFor objects and arrays _all of their items/properties are also parsed_.    \nIf the return value is an `array`, _the rest of the function chain will be invoked for all of its items_.\n\n### open\nOpens the given URL or the root URL\n```typescript\nopen(url: string|null = null)\n```\n\n### $ / $$\n```typescript\n$(selector: string|string[]) =\u003e Element|null\n$$(selector: string|string[]) =\u003e Element[]\n```\nCalls `querySelector` / `querySelectorAll` on page/element.     
\nIf an array of selectors is passed, the first one that matches an element is used. This is useful when the data may appear in different places in the DOM.\n\n### attr\nReturns an element's property value \n```typescript\nattr(name: string)\n```\n\n### text\nReturns the text content of an element\n\n### save\n```typescript\nsave({dir='files'}: {dir: string, saveAs?: (name: string, ext: string) =\u003e string})\n```\nSaves a link to a file and returns the file name.   \n`saveAs` lets you modify the saved file name or extension.\n\n### type\nTypes text into an input\n```typescript\ntype(inputSelector: string, text: string, delay = 0)\n```\n\n### select\nSelects one or more values in a `select` element\n```typescript\nselect(selectSelector: string, ...values: string[])\n```\n\n### enter\nTypes text into an input and presses Enter\n```typescript\nenter(inputSelector: string, text: string, delay = 0)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiroshnikov%2Fscrapyteer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiroshnikov%2Fscrapyteer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiroshnikov%2Fscrapyteer/lists"}