{"id":13447890,"url":"https://github.com/joseconstela/webparsy","last_synced_at":"2025-03-17T09:30:41.913Z","repository":{"id":34291438,"uuid":"174712768","full_name":"joseconstela/webparsy","owner":"joseconstela","description":"Node.JS library and cli for scraping websites using Puppeteer (or not) and YAML definitions","archived":false,"fork":false,"pushed_at":"2022-12-30T19:21:03.000Z","size":686,"stargazers_count":44,"open_issues_count":19,"forks_count":7,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-02-27T21:50:26.207Z","etag":null,"topics":["browser","chrome","headless","nodejs","puppeteer","yaml"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/webparsy","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joseconstela.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-09T15:40:49.000Z","updated_at":"2024-07-18T13:49:36.000Z","dependencies_parsed_at":"2023-01-15T05:56:28.876Z","dependency_job_id":null,"html_url":"https://github.com/joseconstela/webparsy","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joseconstela%2Fwebparsy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joseconstela%2Fwebparsy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joseconstela%2Fwebparsy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joseconstela%2Fwebparsy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joseconst
ela","download_url":"https://codeload.github.com/joseconstela/webparsy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243858868,"owners_count":20359257,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["browser","chrome","headless","nodejs","puppeteer","yaml"],"created_at":"2024-07-31T05:01:29.614Z","updated_at":"2025-03-17T09:30:41.520Z","avatar_url":"https://github.com/joseconstela.png","language":"JavaScript","funding_links":[],"categories":["JavaScript","Tools"],"sub_categories":[],"readme":"# ![WebPary logo](logo.png)\n\u003c!-- ALL-CONTRIBUTORS-BADGE:START - Do not remove or modify this section --\u003e\n[![All Contributors](https://img.shields.io/badge/all_contributors-2-orange.svg?style=flat-square)](#contributors-)\n\u003c!-- ALL-CONTRIBUTORS-BADGE:END --\u003e\n\n\u003e Fast and light NodeJS library and cli to scrape and interact with websites using [Puppeteer](https://github.com/GoogleChrome/puppeteer) ([or not](#goto)) and [YAML definitions](https://en.wikipedia.org/wiki/YAML)\n\n```yaml\nversion: 1\njobs:\n  main:\n    steps:\n      - goto: https://github.com/marketplace?category=code-quality\n      - pdf:\n          path: Github_Tools.pdf\n          format: A4\n      - many: \n          as: github_tools\n          event: githubTool\n          selector: main .col-lg-9.mt-1.mb-4.float-lg-right a.col-md-6.mb-4.d-flex.no-underline\n          element:\n            - property:\n                selector: a\n                type: string\n                property: href\n                as: url\n           
     transform: absoluteUrl\n            - text:\n                selector: h3.h4\n                type: string\n                transform: trim\n                as: name\n            - text:\n                selector: p\n                type: string\n                transform: trim\n                as: description\n```\n\n_Return an array with Github's tools, and creates a PDF. Example output:_\n\n```json\n{\n  \"github_tools\": [\n    {\n      \"url\": \"https://github.com/marketplace/codelingo\",\n      \"name\": \"codelingo\",\n      \"description\": \"Your Code, Your Rules - Automate code reviews with your own best practices\"\n    },\n    {\n      \"url\": \"https://github.com/marketplace/codebeat\",\n      \"name\": \"codebeat\",\n      \"description\": \"Code review expert on demand. Automated for mobile and web\"\n    },\n    ...\n  ]\n}\n```\n\n---\n\nDon't panic. There are examples for all WebParsy features in the examples folder. This are as basic as possible to help you get started.\n\n\n## Contributors ✨\n\nThanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):\n\n\u003c!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section --\u003e\n\u003c!-- prettier-ignore-start --\u003e\n\u003c!-- markdownlint-disable --\u003e\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003ca href=\"https://github.com/Dumi-k\"\u003e\u003cimg src=\"https://avatars0.githubusercontent.com/u/23239829?v=4\" width=\"100px;\" alt=\"\"/\u003e\u003cbr /\u003e\u003csub\u003e\u003cb\u003eDumi-k\u003c/b\u003e\u003c/sub\u003e\u003c/a\u003e\u003cbr /\u003e\u003ca href=\"https://github.com/joseconstela/webparsy/issues?q=author%3ADumi-k\" title=\"Bug reports\"\u003e🐛\u003c/a\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003ca href=\"https://www.kiliancm.fr\"\u003e\u003cimg src=\"https://avatars3.githubusercontent.com/u/45060645?v=4\" width=\"100px;\" alt=\"\"/\u003e\u003cbr 
/\u003e\u003csub\u003e\u003cb\u003eKilianCM\u003c/b\u003e\u003c/sub\u003e\u003c/a\u003e\u003cbr /\u003e\u003ca href=\"#ideas-KilianCM\" title=\"Ideas, Planning, \u0026 Feedback\"\u003e🤔\u003c/a\u003e \u003ca href=\"https://github.com/joseconstela/webparsy/commits?author=KilianCM\" title=\"Code\"\u003e💻\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003c!-- markdownlint-enable --\u003e\n\u003c!-- prettier-ignore-end --\u003e\n\u003c!-- ALL-CONTRIBUTORS-LIST:END --\u003e\n\nThis project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!\n\n##### Table of Contents\n\n- [Overview](#overview)\n- [Browser config](#browser-config)\n- [Output](#output)\n- [Transform](#transform)\n- [Types](#types)\n- [Multi-jobs](#multi-jobs-support)\n- [Steps](#steps)\n  * [setContent](#setContent) Sets the HTML markup to assign to the page.\n  * [goto](#goto) Navigate to a URL\n  * [run](#run) Runs a group of steps by its name.\n  * [goBack](#goBack) Navigate to the previous page in history\n  * [screenshot](#screenshot) Takes a screenshot of the page\n  * [pdf](#pdf) Takes a PDF of the page\n  * [text](#text) Gets the text for a given CSS selector\n  * [many](#many) Returns an array of elements given their CSS selectors\n  * [title](#title) Gets the title for the current page.\n  * [form](#form) Fill and submit forms\n  * [html](#html) Return HTML code for the page or a DOM element\n  * [click](#click) Click on an element (CSS and xPath selectors)\n  * [url](#url) Return the current URL\n  * [type](#type) Types a text (key events) in a given selector\n  * [waitFor](#waitFor) Wait for selectors, time, functions, etc. before continuing\n  * [keyboardPress](#keyboardPress) Simulates the press of a keyboard key\n  * [scrollTo](#scrollTo) Scroll to bottom, top, x, y, selector, xPath before continuing\n  * [scrollToEnd](#scrollToEnd) Scrolls to the very bottom (infinite scroll 
pages)\n\n## Overview\n\nYou can use WebParsy either as a CLI from your terminal or as a NodeJS library.\n\n### Cli\n\n*Install webparsy:*\n```bash\n$ npm i webparsy -g\n```\n\n```bash\n$ webparsy example/_weather.yml --customFlag \"custom flag value\"\nResult:\n\n{\n  \"title\": \"Madrid, España Pronóstico del tiempo y condiciones meteorológicas - The Weather Channel | Weather.com\",\n  \"city\": \"Madrid, España\",\n  \"temp\": 18\n}\n```\n\n### Library\n\n```javascript\nconst webparsy = require('webparsy')\nconst parsingResult = await webparsy.init({\n  file: 'jobdefinition.yml',\n  flags: { ... } // optional\n})\n```\n\n#### Methods\n\n##### init(options)\n\noptions:\n\nOne of `yaml`, `file` or `string` is required.\n\n- `yaml`: A [yaml npm module](https://www.npmjs.com/package/yaml) instance of the scraping definition.\n- `string`: The YAML definition, as a plain string.\n- `file`: The path for the YAML file containing the scraping definition.\n\nAdditionally, you can pass a `flags` object property to input additional values\nto your scraping process.\n\n## Browser config\n\nYou can set up Chrome's details in the `browser` property within the main job.\n\nNone of the following settings are required.\n\n```yaml\njobs:\n  main:\n    browser:\n      width: 1200\n      height: 800\n      scaleFactor: 1\n      timeout: 60\n      delay: 0\n      headless: true\n      executablePath: ''\n      userDataDir: ''\n      keepOpen: false\n```\n\n- executablePath: If provided, webparsy will launch Chrome from the specified\npath.\n- userDataDir: If provided, webparsy will launch Chrome with the specified\nuser's profile.\n\n## Output\n\nIn order for WebParsy to get contents, it needs some very basic details. These are:\n\n- `as` the property you want to be returned\n- `selector` the CSS selector to extract the HTML or text from\n\nOther options are:\n\n- `parent` Get the parent of the element filtered by a selector. 
\n\nExample\n\n```yaml\ntext:\n  selector: .entry-title\n  as: entryLink\n  parent: a\n```\n\n## Transform\n\nWhen you extract text from a web page, you might want to transform the data\nbefore returning it. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/transform.yml)\n\nYou can use the following `- transform` methods:\n\n- `uppercase` transforms the result to uppercase\n- `lowercase` transforms the result to lowercase\n- `absoluteUrl` returns the absolute URL for a link\n\n## Types\n\nWhen extracting details from a page, you might want them to be returned in\ndifferent formats, for example as a number when grabbing temperatures.\n[example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/type.yml)\n\nYou can use the following values for `- type`:\n\n- `string`\n- `number` \n- `integer` \n- `float` \n- `fcd` transforms to **f**loat a string-number that uses **c**omma for thousands\n- `fdc` transforms to **f**loat a string-number that uses **d**ot for thousands\n\n## Multi-jobs support\n\nYou can define groups of steps (jobs) that you can reuse at any moment during a\nscraping process.\n\nFor example, let's say you want to sign up twice on a website. 
You will have a\n\"main\" job (that executes by default) but you can create an additional one called\n\"signup\", that you can reuse in the \"main\" one.\n\n```yaml\nversion: 1\njobs:\n  main:\n    steps:\n      - goto: https://example.com/\n      - run: signup\n      - click: '#logout'\n      - run: signup\n  signup:\n    steps:\n      - goto: https://example.com/register\n      - form:\n          selector: \"#signup-user\"\n          submit: true\n          fill:\n            - selector: '[name=\"username\"]'\n              value: jonsnow@example.com\n```\n\n## Steps\n\nSteps are the list of things the browser must do.\n\n## setContent\n\nSets the HTML markup to assign to the page.\n\nSetting a string:\n\n```yaml\n- setContent:\n    html: Hello!\n```\n\nLoading the HTML from a file:\n\n```yaml\n- setContent:\n    file: myMarkup.html\n```\n\nLoading the HTML from an environment variable:\n\n```yaml\n- setContent:\n    env: MY_MARKUP_ENVIRONMENT_VARIABLE\n```\n\nLoading the HTML from a flag:\n\n```yaml\n- setContent:\n    flag: markup\n```\n\n## goto\n\nThe URL to navigate the page to. The URL should include the scheme, e.g. https://. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/goBack.yml)\n\n```yaml\n- goto: https://example.com\n```\n\nYou can also tell WebParsy not to use Puppeteer to browse, and instead do a\ndirect HTTP request via got. This will perform much faster, but it may not be\nsuitable for websites that require JavaScript. 
[simple example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/getRequest.yml) / \n[extended example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/many_using_get.yml)\n\nNote that some methods (for example: `form`, `click` and others) will not be\navailable if you are not browsing using Puppeteer.\n\n```yaml\n- goto:\n    url: https://google.com\n    method: got\n```\n\nYou can also tell WebParsy which URLs it should visit via flags (available via\ncli and library). Example:\n\n```yaml\n- goto:\n    flag: websiteUrl\n```\n\nYou can then call webparsy as:\n\n```bash\nwebparsy definition.yml --websiteUrl \"https://google.com\"\n```\n\nor \n\n```javascript\nwebparsy.init({\n  file: 'definition.yml',\n  flags: { websiteUrl: 'https://google.com' }\n})\n```\n\n[example](https://github.com/joseconstela/webparsy/blob/master/examples/flags.js)\n\n### Authentication\n\nYou can perform basic HTTP authentication by providing the user and password as in the following example:\n\n```yml\n- goto: \n    url: http://example.com\n    method: got\n    authentication:\n      type: basic\n      username: my_user\n      password: my_password\n```\n\n\n## run\n\nRuns a group of steps by its name.\n\n```yaml\n- run: signupProcess\n```\n\n## goBack\n\nNavigate to the previous page in history. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/goBack.yml)\n\n```yaml\n- goBack\n```\n\n## screenshot\n\nTakes a screenshot of the page. 
This triggers Puppeteer's [page.screenshot](https://github.com/GoogleChrome/puppeteer/blob/v1.13.0/docs/api.md#pagescreenshotoptions).\n[example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/screenshot.yml)\n\n```yaml\n- screenshot:\n  - path: Github.png\n```\n\nIf you are using WebParsy as a NodeJS module, you can also get the screenshot\nreturned as a Buffer by using the `as` property.\n\n```yaml\n- screenshot:\n  - as: myScreenshotBuffer\n```\n\n## pdf\n\nTakes a PDF of the page. This triggers Puppeteer's [page.pdf](https://github.com/GoogleChrome/puppeteer/blob/v1.13.0/docs/api.md#pagepdfoptions)\n\n```yaml\n- pdf:\n  - path: Github.pdf\n```\n\nIf you are using WebParsy as a NodeJS module, you can also get the PDF file\nreturned as a Buffer by using the `as` property.\n\n```yaml\n- pdf:\n  - as: pdfFileBuffer\n```\n\n## title\n\nGets the title for the current page. If no output.as property is defined, the\npage's title will be returned as `{ title }`. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/goBack.yml)\n\n```yaml\n- title\n```\n\n## many\n\nReturns an array of elements given their CSS selectors. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/many.yml)\n\nExample: \n\n```yaml\n- many: \n    as: articles\n    selector: main ol.articles-list li.article-item\n    element:\n      - text:\n          selector: .title\n          as: title\n```\n\nWhen you scrape large amounts of content, you might end up consuming hordes of RAM,\nyour system might become slow and the scraping process might fail.\n\nTo prevent this, WebParsy allows you to use process events so you can have the\nscraped contents as they are scraped, instead of storing them in memory and\nwaiting for the whole process to finish.\n\nTo do this, simply add an `event` property to `many`, with the event's name you\nwant to listen to. The event will contain each scraped item.\n\n`event` will give you the data as it's being scraped. 
To prevent it from being\nstored in memory, set `eventMethod` to `discard`.\n\n[Example using events](https://github.com/joseconstela/webparsy/blob/master/examples/many_event.js)\n\n## form\n\nFill and submit forms. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/form.yml)\n\nForm filling can use values from environment variables. This is useful if you\nwant to keep users' login details secret. In that case, instead of\nspecifying the value as a string, set the `env` property of `value`. Check\nthe example below or refer to [banking example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/form.yml)\n\nExample: \n\n```yaml\n- form:\n    selector: \"#tsf\"            # form selector\n    submit: true               # Submit after filling all details\n    fill:                      # array of inputs to fill\n      - selector: '[name=\"q\"]' # input selector\n        value: test            # input value\n```\n\nUsing environment variables\n```yaml\n- form:\n    selector: \"#login\"            # form selector\n    submit: true                  # Submit after filling all details\n    fill:                         # array of inputs to fill\n      - selector: '[name=\"user\"]' # input selector\n        value:\n          env: USERNAME           # process.env.USERNAME\n      - selector: '[name=\"pass\"]' \n        value: \n          env: PASSWORD           # process.env.PASSWORD\n```\n\n## html\n\nGets the HTML code. If no `selector` is specified, it returns the page's full HTML\ncode. If no output.as property is defined, the result will be returned\nas `{ html }`. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/html.yml)\n\nExample: \n\n```yaml\n- html:\n    as: divHtml\n    selector: div\n```\n\n## click\n\nClick on an element. 
[example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/click.yml)\n\nExample:\n\nDefault behaviour (CSS selector)\n```yaml\n- click: button.click-me\n```\n\nSame as\n```yaml\n- click: \n    selector: button.click-me\n```\n\nBy xPath (clicks on the first match)\n```yaml\n- click: \n    xPath: '/html/body/div[2]/div/div/div/div/div[3]/span'\n```\n\n## type\n\nSends a `keydown`, `keypress/input`, and `keyup` event for each character in\nthe text.\n\nExample:\n\n```yaml\n- type:\n    selector: input.user\n    text: jonsnow@example.com\n    options:\n      delay: 4000\n```\n\n## url\n\nReturn the current URL.\n\nExample:\n\n```yaml\n- url:\n    as: currentUrl\n```\n\n## waitFor\n\nWaits for a CSS selector, an XPath selector, a function, or a specific amount of time before\ncontinuing. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/form.yml)\n\nExamples: \n\n```yaml\n- waitFor:\n   selector: \"#search-results\"\n```\n\n```yaml\n- waitFor:\n   xPath: \"/html/body/div[1]/header/div[1]/a/svg\"\n```\n\n```yaml\n- waitFor:\n   function: \"console.log(Date.now())\"\n```\n\n```yaml\n- waitFor:\n    time: 1000 # Time in milliseconds\n```\n\n## keyboardPress\n\nSimulates the press of a keyboard key. 
[extended docs](https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#keyboardpresskey-options)\n\n```yaml\n- keyboardPress: \n    key: 'Enter'\n```\n\n## scrollTo\n\nScrolls to a specified CSS or XPath selector, to the bottom/top, or to a specified x/y value before continuing. [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/scrollTo.yml)\n\nExamples: \n\n```yaml\n- scrollTo:\n   top: true\n```\n\n```yaml\n- scrollTo:\n   bottom: true\n```\n\n```yaml\n- scrollTo:\n   x: 340\n```\n\n```yaml\n- scrollTo:\n   y: 500\n```\n\n```yaml\n- scrollTo:\n   selector: \"#search-results\"\n```\n\n```yaml\n- scrollTo:\n   xPath: \"/html/body/div[1]/header/div[1]/a/svg\"\n```\n\n## scrollToEnd\n\nScrolls to the very bottom (infinite scroll pages). [example](https://github.com/joseconstela/webparsy/blob/master/examples/methods/scrollToEnd.yml)\n\nThis accepts three settings:\n- **step:** how many pixels to scroll every time. Default is 10.\n- **max:** the maximum number of pixels to scroll down - so you are not waiting for decades on never-ending infinite scroll pages. Default is 9999999.\n- **sleep:** how long to wait between scrolls - in milliseconds. Default is 100.\n\nExamples:\n\n```yaml\n- scrollToEnd\n```\n\n```yaml\n- scrollToEnd:\n    step: 300\n    sleep: 1000\n    max: 300000\n```\n\n## License\n\nMIT © [Jose Constela](https://joseconstela.com)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoseconstela%2Fwebparsy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoseconstela%2Fwebparsy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoseconstela%2Fwebparsy/lists"}