{"id":15543679,"url":"https://github.com/michaeluno/web-page-dumper","last_synced_at":"2025-04-23T17:28:38.457Z","repository":{"id":152019366,"uuid":"314176202","full_name":"michaeluno/web-page-dumper","owner":"michaeluno","description":"Dumps web page outputs including JavaScript generated contents.","archived":false,"fork":false,"pushed_at":"2024-05-04T07:55:22.000Z","size":3170,"stargazers_count":5,"open_issues_count":3,"forks_count":28,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-03T12:33:00.308Z","etag":null,"topics":["corss-domain-solution","cross-domain","cross-domain-requests","dumper","heroku","heroku-app","heroku-application","node-js","nodejs","pdf","proxy","scraper","web-proxy"],"latest_commit_sha":null,"homepage":"https://web-page-dumper.herokuapp.com","language":"CSS","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michaeluno.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-19T07:59:23.000Z","updated_at":"2024-02-29T22:40:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"6e1eb5dd-009b-4a62-9355-a8ebc17ba18e","html_url":"https://github.com/michaeluno/web-page-dumper","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaeluno%2Fweb-page-dumper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaeluno%2Fweb-page-dumper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaeluno%2Fweb-pa
ge-dumper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaeluno%2Fweb-page-dumper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michaeluno","download_url":"https://codeload.github.com/michaeluno/web-page-dumper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250479848,"owners_count":21437442,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corss-domain-solution","cross-domain","cross-domain-requests","dumper","heroku","heroku-app","heroku-application","node-js","nodejs","pdf","proxy","scraper","web-proxy"],"created_at":"2024-10-02T12:28:00.442Z","updated_at":"2025-04-23T17:28:38.450Z","avatar_url":"https://github.com/michaeluno.png","language":"CSS","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Page Dumper\nDumps web page outputs including JavaScript generated contents.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/image/screenshot.jpg\" width=\"600\" alt=\"Web Page Dumper\" title=\"screenshot\"\u003e \n\u003c/p\u003e\n\n## Demo\nVisit [here](https://web-page-dumper.herokuapp.com/). If the server is sleeping, it takes several seconds to wake up. \n\n## Usage\n\nAccess the app address following the path `/www/` with query parameters of the GET or POST method. \n\ne.g.\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fexample.org\n```\n\n### Query Parameters\n\nOnly the `url` parameter is required. The rest is optional. 
\n\nFor boolean values, use `1` or `0` instead of `true` or `false`.\n\n#### (required, string) `url`\nA _URL-encoded_ URL to fetch. \n\n\u003e Note: It is important to pass a URL-encoded value, especially when the URL includes query parameters, so that the requested URL's parameters do not mix with this app's own parameters.   \n\ne.g.\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgithub.com%2F\n```\n\n\n#### (string) `output`\nThe output type. Accepts the following values:\n- `json` (default) - outputs the site source code, the HTTP headers, the HTTP status code, and the content type as JSON with the following root keys: \n  - `url` - (string) the requested URL.\n  - `query` - (array) the HTTP request query key-value pairs.\n  - `resourceType` - (string) the request resource type.\n  - `contentType` - (string) the HTTP response content type, same as the HTTP header `Content-Type` entry.\n  - `status` - (integer) the HTTP status code as a number such as `200` and `404`.\n  - `headers` - (array) the HTTP headers.\n  - `body`   - (string) the HTTP body, usually an HTML document.\n- `text`, `txt` - outputs the site source as a text document. Use this for non-HTML documents such as XML and JSON. \n- `html`, `htm` - outputs the site source as `html` or `htm`. The HTTP headers will be omitted.\n- `mhtml` - outputs the site source as `mhtml`.\n- `png`, `jpg`, `jpeg` - outputs a screenshot image of the site.\n- `pdf`\n\n#### (array) `omit` [1.7.0+] \nWhich elements to omit when the `json` output is specified. Pass non-empty values such as `1`. \n\ne.g. This omits the `query` and `body` elements from the response. 
\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fwww.google.com\u0026output=json\u0026omit[query]=1\u0026omit[body]=1\n```\n\n#### (array) `viewport`\n\nSets the browser viewport used to render the page.\n\ne.g.\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fwww.google.com\u0026output=jpg\u0026set_viewport=1\u0026viewport[width]=800\u0026viewport[height]=1200\u0026viewport[deviceScaleFactor]=5\n```\n \nAccepts the following arguments, same as Puppeteer's `page.setViewport()` method arguments.\n \n\u003e  - `width` (number) page width in pixels.\n\u003e  - `height` (number) page height in pixels.\n\u003e  - `deviceScaleFactor` (number) Specify device scale factor (can be thought of as dpr). Defaults to `1`.\n\u003e  - `isMobile` (boolean) Whether the `meta viewport` tag is taken into account. Defaults to `false`.\n\u003e  - `isLandscape` (boolean) Specifies if viewport is in landscape mode. Defaults to `false`.\n\u003e  \n\u003e -- [Puppeteer API Tip-Of-Tree page.setViewport(viewport)][1]\n\n[1]: https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagesetviewportviewport\n\nDoes not accept the following arguments.\n\n- `hasTouch`\n\n#### (array) `screenshot`\n\nSets screenshot options. This takes effect when the `output` parameter is one of `jpg`, `jpeg`, `png`, or `gif`. \n\ne.g. \n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgithub.com%2F\u0026output=jpg\u0026screenshot[quality]=10\u0026screenshot[omitBackground]=1\n```\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com%2F\u0026output=png\u0026screenshot[clip][x]=50\u0026screenshot[clip][y]=80\u0026screenshot[clip][width]=700\u0026screenshot[clip][height]=200\n```\n\nAccepts the following arguments, same as Puppeteer's `page.screenshot()` method arguments.\n  \n\u003e  - `quality` (number) The quality of the image, between 0-100. Not applicable to `png` images.\n\u003e  - `clip` (object) An object which specifies clipping region of the page. 
Should have the following fields:\n\u003e    - `x` (number) x-coordinate of top-left corner of clip area\n\u003e    - `y` (number) y-coordinate of top-left corner of clip area\n\u003e    - `width` (number) width of clipping area\n\u003e    - `height` (number) height of clipping area\n\u003e  - `omitBackground` (boolean) Hides default white background and allows capturing screenshots with transparency. Defaults to `false`.\n\u003e \n\u003e -- [Puppeteer API Tip-Of-Tree page.screenshot([options])][2]\n\n[2]: https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagescreenshotoptions\n\nDoes not accept the following arguments.\n\n- `path`\n- `encoding`\n- `type`\n- `fullPage` - when the `clip` argument is not set, a full-page screenshot is taken.  \n\n#### (integer) `reload`  \nSpecifies whether to reload the page in the internal browser. This is useful for cookie-dependent web pages.  \n\nAccepts `0`, `1`, or `2`.\n\n - `0`: does not reload the page.\n - `1`: reloads only when the HTTP status is greater than or equal to `400`, such as `404` or `500`.\n - `2`: reloads regardless of the HTTP status.\n \nIf a value that is not listed above is passed and it evaluates to `true`, the value of `2` will be applied. \n\n#### (integer) `cache`\nDecides whether to use browser caches.\n\nAccepts `1` or `0`.\n\n#### (integer) `timeout`\nThe browser connection timeout in milliseconds. \n\nIf the `WPD_TIMEOUT` environment variable value is set and shorter than this value, the `WPD_TIMEOUT` value will be used.\n\nDefault: `29000`.\n\n#### (string) `user_agent`\nSpecifies a user agent.\n\n#### (string) `username`\nFor a site that requires basic authentication, set a user name with this parameter.\n\n#### (string) `password`\nFor a site that requires basic authentication, set a password with this parameter.\n\n#### (array) `pdf`\nWhen the output type is set to `pdf`, the following sub-arguments of the `pdf` parameter are accepted. 
\n\nFor more details, please see [Puppeteer's PDF options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagepdfoptions), as the arguments are the same except for some unsupported ones. \n\ne.g.\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgithub.com\u0026output=pdf\u0026pdf[scale]=0.5\u0026pdf[printBackground]=1\u0026pdf[pageRanges]=1-3\u0026pdf[format]=Legal\n```\n\n##### Accepted Arguments  \n\n\u003e  - `scale` (number) Scale of the webpage rendering. Defaults to `1`. Scale amount must be between 0.1 and 2.\n\u003e  - `displayHeaderFooter` (boolean) Display header and footer. Defaults to `false`.\n\u003e  - `headerTemplate` (string) HTML template for the print header. Should be valid HTML markup with following classes used to inject printing values into them:\n\u003e    - `date` formatted print date\n\u003e    - `title` document title\n\u003e    - `url` document location\n\u003e    - `pageNumber` current page number\n\u003e    - `totalPages` total pages in the document\n\u003e  - `footerTemplate` (string) HTML template for the print footer. Should use the same format as the `headerTemplate`.\n\u003e  - `printBackground` (boolean) Print background graphics. Defaults to `false`.\n\u003e  - `landscape` (boolean) Paper orientation. Defaults to `false`.\n\u003e  - `pageRanges` (string) Paper ranges to print, e.g., '1-5, 8, 11-13'. Defaults to the empty string, which means print all pages.\n\u003e  - `format` (string) Paper format. If set, takes priority over `width` or `height` options. Defaults to 'Letter'. 
Accepts the following values.\n\u003e    - `Letter`: 8.5in x 11in\n\u003e    - `Legal`: 8.5in x 14in\n\u003e    - `Tabloid`: 11in x 17in\n\u003e    - `Ledger`: 17in x 11in\n\u003e    - `A0`: 33.1in x 46.8in\n\u003e    - `A1`: 23.4in x 33.1in\n\u003e    - `A2`: 16.54in x 23.4in\n\u003e    - `A3`: 11.7in x 16.54in\n\u003e    - `A4`: 8.27in x 11.7in\n\u003e    - `A5`: 5.83in x 8.27in\n\u003e    - `A6`: 4.13in x 5.83in  \n\u003e  - `width` (string|number) Paper width, accepts values labeled with units.\n\u003e  - `height` (string|number) Paper height, accepts values labeled with units.\n\u003e  - `margin` (object) Paper margins, defaults to none.\n\u003e    - `top` (string|number) Top margin, accepts values labeled with units.\n\u003e    - `right` (string|number) Right margin, accepts values labeled with units.\n\u003e    - `bottom` (string|number) Bottom margin, accepts values labeled with units.\n\u003e    - `left` (string|number) Left margin, accepts values labeled with units.\n\u003e  - `preferCSSPageSize` (boolean) Give any CSS `@page` size declared in the page priority over what is declared in `width` and `height` or `format` options. Defaults to `false`, which will scale the content to fit the paper size.\n\u003e\n\u003e The `width`, `height`, and `margin` options accept values labeled with units. Unlabeled values are treated as pixels.\n\u003e \n\u003e All possible units are:\n\u003e  - `px` - pixel\n\u003e  - `in` - inch\n\u003e  - `cm` - centimeter\n\u003e  - `mm` - millimeter\n\u003e\n\u003e -- [Puppeteer API Tip-Of-Tree page.pdf([options])][3]\n\n[3]: https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagepdfoptions\n\n##### Unsupported Arguments\n  - `path` (string) \n\n#### (array) `headers`\nAdditional HTTP headers sent to the page.\n\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com%2F\u0026output=jpg\u0026headers[Accept-Language]=en\u0026headers[dnt]=1\n```\n\n#### (array) `cookies`\nCookies to set. 
\n\nAccepts a linear array holding objects with the following key-value pairs.\n\n  - `name` \u003cstring\u003e required\n  - `value` \u003cstring\u003e required\n  - `domain` \u003cstring\u003e   \n  - `url` \u003cstring\u003e \n  - `path` \u003cstring\u003e\n  - `expires` \u003cnumber\u003e Unix time in seconds.\n  - `httpOnly` \u003cboolean\u003e\n  - `secure` \u003cboolean\u003e\n  - `sameSite` \u003c\"Strict\"|\"Lax\"\u003e\n\nIf the `domain` argument is missing, the `url` argument is automatically set to the requested URL.\n\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com%2F\u0026output=jpg\u0026cookies[0][name]=foo\u0026cookies[0][value]=bar\n```\n\n#### (array) `args`\n\nThe `args` argument for the `puppeteer.launch()` method. For accepted arguments, please see [here](https://peter.sh/experiments/chromium-command-line-switches/).\n\ne.g.\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com%2F\u0026output=jpg\u0026args[]=--lang=en-GB\n```\n\n#### (string) `proxy`\n\nFormat: `scheme://username:password@ipaddress:port`\n\nFor example, to set `socks4://127.0.0.1:1080`,\n\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fwww.google.com\u0026output=png\u0026proxy=socks4%3A%2F%2F127.0.0.1:1080\n```\n\n#### (array) `block`\nBlocks specified resources. This has the following sub-argument keys.\n- `types`\n- `urls`\n\n##### (array) `types` \nSpecifies the types to block.\n\nAccepted values:\n- `image`\n- `stylesheet`\n- `font`\n- `script` \n\nWhen the output type is `html` or `json` and no `block` value is passed, `image`, `stylesheet`, and `font` are blocked by default.  \n\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fwww.amazon.com%2Fgp%2Fgoldbox\u0026output=png\u0026block[types][]=script\n```\n\n##### (array) `urls`\nSpecifies parts of URLs to block. 
Use an asterisk (`*`) to match any characters.\n\nFor example: \n- `*.optimizely.com`\n- `googleadservices.com`\n\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fwww.amazon.com%2Fgp%2Fgoldbox\u0026output=png\u0026block[urls][]=googleadservices.com\n```\n\n#### (string|array) `waitUntil`\nDetermines when Puppeteer decides the page is fully loaded. The same as the `waitUntil` parameter of the `page.goto()` method. Accepted values are `load`, `domcontentloaded`, `networkidle0`, and `networkidle2`. \n\nDefault: `load`.\n\n\u003e  - `load` - consider navigation to be finished when the load event is fired.\n\u003e  - `domcontentloaded` - consider navigation to be finished when the DOMContentLoaded event is fired.\n\u003e  - `networkidle0` - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.\n\u003e  - `networkidle2` - consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.\n\u003e\n\u003e -- [Puppeteer API Tip-Of-Tree page.goto(url[, options])][4]\n\n[4]: https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options\n\n#### (array) `action`\nPerforms actions on the loaded web page, such as clicking, removing, typing, and waiting.\n\nThe `action` parameter must be a numeric linear array holding key-value pairs of action type and action value. \n\nFor example, the following request will perform a search on DuckDuckGo. \n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fduckduckgo.com\u0026output=png\u0026action[0][select]=%23search_form_input_homepage\u0026action[1][type]=Web%20Page%20Dumper\u0026action[2][click]=%23search_button_homepage\u0026action[3][waitForNavigation]=\n```\n\nNotice that actions are performed sequentially. 
The above example is interpreted as\n\n```\n[\n  {\n    select: #search_form_input_homepage\n  },\n  {\n    type: Web Page Dumper\n  },\n  {\n    click: #search_button_homepage\n  },\n  {\n    waitForNavigation: \n  },\n]\n```\n\n##### Action Types\nThe available action types are as follows.\n\n###### (selector) select\nSelects elements specified with a selector. Use this before an action that does not have a selector parameter.\n\nAccepts a selector. The selector can be XPath. \n\n###### (selector) click\n\nClicks the first element found that matches the selector.\n\nAccepts a selector. The selector can be XPath.\n\nThis clicks on the top-right icon that expands the app panel on the Google home page.\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com\u0026output=jpg\u0026action[][click]=a.gb_C\n```\n\n###### (selector) remove\nRemoves elements specified with a selector. \n\nAccepts a selector. The selector can be XPath.\n\nThis removes the top banner element on the Google search page. Notice that `.k1zIA` is the class selector of the banner container.\n\n```\nhttp(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com\u0026output=png\u0026action[][remove]=.k1zIA\n```\n\n###### (characters) type\nTypes the given characters. \n\nThis action does not accept a selector. Use the `select` action before this to specify an element for typing.\n\n###### (selector) choose\nSelects an item from a `\u003cselect\u003e` tag.\n\nAccepts a selector. _* This action does not support XPath._\n\n###### (selector) extract\nExtracts elements.\n\nThis replaces the elements' innerHTML with the HTML body tag inner HTML. 
Use this to reduce the size of the resulting source code.\n\nAccepts a selector.\n\n###### (selector) extractHard \nExtracts elements.\n\nSimilar to the `extract` action except that this removes all head tag elements, so the styles will be lost.\n\nAccepts a selector.\n\n###### (selector) waitForElement\nWaits for an element to appear.\n\nAccepts a selector. The selector can be XPath.\n\n###### (void) waitForNavigation\nWaits for the next page to load. Use this after clicking a link or submitting a form.\n\n###### (integer) waitForTimeout\nWaits for the given number of milliseconds.\n\nAccepts a positive number.\n\n### Setting Connection Timeout\nOn Heroku, each HTTP request must receive a response within 30 seconds to avoid recurring 503 errors.\n\nTo set the timeout, use the `WPD_TIMEOUT` environment variable. It accepts a value in milliseconds such as `29000`.   \n\nThere are two main options: \n\n- **a)** Create a file named `.env` with the following entry in the project root directory (the same location as app.js).\n\n```\nWPD_TIMEOUT=29000\n```\n\n- **b)** On Heroku, go to _**Dashboard**_ -\u003e _(Choose your App)_ -\u003e _**Settings**_ -\u003e _**Config Vars**_ and add `WPD_TIMEOUT` with a value such as `29000`.\n\n### Logging\n\n#### Enabling Log Pages\nTo enable access to the app's logs, set the `WPD_LOG_ROUTE` environment variable to a value that serves as the route name (part of the URL path). \n\nThere are two main options: \n\n- **a)** Create a file named `.env` with the following entry in the project root directory (the same location as app.js).\n\n```\n#LOGGING\nWPD_LOG_ROUTE=log\n```\n\n- **b)** On Heroku, go to _**Dashboard**_ -\u003e _(Choose your App)_ -\u003e _**Settings**_ -\u003e _**Config Vars**_ and add `WPD_LOG_ROUTE` with a value such as `log`.\n\nIn the above examples, `log` is used for the route name. You can set your desired name. 
\n\n#### Log Pages\nThere are four log types available: `request`, `browser`, `debug`, and `error`. If the route name is `log`, the following pages are available.\n\n##### request\nLogs HTTP requests. \n\nFormat:\n```\nhttp(s)://{app address}/{log route}/request/{YYYY-MM-DD}\n```\n\nExample:\n```\nhttps://web-page-dumper.herokuapp.com/log/request/2021-06-27\n```\n\n\n##### browser\nLogs browser activities.\n\nFormat:\n```\nhttp(s)://{app address}/{log route}/browser/{YYYY-MM-DD}\n```\n\nExample:\n```\nhttps://web-page-dumper.herokuapp.com/log/browser/2021-06-27\n```\n\n##### debug\nLogs debug information.\n\nFormat:\n```\nhttp(s)://{app address}/{log route}/debug/{YYYY-MM-DD}\n```\n\nExample:\n```\nhttps://web-page-dumper.herokuapp.com/log/debug/2021-06-27\n```\n\n##### error\nLogs errors.\n\nFormat:\n```\nhttp(s)://{app address}/{log route}/error/{YYYY-MM-DD}\n```\n\nExample:\n```\nhttps://web-page-dumper.herokuapp.com/log/error/2021-06-27\n```\n\n## Deployment to Heroku\nThis web application is meant to run on [Heroku](https://www.heroku.com/). \n\n1. Log in to Heroku. If you don't have an account, create a [Heroku account](https://id.heroku.com/).\n2. Click [![Deploy](https://www.herokucdn.com/deploy/button.svg)](https://heroku.com/deploy?template=https://github.com/michaeluno/web-page-dumper)\n3. On the following page, enter your desired app name and press the `Deploy App` button to start the deployment.\n4. After the deployment finishes, click on `Manage App`.\n5. On the following page, click on `Open App`.   
\n\n### Buildpack\nIf you get the following error,\n\n```\nerror while loading shared libraries: libnss3.so: cannot open shared object file: No such file or directory\n```\n\nYou need to manually add the following buildpack through the Heroku UI (Dashboard -\u003e {Your App} -\u003e Settings -\u003e Buildpacks).\n\n- https://github.com/CoffeeAndCode/puppeteer-heroku-buildpack.git\n \n\n## License\nMIT","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaeluno%2Fweb-page-dumper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichaeluno%2Fweb-page-dumper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaeluno%2Fweb-page-dumper/lists"}