{"id":18710764,"url":"https://github.com/apify/actor-crawler-puppeteer","last_synced_at":"2025-11-03T16:33:09.791Z","repository":{"id":49408001,"uuid":"158679469","full_name":"apify/actor-crawler-puppeteer","owner":"apify","description":"DEPRECATED: An Apify actor that enables crawling of websites using headless Chrome and Puppeteer. The actor is highly customizable and supports recursive crawling of websites as well as lists of URLs.","archived":false,"fork":false,"pushed_at":"2022-07-25T12:46:36.000Z","size":72,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-07-18T09:02:21.082Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://www.apify.com/apify/crawler-puppeteer","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apify.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-11-22T10:07:26.000Z","updated_at":"2025-07-17T19:34:01.000Z","dependencies_parsed_at":"2022-09-14T21:41:36.038Z","dependency_job_id":null,"html_url":"https://github.com/apify/actor-crawler-puppeteer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/apify/actor-crawler-puppeteer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Factor-crawler-puppeteer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Factor-crawler-puppeteer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Factor-crawler-puppeteer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/apify%2Factor-crawler-puppeteer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apify","download_url":"https://codeload.github.com/apify/actor-crawler-puppeteer/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Factor-crawler-puppeteer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266332534,"owners_count":23912662,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-21T11:47:31.412Z","response_time":64,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T12:35:36.126Z","updated_at":"2025-11-03T16:33:09.760Z","avatar_url":"https://github.com/apify.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DEPRECATED: Apify Crawler Puppeteer\n\nVisit https://github.com/apifytech/actor-scraper/tree/master/puppeteer-scraper for the current version.\n\n------------------------\n\u003c!-- toc --\u003e\n\n- [How it works](#how-it-works)\n- [Input](#input)\n- [Page function](#page-function)\n  * [`context`](#context)\n    + [Data structures:](#data-structures)\n    + [Functions:](#functions)\n    + [Class instances:](#class-instances)\n      - [Global Store](#global-store)\n- [Output](#output)\n  * [Dataset](#dataset)\n\n\u003c!-- tocstop --\u003e\n\n## How it works\nCrawler Puppeteer is the most powerful crawler tool in 
our arsenal (aside from developing your own actors).\nIt uses the Puppeteer library to programmatically control a headless Chrome browser and it can make it do\nalmost anything. If using the Crawler does not cut it, Crawler Puppeteer is what you need.\n\nThe downside is that [Puppeteer](https://github.com/GoogleChrome/puppeteer/) is a Node.js library,\nso knowledge of Node.js and its paradigms is expected when working with the Crawler Puppeteer.\n\nIf you need either a more performant, or a simpler tool, see the \n[crawler-cheerio](https://www.apify.com/apify/crawler-cheerio) for unmatched performance,\nor [crawler](https://www.apify.com/apify/crawler) for a plain old JavaScript tool.\n\n## Input\nInput is provided via the pre-configured UI. See the tooltips for more info on the available options.\n\n## Page function\nPage function is a single JavaScript function that enables the user to control the Crawler's operation,\nmanipulate the crawled pages and extract data as needed. It is invoked with a `context` object\ncontaining the following properties:\n\n```js\nconst context = {\n    // USEFUL DATA\n    input, // Unaltered original input as parsed from the UI\n    env, // Contains information about the run such as actorId or runId\n    customData, // Value of the 'Custom data' Crawler option.\n    request, // Apify.Request object.\n    response, // Response object holding the status code and headers.\n    \n    // EXPOSED FUNCTIONS\n    saveSnapshot, // Saves a screenshot and full HTML of the current page to the key value store.\n    skipLinks, // Prevents enqueueing more links via Pseudo URLs on the current page.\n    skipOutput, // Prevents saving the return value of the pageFunction to the default dataset.\n    enqueuePage, // Adds a page to the request queue.\n    jQuery, // A reference to the jQuery $ function (if injectJQuery was used).\n    \n    // EXPOSED OBJECTS\n    globalStore, // Represents an in memory store that can be used to share data across 
pageFunction invocations.\n    requestList, // Reference to the run's default Apify.RequestList.\n    requestQueue, // Reference to the run's default Apify.RequestQueue.\n    dataset, // Reference to the run's default Apify.Dataset.\n    keyValueStore, // Reference to the run's default Apify.KeyValueStore.\n    log, // Reference to Apify.utils.log \n    underscoreJs, // A reference to the Underscore _ object (if injectUnderscore was used).\n}\n```\n### `context`\nThe following tables describe the `context` object in more detail.\n\n#### Data structures:\n\u003ctable\u003e\n\u003cthead\u003e\n    \u003ctr\u003e\u003ctd\u003eArgument\u003c/td\u003e\u003ctd\u003eType\u003c/td\u003e\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003einput\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003estring\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        Raw input as it was received from the UI, represented as a \u003ccode\u003estring\u003c/code\u003e for immutability.\n        You can \u003ccode\u003eJSON.parse()\u003c/code\u003e it to get the values of individual configuration options.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003eenv\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003eObject\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        A map of all the relevant environment variables that you may want to use. 
See the\n        \u003ca href=\"https://sdk.apify.com/docs/api/apify#apifygetenv-code-object-code\" target=\"_blank\"\u003e\u003ccode\u003eApify.getEnv()\u003c/code\u003e\u003c/a\u003e\n        function for a preview of the structure and full documentation.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003ecustomData\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003eObject\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        Since the input UI is fixed, it cannot accommodate every extra field that a\n        specific use case may need. If you need to pass arbitrary data to the crawler, use the Custom data input field\n        and its contents will be available under the \u003ccode\u003ecustomData\u003c/code\u003e context key.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003erequest\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003eRequest\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        Apify uses a \u003ccode\u003erequest\u003c/code\u003e object to represent metadata about the currently crawled page,\n        such as its URL or the number of retries. See the\n        \u003ca href=\"https://sdk.apify.com/docs/api/request\" target=\"_blank\"\u003e\u003ccode\u003eRequest\u003c/code\u003e\u003c/a\u003e\n        class for a preview of the structure and full documentation.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003eresponse\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003e{status: number, headers: Object}\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        The HTTP response object is produced by Puppeteer. 
Currently, we only pass the HTTP status code\n        and the response headers to the \u003ccode\u003econtext\u003c/code\u003e.\n    \u003c/td\u003e\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n#### Functions:\n\u003ctable\u003e\n\u003cthead\u003e\n    \u003ctr\u003e\u003ctd\u003eArgument\u003c/td\u003e\u003ctd\u003eType\u003c/td\u003e\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003esaveSnapshot\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003eFunction\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        A helper function that enables saving a snapshot of the current page's HTML and its screenshot\n        into the default key value store. Each snapshot overwrites the previous one and the function's\n        invocations will also be throttled if invoked more than once in 2 seconds, to prevent abuse.\n        So make sure you don't call it for every single request. 
You can find the screenshot under\n        the SNAPSHOT-SCREENSHOT key and the HTML under the SNAPSHOT-HTML key.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003eskipLinks\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003eFunction\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        With each invocation of the \u003ccode\u003epageFunction\u003c/code\u003e the crawler attempts to extract\n        new URLs from the page using the Link selector and PseudoURLs provided in the input UI.\n        If you want to prevent this behavior in certain cases, call the \u003ccode\u003eskipLinks\u003c/code\u003e\n        function and no URLs will be added to the queue for the given page.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003eskipOutput\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003eFunction\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        Since each return value of the \u003ccode\u003epageFunction\u003c/code\u003e is saved to the default dataset,\n        this provides a way of overriding that functionality. Just call \u003ccode\u003eskipOutput\u003c/code\u003e\n        and the result of the current invocation will not be saved to the dataset.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003eenqueuePage\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003eFunction\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        To enqueue a specific URL manually instead of automatically by a combination of a Link selector\n        and a Pseudo URL, use the \u003ccode\u003eenqueuePage\u003c/code\u003e function. 
It accepts a plain object as its argument,\n        which must have the structure needed to construct a\n        \u003ca href=\"https://sdk.apify.com/docs/api/request\" target=\"_blank\"\u003e\u003ccode\u003eRequest\u003c/code\u003e\u003c/a\u003e object.\n        In practice, a URL is enough: \u003ccode\u003e{ url: 'https://www.example.com' }\u003c/code\u003e\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003ejQuery\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003eFunction\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        To make the DOM manipulation within the page easier, you may choose the \u003ccode\u003einjectJQuery\u003c/code\u003e\n        option in the UI and all the crawled pages will have an instance of the\n        \u003ca href=\"https://jquery.com/\" target=\"_blank\"\u003e\u003ccode\u003ejQuery\u003c/code\u003e\u003c/a\u003e library\n        available. However, since we do not want to modify the page in any way, we don't inject it\n        into the global \u003ccode\u003e$\u003c/code\u003e object as you may be used to, but instead we make it available\n        in \u003ccode\u003econtext\u003c/code\u003e. Feel free to \u003ccode\u003econst $ = context.jQuery\u003c/code\u003e to get the familiar notation.\n    \u003c/td\u003e\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n#### Class instances:\n##### Global Store\n`globalStore` represents an instance of a very simple in-memory store that is not scoped to the individual\n`pageFunction` invocation. This enables you to easily share global data such as API responses, tokens and other values.\nSince the stored data needs to cross from the browser to the Node.js process, it cannot be arbitrary data,\nbut must always be JSON-serializable. Therefore, you cannot store DOM objects, live class instances,\nfunctions etc. 
Only a JSON representation of the passed object will be stored, with all the relevant limitations.\n\n\u003ctable\u003e\n\u003cthead\u003e\n    \u003ctr\u003e\u003ctd\u003eMethod\u003c/td\u003e\u003ctd\u003eReturn Type\u003c/td\u003e\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003eget(key:string)\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003ePromise\u0026lt;Object\u0026gt;\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        Retrieves a JSON-serializable value from the global store using the provided key.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003eset(key:string, value:Object)\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003ePromise\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        Saves a JSON-serializable value to the global store using the provided key.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003esize()\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003ePromise\u0026lt;number\u0026gt;\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        Returns the current number of values in the global store.\n    \u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003ccode\u003elist()\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ccode\u003ePromise\u0026lt;Array\u0026gt;\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\n        Returns all the keys currently stored in the global store.\n    \u003c/td\u003e\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n## Output\n\nOutput is a dataset containing extracted data for each scraped page.\n\n### Dataset\nFor each of the scraped URLs, the dataset contains an object with results and some metadata.\nIf you were scraping the HTML 
`\u003ctitle\u003e` of [IANA](https://www.iana.org/) it would look like this:\n\n```json\n{\n  \"title\": \"Internet Assigned Numbers Authority\",\n  \"#error\": false,\n  \"#debug\": {\n    \"url\": \"https://www.iana.org/\",\n    \"method\": \"GET\",\n    \"retryCount\": 0,\n    \"errorMessages\": null,\n    \"requestId\": \"e2Hd517QWfF4tVh\"\n  }\n}\n```\n\nThe metadata are prefixed with a `#`. Soon you will be able to exclude the metadata\nfrom the results by providing an API flag.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapify%2Factor-crawler-puppeteer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapify%2Factor-crawler-puppeteer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapify%2Factor-crawler-puppeteer/lists"}