{"id":13393353,"url":"https://github.com/emadehsan/thal","last_synced_at":"2025-05-15T14:08:13.193Z","repository":{"id":48164617,"uuid":"101297021","full_name":"emadehsan/thal","owner":"emadehsan","description":"Getting started with Puppeteer and Chrome Headless for Web Scraping","archived":false,"fork":false,"pushed_at":"2020-10-28T11:22:32.000Z","size":649,"stargazers_count":2358,"open_issues_count":0,"forks_count":206,"subscribers_count":52,"default_branch":"master","last_synced_at":"2025-04-07T17:06:19.788Z","etag":null,"topics":["chrome-headless","mongodb","mongoose","nodejs","puppeteer","scraping"],"latest_commit_sha":null,"homepage":"https://emadehsan.com","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/emadehsan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-08-24T13:20:02.000Z","updated_at":"2025-02-11T03:31:57.000Z","dependencies_parsed_at":"2022-08-12T19:40:59.526Z","dependency_job_id":null,"html_url":"https://github.com/emadehsan/thal","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emadehsan%2Fthal","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emadehsan%2Fthal/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emadehsan%2Fthal/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emadehsan%2Fthal/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/emadehsan","download_url":"https://codeload.github.com/emadehsan/thal/tar.gz/refs/heads/master","ho
st":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254355335,"owners_count":22057354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chrome-headless","mongodb","mongoose","nodejs","puppeteer","scraping"],"created_at":"2024-07-30T17:00:50.979Z","updated_at":"2025-05-15T14:08:08.181Z","avatar_url":"https://github.com/emadehsan.png","language":"JavaScript","readme":"\n# Getting started with Puppeteer and Chrome Headless for Web Scraping\n\n**Here is a link to [Medium Article](https://medium.com/@e_mad_ehsan/getting-started-with-puppeteer-and-chrome-headless-for-web-scrapping-6bf5979dee3e)**\n\n**Here is the [Chinese Version](https://github.com/csbun/thal) thanks to [@csbun](https://github.com/csbun/)** \n\n![A desert in a painter's perception](./media/desertious.jpg)\n\n[`Puppeteer`](https://github.com/GoogleChrome/puppeteer) is the official tool for Chrome Headless, by the Google Chrome team. Since the official announcement of Chrome Headless, many of the industry-standard libraries for automated testing have been discontinued by their maintainers, including **PhantomJS**. **Selenium IDE for Firefox** has also been discontinued due to a lack of maintainers.\n\nWith Chrome being the market leader in web browsing, **Chrome Headless** is going to be the industry leader in **Automated Testing** of web applications. 
So, I have put together this starter guide on how to get started with `Web Scraping` in **Chrome Headless**.\n\n## TL;DR\nIn this guide we will log in to GitHub and extract and save the public emails of users, using `Chrome Headless`, `Puppeteer`, `Node` and `MongoDB`. Don't worry, GitHub has a rate limiting mechanism in place to keep you under control, but this post will give you a good idea of scraping with Chrome Headless and Node. Also, always stay updated with the [documentation](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md), because `Puppeteer` is under development and its APIs are prone to change.\n\n## Getting Started\nBefore we start, we need the following tools installed. Head over to their websites and install them.\n* [Node 8.+](https://nodejs.org)\n* [MongoDB](http://mongodb.com)\n\n## Project setup\n\nStart off by making the project directory\n\n```\n$ mkdir thal\n$ cd thal\n```\n\nInitialize npm and put in the necessary details.\n\n```\n$ npm init\n```\n\nInstall `Puppeteer`. It is not yet stable and the repository is updated daily. If you want the latest functionality, you can install it directly from its GitHub repository.\n\n```\n$ npm i --save puppeteer\n```\n\nPuppeteer bundles its own Chromium, which is guaranteed to work headless. So each time you install or update Puppeteer, it will download its specific Chromium version.\n\n## Coding\nWe will start by taking a screenshot of the page. This is code from their documentation.\n\n### Screenshot\n\n```js\nconst puppeteer = require('puppeteer');\n\nasync function run() {\n  const browser = await puppeteer.launch();\n  const page = await browser.newPage();\n\n  await page.goto('https://github.com');\n  await page.screenshot({ path: 'screenshots/github.png' });\n\n  await browser.close();\n}\n\nrun();\n```\n\nIf it's your first time using `Node` 7 or 8, you might be unfamiliar with the `async` and `await` keywords. To put `async/await` in really simple words: an async function returns a Promise. 
When the promise resolves, it returns the result you asked for. To get that result in a single line, you tie the call to the async function with `await`.\nSave this in `index.js` inside the project directory.\n\nAlso create the screenshots directory.\n\n```\n$ mkdir screenshots\n```\n\nRun the code with\n\n```\n$ node index.js\n```\n\nThe screenshot is now saved inside the `screenshots/` directory.\n\n![GitHub](./screenshots/github.png)\n\n### Login to GitHub\nIf you go to GitHub, search for *john*, and then click the Users tab, you will see a list of all users with that name.\n\n![Johns](./media/all-johns.png)\n\nSome of them have made their emails publicly visible and some have chosen not to. But you can't see these emails without logging in. So, let's log in. We will make heavy use of the [Puppeteer documentation](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md).\n\nAdd a file `creds.js` in the project root. I highly recommend signing up for a new account with a dummy email, because you **might** end up getting your account blocked.\n\n```js\nmodule.exports = {\n    username: '\u003cGITHUB_USERNAME\u003e',\n    password: '\u003cGITHUB_PASSWORD\u003e'\n}\n```\n\nAdd another file `.gitignore` and put the following content inside it:\n\n```txt\n\nnode_modules/\ncreds.js\n```\n\n#### Launch in non headless\nFor visual debugging, make Chrome launch with a GUI by passing an object with `headless: false` to the `launch` method.\n\n```js\nconst browser = await puppeteer.launch({\n  headless: false\n});\n```\n\nLet's navigate to the login page\n\n```js\nawait page.goto('https://github.com/login');\n```\n\nOpen [https://github.com/login](https://github.com/login) in your browser. Right click on the input box below **Username or email address** and select `Inspect`. 
From the developer tools, right click on the highlighted code and select `Copy`, then `Copy selector`.\n\n![Copy dom element selector](./media/copy-selector.png)\n\nPaste that value into the following constant\n\n```js\nconst USERNAME_SELECTOR = '#login_field'; // \"#login_field\" is the copied value\n```\n\nRepeat the process for the Password input box and the Sign in button. You would end up with the following\n\n```js\n// dom element selectors\nconst USERNAME_SELECTOR = '#login_field';\nconst PASSWORD_SELECTOR = '#password';\nconst BUTTON_SELECTOR = '#login \u003e form \u003e div.auth-form-body.mt-3 \u003e input.btn.btn-primary.btn-block';\n```\n\n#### Logging in\nPuppeteer provides the methods `click` to click a DOM element and `type` to type text into an input box. Let's fill in the credentials, then click login and wait for the redirect.\n\nUp on top, require the `creds.js` file.\n\n```js\nconst CREDS = require('./creds');\n```\n\nAnd then\n\n```js\nawait page.click(USERNAME_SELECTOR);\nawait page.keyboard.type(CREDS.username);\n\nawait page.click(PASSWORD_SELECTOR);\nawait page.keyboard.type(CREDS.password);\n\nawait Promise.all([\n  page.click(BUTTON_SELECTOR),\n  page.waitForNavigation()\n]);\n```\n\n### Search GitHub\nNow that we have logged in, we could programmatically click on the search box, fill it, and then click the Users tab on the results page. But there's an easier way: search requests are usually GET requests, so everything is sent via the URL. Manually type `john` into the search box, click the Users tab, and copy the URL. 
It would be\n\n```js\nconst searchUrl = 'https://github.com/search?q=john\u0026type=Users\u0026utf8=%E2%9C%93';\n```\n\nRearranging a bit\n\n```js\nconst userToSearch = 'john';\nconst searchUrl = `https://github.com/search?q=${userToSearch}\u0026type=Users\u0026utf8=%E2%9C%93`;\n```\n\nLet's navigate to this page and wait a bit to see if the search actually worked\n\n```js\nawait page.goto(searchUrl);\nawait page.waitFor(2*1000);\n```\n\n### Extract Emails\nWe are interested in extracting the `username` and `email` of users. Let's copy the DOM element selectors like we did above.\n\n```js\nconst LIST_USERNAME_SELECTOR = '#user_search_results \u003e div.user-list \u003e div:nth-child(1) div.d-flex \u003e div \u003e a';\nconst LIST_EMAIL_SELECTOR = '#user_search_results \u003e div.user-list \u003e div:nth-child(1) div.d-flex \u003e div \u003e ul \u003e li:nth-child(2) \u003e a';\n\nconst LENGTH_SELECTOR_CLASS = 'user-list-item';\n```\n\nYou can see that I also added `LENGTH_SELECTOR_CLASS` above. If you look at the GitHub page's code inside the developer tools, you will observe that each `div` with class `user-list-item` houses the information of a single user.\n\nCurrently, one way to extract text from an element is by using the `evaluate` method of `Page` or `ElementHandle`. When we navigate to the page with search results, we will use the `page.evaluate` method to get the length of the users list on the page. The `evaluate` method evaluates its code inside the browser context.\n\n```js\nlet listLength = await page.evaluate((sel) =\u003e {\n    return document.getElementsByClassName(sel).length;\n  }, LENGTH_SELECTOR_CLASS);\n```\n\nLet's loop through all the listed users and extract emails. As we loop through the DOM, we have to change the index inside the selectors to point to the next DOM element. 
So, I put an `INDEX` placeholder string at the position in the selectors where the index goes, and replace it with the loop counter on each iteration.\n\n```js\n// const LIST_USERNAME_SELECTOR = '#user_search_results \u003e div.user-list \u003e div:nth-child(1) div.d-flex \u003e div \u003e a';\nconst LIST_USERNAME_SELECTOR = '#user_search_results \u003e div.user-list \u003e div:nth-child(INDEX) div.d-flex \u003e div \u003e a';\n// const LIST_EMAIL_SELECTOR = '#user_search_results \u003e div.user-list \u003e div:nth-child(1) div.d-flex \u003e div \u003e ul \u003e li:nth-child(2) \u003e a';\nconst LIST_EMAIL_SELECTOR = '#user_search_results \u003e div.user-list \u003e div:nth-child(INDEX) div.d-flex \u003e div \u003e ul \u003e li:nth-child(2) \u003e a';\nconst LENGTH_SELECTOR_CLASS = 'user-list-item';\n```\n\nThe loop and extraction\n\n```js\nfor (let i = 1; i \u003c= listLength; i++) {\n    // change the index to the next child\n    let usernameSelector = LIST_USERNAME_SELECTOR.replace(\"INDEX\", i);\n    let emailSelector = LIST_EMAIL_SELECTOR.replace(\"INDEX\", i);\n\n    let username = await page.evaluate((sel) =\u003e {\n        return document.querySelector(sel).getAttribute('href').replace('/', '');\n      }, usernameSelector);\n\n    let email = await page.evaluate((sel) =\u003e {\n        let element = document.querySelector(sel);\n        return element ? element.innerHTML : null;\n      }, emailSelector);\n\n    // not all users have emails visible\n    if (!email)\n      continue;\n\n    console.log(username, ' -\u003e ', email);\n\n    // TODO save this user\n  }\n```\n\nNow if you run the script with `node index.js`, you will see usernames and their corresponding emails printed.\n\n### Go over all the pages\nFirst, we estimate the number of the last page with search results. 
At the top of the search results page, you can see **69,769 users** at the time of this writing.\n\n**Fun Fact: If you compare with the previous screenshot of the page, you will notice that 6 more *john*s have joined GitHub in a matter of a few hours.**\n\n![Number of search items](./media/num-results.png)\n\nCopy its selector from the developer tools. We will write a new function below the `run` function that returns the number of pages we can go through.\n\n```js\nasync function getNumPages(page) {\n  const NUM_USER_SELECTOR = '#js-pjax-container \u003e div.container \u003e div \u003e div.column.three-fourths.codesearch-results.pr-6 \u003e div.d-flex.flex-justify-between.border-bottom.pb-3 \u003e h3';\n\n  let inner = await page.evaluate((sel) =\u003e {\n    let html = document.querySelector(sel).innerHTML;\n\n    // format is: \"69,803 users\"\n    return html.replace(/,/g, '').replace('users', '').trim();\n  }, NUM_USER_SELECTOR);\n\n  let numUsers = parseInt(inner);\n\n  console.log('numUsers: ', numUsers);\n\n  /*\n  * GitHub shows 10 results per page, so\n  */\n  let numPages = Math.ceil(numUsers / 10);\n  return numPages;\n}\n```\n\nAt the bottom of the search results page, if you hover the mouse over the buttons with page numbers, you can see they link to the next pages. The link to the 2nd page of results is `https://github.com/search?p=2\u0026q=john\u0026type=Users\u0026utf8=%E2%9C%93`. Notice the `p=2` query parameter in the URL. 
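As a quick sanity check, the page-count math and the paged-URL construction can be tried in isolation, without a browser. This is a minimal sketch; the helper names `numPagesFor` and `pageUrlFor` are hypothetical, for illustration only, and do not appear in the scraper code:

```js
// Standalone sketch of the pagination math used by the scraper.
// Hypothetical helper: parse a heading like "69,803 users" into a page count.
function numPagesFor(headingText) {
  const numUsers = parseInt(headingText.replace(/,/g, '').replace('users', '').trim(), 10);
  // GitHub shows 10 user results per search page
  return Math.ceil(numUsers / 10);
}

// Hypothetical helper: append the p query parameter for a given page number.
function pageUrlFor(searchUrl, pageNum) {
  return searchUrl + '&p=' + pageNum;
}

const searchUrl = 'https://github.com/search?q=john&type=Users&utf8=%E2%9C%93';
console.log(numPagesFor('69,803 users')); // 6981
console.log(pageUrlFor(searchUrl, 2));
```

Running `pageUrlFor(searchUrl, 2)` reproduces the `p=2` URL we copied above, which is exactly what the outer loop below generates for every page.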
The `p` parameter is what we will use to navigate to the following pages.\n\nAfter wrapping our previous loop in an outer loop that goes through all the pages, the code looks like\n\n```js\nlet numPages = await getNumPages(page);\n\nconsole.log('Numpages: ', numPages);\n\nfor (let h = 1; h \u003c= numPages; h++) {\n\n\tlet pageUrl = searchUrl + '\u0026p=' + h;\n\n\tawait page.goto(pageUrl);\n\n\tlet listLength = await page.evaluate((sel) =\u003e {\n\t\treturn document.getElementsByClassName(sel).length;\n\t}, LENGTH_SELECTOR_CLASS);\n\n\tfor (let i = 1; i \u003c= listLength; i++) {\n\t\t// change the index to the next child\n\t\tlet usernameSelector = LIST_USERNAME_SELECTOR.replace(\"INDEX\", i);\n\t\tlet emailSelector = LIST_EMAIL_SELECTOR.replace(\"INDEX\", i);\n\n\t\tlet username = await page.evaluate((sel) =\u003e {\n\t\t\treturn document.querySelector(sel).getAttribute('href').replace('/', '');\n\t\t}, usernameSelector);\n\n\t\tlet email = await page.evaluate((sel) =\u003e {\n\t\t\tlet element = document.querySelector(sel);\n\t\t\treturn element ? element.innerHTML : null;\n\t\t}, emailSelector);\n\n\t\t// not all users have emails visible\n\t\tif (!email)\n\t\t\tcontinue;\n\n\t\tconsole.log(username, ' -\u003e ', email);\n\n\t\t// TODO save this user\n\t}\n}\n```\n\n### Save to MongoDB\nThe `puppeteer` part is over now. We will use `mongoose` to store the information into `MongoDB`. It's an [ORM](https://en.wikipedia.org/wiki/Object-relational_mapping), actually just a library that facilitates storing information in and retrieving it from the database.\n\n```\n$ npm i --save mongoose\n```\n\nMongoDB is a schema-less NoSQL database, but we can make it follow some rules using Mongoose. First we have to create a `Model`, which is just a representation of a MongoDB `Collection` in code. Create a directory `models`, create a file `user.js` inside it, and put the following code in it: the structure of our collection. 
From now on, whenever we insert something into the `users` collection with Mongoose, it will have to follow this structure.\n\n```js\nconst mongoose = require('mongoose');\n\nlet userSchema = new mongoose.Schema({\n    username: String,\n    email: String,\n    dateCrawled: Date\n});\n\nlet User = mongoose.model('User', userSchema);\n\nmodule.exports = User;\n```\n\nLet's now actually insert. We don't want duplicate emails in our database, so we only insert a user's information if the email is not already present; otherwise we just update the information. For this we use Mongoose's `Model.findOneAndUpdate` method.\n\nAt the top of `index.js` add the imports\n\n```js\nconst mongoose = require('mongoose');\nconst User = require('./models/user');\n```\n\nAdd the following function at the bottom of `index.js` to **upsert** (update or insert) the User model\n\n```js\nfunction upsertUser(userObj) {\n\n\tconst DB_URL = 'mongodb://localhost/thal';\n\n\tif (mongoose.connection.readyState == 0) {\n\t\tmongoose.connect(DB_URL);\n\t}\n\n\t// if this email exists, update the entry, don't insert\n\tconst conditions = { email: userObj.email };\n\tconst options = { upsert: true, new: true, setDefaultsOnInsert: true };\n\n\tUser.findOneAndUpdate(conditions, userObj, options, (err, result) =\u003e {\n\t\tif (err) throw err;\n\t});\n}\n```\n\nStart the MongoDB server. Put the following code inside the for loops, at the place of the comment `// TODO save this user`, in order to save the user\n\n```js\nupsertUser({\n  username: username,\n  email: email,\n  dateCrawled: new Date()\n});\n```\n\nTo check that users are actually being saved, open the mongo shell\n\n```\n$ mongo\n\u003e use thal\n\u003e db.users.find().pretty()\n```\n\nYou will see multiple users added there. This marks the crux of this guide.\n\n## Conclusion\nChrome Headless and Puppeteer are the start of a new era in Web Scraping and Automated Testing. Chrome Headless also supports WebGL. 
You can deploy your scraper in the cloud, sit back, and let it do the heavy lifting. Remember to remove the `headless: false` option when you deploy on a server.\n\n* While scraping, you might be halted by GitHub's rate limiting\n\n![Whoa](./media/whoa.png)\n\n* Another thing I noticed: you cannot go beyond 100 pages on GitHub.\n\n## End note\nDeserts symbolize vastness and are witnesses to the struggles and sacrifices of people who `traversed` these giant mountains of sand. [**Thal**](https://en.wikipedia.org/wiki/Thal_Desert) is a desert in Pakistan spanning multiple districts, including my home district Bhakkar. Somewhat similar is the case with the `Internet` that we `traversed` today in quest of data. That's why I named the repository `Thal`. If you like this effort, please like and share this with others. If you have any suggestions, comment here or approach me directly [@e_mad_ehsan](https://twitter.com/e_mad_ehsan). I would love to hear from you.\n","funding_links":[],"categories":["Opensource projects","JavaScript"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femadehsan%2Fthal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Femadehsan%2Fthal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femadehsan%2Fthal/lists"}