{"id":26555260,"url":"https://github.com/luminati-io/crawlee-web-scraping","last_synced_at":"2026-04-17T01:31:23.035Z","repository":{"id":283783970,"uuid":"949420740","full_name":"luminati-io/crawlee-web-scraping","owner":"luminati-io","description":"Use Crawlee for efficient web scraping in Node.js, including proxy rotation, session management, and handling dynamic content.","archived":false,"fork":false,"pushed_at":"2025-03-16T12:52:45.000Z","size":441,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-03-22T07:02:01.354Z","etag":null,"topics":["crawlee","dynamic-content","node-js","npm","proxy-rotation","session-management","web-scraper"],"latest_commit_sha":null,"homepage":"https://brightdata.com/blog/web-data/web-scraping-with-crawlee","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luminati-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-16T12:30:23.000Z","updated_at":"2025-03-16T12:55:37.000Z","dependencies_parsed_at":"2025-03-22T07:02:03.483Z","dependency_job_id":"bd180933-b89e-41f4-a2e2-66327b838aeb","html_url":"https://github.com/luminati-io/crawlee-web-scraping","commit_stats":null,"previous_names":["luminati-io/crawlee-web-scraping"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/luminati-io/crawlee-web-scraping","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fcrawlee-web-scraping","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
luminati-io%2Fcrawlee-web-scraping/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fcrawlee-web-scraping/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fcrawlee-web-scraping/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luminati-io","download_url":"https://codeload.github.com/luminati-io/crawlee-web-scraping/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fcrawlee-web-scraping/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31911432,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-16T18:22:33.417Z","status":"ssl_error","status_checked_at":"2026-04-16T18:21:47.142Z","response_time":69,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawlee","dynamic-content","node-js","npm","proxy-rotation","session-management","web-scraper"],"created_at":"2025-03-22T10:25:41.844Z","updated_at":"2026-04-17T01:31:23.013Z","avatar_url":"https://github.com/luminati-io.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraping with 
Crawlee\n\n[![Promo](https://github.com/luminati-io/LinkedIn-Scraper/raw/main/Proxies%20and%20scrapers%20GitHub%20bonus%20banner.png)](https://brightdata.com/) \n\nLearn how to use Crawlee for efficient [web scraping with Node.js](https://brightdata.com/blog/how-tos/web-scraping-with-node-js):\n\n- [Basic Web Scraping with Crawlee](#basic-web-scraping-with-crawlee)\n- [Proxy Rotation with Crawlee](#proxy-rotation-with-crawlee)\n- [Sessions Management with Crawlee](#sessions-management-with-crawlee)\n- [Dynamic Content Handling with Crawlee](#dynamic-content-handling-with-crawlee)\n\n## Prerequisites\n\nBefore you start, make sure you have the following prerequisites installed:\n\n* **[Node.js](https://nodejs.org/).**\n* **[npm](https://www.npmjs.com/):** This typically comes with Node.js. You can verify the installation by running `node -v` or `npm -v` in your terminal.\n* **A code editor of your choice:** This tutorial uses [Visual Studio Code](https://code.visualstudio.com/).\n\n## Basic Web Scraping with Crawlee\n\nLet’s start by scraping the [Books to Scrape](https://books.toscrape.com/) website.\n\nOpen your terminal or shell and initialize a Node.js project:\n\n```bash\nmkdir crawlee-tutorial\ncd crawlee-tutorial\nnpm init -y\n```\n\nInstall the Crawlee library:\n\n```bash\nnpm install crawlee\n```\n\nTo scrape data effectively, inspect the target website’s HTML structure. Open the site in your browser, right-click anywhere on the page, and select **Inspect** or **Inspect Element** in **Developer Tools**.\n\n![Inspect HTML element](https://github.com/luminati-io/crawlee-web-scraping/blob/main/images/Inspect-HTML-element-1024x540.png)\n\nThe **Elements** tab in **Developer Tools** displays the page’s HTML layout. In this example:  \n\n- Each book is inside an `article` tag with the class `product_pod`.  \n- The book title is in an `h3` tag, with the actual title stored in the `title` attribute of the nested `a` tag.  
\n- The book price is inside a `p` tag with the class `price_color`.  \n\n![Inspect the HTML elements on the Books to Scrape website](https://github.com/luminati-io/crawlee-web-scraping/blob/main/images/Inspect-the-HTML-elements-on-the-Books-to-Scrape-website-1024x522.png)\n\nUnder the root directory of your project, create a file named `scrape.js` and add the following code:\n\n```js\nconst { CheerioCrawler } = require('crawlee');\n\nconst crawler = new CheerioCrawler({\n    async requestHandler({ request, $ }) {\n        const books = [];\n        $('article.product_pod').each((index, element) =\u003e {\n            const title = $(element).find('h3 a').attr('title');\n            const price = $(element).find('.price_color').text();\n            books.push({ title, price });\n        });\n        console.log(books);\n    },\n});\n\ncrawler.run(['https://books.toscrape.com/']);\n```\n\nThis code uses `CheerioCrawler` from `crawlee` to extract book titles and prices from `https://books.toscrape.com/`. 
It fetches the HTML, selects `\u003carticle class=\"product_pod\"\u003e` elements using jQuery-like syntax, and logs the results to the console.\n\nAfter adding the code to your `scrape.js` file, run it with the following command:\n\n```bash\nnode scrape.js\n```\n\nAn array of book titles and prices should print to your terminal:\n\n```\n…output omitted…\n  {\n    title: 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',\n    price: '£22.60'\n  },\n  { title: 'The Black Maria', price: '£52.15' },\n  {\n    title: 'Starving Hearts (Triangular Trade Trilogy, #1)',\n    price: '£13.99'\n  },\n  { title: \"Shakespeare's Sonnets\", price: '£20.66' },\n  { title: 'Set Me Free', price: '£17.46' },\n  {\n    title: \"Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)\",\n    price: '£52.29'\n  },\n…output omitted…\n```\n\n## Proxy Rotation with Crawlee\n\nA proxy acts as a middleman between your computer and the internet, forwarding your web requests while masking your IP address. This helps prevent rate limits and IP bans.\n\nCrawlee simplifies proxy implementation with built-in handling for retries, errors, and rotating proxies.\n\nNext, you'll set up a proxy, obtain a valid proxy address, and verify your requests are routed through it.\n\nSince free proxies are often slow, insecure, and unreliable for sensitive web tasks, consider using a trusted service like Bright Data, which provides secure, stable, and reliable proxies. It also offers free trials, allowing you to test the service before committing. 
\n\nTo use Bright Data, click the **Start free trial** button on their [home page](https://brightdata.com/) and fill in the required information to create an account.\n\nOnce your account is created, log in to the Bright Data dashboard, navigate to **Proxies \u0026 Scraping Infrastructure**, and add a new proxy by selecting **[Residential Proxies](https://brightdata.com/proxy-types/residential-proxies)**:\n\n![Add a residential proxy](https://github.com/luminati-io/crawlee-web-scraping/blob/main/images/Add-a-residential-proxy-1024x574.png)\n\nRetain the default settings and finalize the creation of your residential proxy by clicking **Add**.\n\nIf you are asked to install a certificate, you can select **Proceed without certificate**. However, for production use, you should set up the certificate to prevent misuse if your proxy information is ever exposed.\n\nOnce created, take note of the proxy credentials, including the host, port, username, and password. You'll need these in the next step:\n\n![Bright Data proxy credentials](https://github.com/luminati-io/crawlee-web-scraping/blob/main/images/Bright-Data-proxy-credentials-1024x557.png)\n\nIn the root directory of your project, run the following command to install the [axios](https://www.npmjs.com/package/axios) library:\n\n```bash\nnpm install axios\n```\n\nThe `axios` library is used to send a GET request to `http://lumtest.com/myip.json`, which returns details about the proxy in use.\n\nTo implement this, create a file named `scrapeWithProxy.js` in your project's root directory and add the following code:\n\n```js\nconst { CheerioCrawler, ProxyConfiguration } = require(\"crawlee\");\nconst axios = require(\"axios\");\n\nconst proxyConfiguration = new ProxyConfiguration({\n  proxyUrls: [\"http://USERNAME:PASSWORD@HOST:PORT\"],\n});\n\nconst crawler = new CheerioCrawler({\n  proxyConfiguration,\n  async requestHandler({ $ }) {\n    // Make a GET request to 
the proxy information URL\n    try {\n      const proxyInfo = await axios.get(\"http://lumtest.com/myip.json\", {\n        proxy: {\n          host: \"HOST\",\n          port: PORT,\n          auth: {\n            username: \"USERNAME\",\n            password: \"PASSWORD\",\n          },\n        },\n      });\n      console.log(\"Proxy Information:\", proxyInfo.data);\n    } catch (error) {\n      console.error(\"Error fetching proxy information:\", error.message);\n    }\n\n    const books = [];\n    $(\"article.product_pod\").each((index, element) =\u003e {\n      const title = $(element).find(\"h3 a\").attr(\"title\");\n      const price = $(element).find(\".price_color\").text();\n      books.push({ title, price });\n    });\n    console.log(books);\n  },\n});\n\ncrawler.run([\"https://books.toscrape.com/\"]);\n```\n\n\u003e **Note:**\n\u003e \n\u003e Make sure to replace the `HOST`, `PORT`, `USERNAME`, and `PASSWORD` with your credentials.\n\nThis code uses `CheerioCrawler` from `crawlee` to scrape data from `https://books.toscrape.com/` while routing requests through a specified proxy.  \n\n- The proxy is configured using `ProxyConfiguration`.  \n- A GET request to `http://lumtest.com/myip.json` fetches and logs proxy details.  \n- Book titles and prices are extracted using Cheerio’s jQuery-like syntax and logged to the console.  \n\nRun the code to test the proxy setup and verify its functionality:\n\n```bash\nnode scrapeWithProxy.js\n```\n\nYou’ll see similar results to before, but this time, your requests are routed through Bright Data proxies. 
You should also see the details of the proxy logged in the console:\n\n```js\nProxy Information: {\n  country: 'US',\n  asn: { asnum: 21928, org_name: 'T-MOBILE-AS21928' },\n  geo: {\n    city: 'El Paso',\n    region: 'TX',\n    region_name: 'Texas',\n    postal_code: '79925',\n    latitude: 31.7899,\n    longitude: -106.3658,\n    tz: 'America/Denver',\n    lum_city: 'elpaso',\n    lum_region: 'tx'\n  }\n}\n[\n  { title: 'A Light in the Attic', price: '£51.77' },\n  { title: 'Tipping the Velvet', price: '£53.74' },\n  { title: 'Soumission', price: '£50.10' },\n  { title: 'Sharp Objects', price: '£47.82' },\n  { title: 'Sapiens: A Brief History of Humankind', price: '£54.23' },\n  { title: 'The Requiem Red', price: '£22.65' },\n…output omitted…\n```\n\nRunning the script again with `node scrapeWithProxy.js` should report different proxy details on each run, confirming that Bright Data rotates locations and IPs automatically. This rotation helps prevent blocks and IP bans when scraping websites.\n\n\u003e **Note:**\n\u003e \n\u003e In `proxyConfiguration`, you could pass several proxy URLs in `proxyUrls` for Crawlee to rotate through, but since Bright Data rotates IPs for you, a single proxy URL is enough.\n\n## Sessions Management with Crawlee\n\nSessions help maintain state across multiple requests, especially for sites using cookies or login sessions.  
\n\nTo implement session management, create a file named `scrapeWithSessions.js` in your project's root directory and add the following code:\n\n```js\nconst { CheerioCrawler, SessionPool } = require(\"crawlee\");\n\n(async () =\u003e {\n  // Open a session pool\n  const sessionPool = await SessionPool.open();\n\n  // Ensure there is a session in the pool\n  let session = await sessionPool.getSession();\n  if (!session) {\n    session = await sessionPool.createSession();\n  }\n\n  const crawler = new CheerioCrawler({\n    useSessionPool: true, // Enable session pool\n    async requestHandler({ request, $, response, session }) {\n      // Log the session information\n      console.log(`Using session: ${session.id}`);\n\n      // Extract book data and log it (for demonstration)\n      const books = [];\n      $(\"article.product_pod\").each((index, element) =\u003e {\n        const title = $(element).find(\"h3 a\").attr(\"title\");\n        const price = $(element).find(\".price_color\").text();\n        books.push({ title, price });\n      });\n      console.log(books);\n    },\n  });\n\n  // First run\n  await crawler.run([\"https://books.toscrape.com/\"]);\n  console.log(\"First run completed.\");\n\n  // Second run\n  await crawler.run([\"https://books.toscrape.com/\"]);\n  console.log(\"Second run completed.\");\n})();\n```\n\nThis code uses `CheerioCrawler` and `SessionPool` from `crawlee` to scrape data from `https://books.toscrape.com/`.  \n\n- A standalone session pool is opened with `SessionPool.open()`, and the crawler enables its own session pool through `useSessionPool: true`.  \n- The `requestHandler` logs the ID of the session handling each request and extracts book titles and prices using Cheerio selectors.  \n- The script performs two consecutive scraping runs, logging the session ID each time.  
\n\nRun the code to verify that different sessions are being used:\n\n```bash\nnode scrapeWithSessions.js\n```\n\nYou should see results similar to before, this time with the session ID for each run:\n\n```\nUsing session: session_GmKuZ2TnVX\n[\n  { title: 'A Light in the Attic', price: '£51.77' },\n  { title: 'Tipping the Velvet', price: '£53.74' },\n…output omitted…\nUsing session: session_lNRxE89hXu\n[\n  { title: 'A Light in the Attic', price: '£51.77' },\n  { title: 'Tipping the Velvet', price: '£53.74' },\n…output omitted…\n```\n\nIf you run the code again, you should see that a different session ID is being used.\n\n## Dynamic Content Handling with Crawlee\n\nScraping **dynamic websites** (those that load content via JavaScript) can be challenging, as data is only available after rendering.  \n\nTo handle this, Crawlee integrates with [Puppeteer](https://pptr.dev/), a Node.js library that drives a headless Chrome browser, so pages are rendered with JavaScript and can be interacted with much as a human would. `PuppeteerCrawler` needs the `puppeteer` package, which Crawlee does not install for you; add it with `npm install puppeteer`.  \n\nFor demonstration, we'll scrape content from [this YouTube page](https://www.youtube.com/watch?v=wZ6cST5pexo). 
**Before scraping, always review the site's rules and terms of service.**  \n\nAfter reviewing the terms, create a file named `scrapeDynamicContent.js` in your project's root directory and add the following code:\n\n```js\nconst { PuppeteerCrawler } = require(\"crawlee\");\n\nasync function scrapeYouTube() {\n  const crawler = new PuppeteerCrawler({\n    async requestHandler({ page, request, enqueueLinks, log }) {\n      const { url } = request;\n      await page.goto(url, { waitUntil: \"networkidle2\" });\n\n      // Scraping first 10 comments\n      const comments = await page.evaluate(() =\u003e {\n        return Array.from(document.querySelectorAll(\"#comments #content-text\"))\n          .slice(0, 10)\n          .map((el) =\u003e el.innerText);\n      });\n\n      log.info(`Comments: ${comments.join(\"\\n\")}`);\n    },\n\n    launchContext: {\n      launchOptions: {\n        headless: true,\n      },\n    },\n  });\n\n  // Add the URL of the YouTube video you want to scrape\n  await crawler.run([\"https://www.youtube.com/watch?v=wZ6cST5pexo\"]);\n}\n\nscrapeYouTube();\n```\n\nThen, run the code with the following command:\n\n```bash\nnode scrapeDynamicContent.js\n```\n\nThis code uses `PuppeteerCrawler` from Crawlee to scrape comments from a YouTube video.  \n\n- The crawler navigates to a specific YouTube video URL and waits for the page to fully load.  \n- It selects the first ten comments using the CSS selector `#comments #content-text`.  \n- Extracted comments are logged to the console.  \n\nWhen executed, the script will output the first ten comments from the selected video.\n\n```\nINFO  PuppeteerCrawler: Starting the crawler.\nINFO  PuppeteerCrawler: Comments: Who are you rooting for?? 
US Marines or Ex Cons \nBro Mateo is a beast, no lifting straps, close stance.\nex convict doing the pushups is a monster.\nI love how quick this video was, without nonsense talk and long intros or outros\n\"They Both have combat experience\" is wicked \nThat military guy doing that deadlift is really no joke.. ...\nOne lives to fight and the other fights to live.\nFinally something that would test the real outcome on which narrative is true on movies\nI like the comradery between all of them. Especially on the Bench Press ... Both team members quickly helped out on the spotting to protect from injury. Well done.\nI like this style, no youtube funny business. Just straight to the lifts\n…output omitted…\n```\n\nYou can find all the code used in this tutorial on [GitHub](https://github.com/See4Devs/crawlee-web-scraping).\n\n## Conclusion\n\nCrawlee can help improve the efficiency and reliability of your web scraping projects. Ready to elevate your web scraping projects with professional-grade data, tools, and proxies? Explore the comprehensive web scraping platform of Bright Data, offering [ready-to-use datasets](https://brightdata.com/products/datasets) and [advanced proxy services](https://brightdata.com/proxy-types) to streamline your data collection efforts.\n\nSign up now and start your free trial!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fcrawlee-web-scraping","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluminati-io%2Fcrawlee-web-scraping","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fcrawlee-web-scraping/lists"}