{"id":27629366,"url":"https://github.com/hasdata/playwright-scraping","last_synced_at":"2025-07-08T06:07:05.680Z","repository":{"id":289276382,"uuid":"970714818","full_name":"HasData/playwright-scraping","owner":"HasData","description":"This repository demonstrates web scraping and browser automation using Playwright in both Python and Node.js. It includes scripts for common tasks such as scraping data, interacting with web elements, handling authentication, and managing errors.  ","archived":false,"fork":false,"pushed_at":"2025-04-22T15:14:14.000Z","size":352,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-23T15:16:41.810Z","etag":null,"topics":["javascript","playwright","python","scraping","scraping-node","scraping-python"],"latest_commit_sha":null,"homepage":"https://hasdata.com","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HasData.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-22T12:32:14.000Z","updated_at":"2025-04-22T15:14:17.000Z","dependencies_parsed_at":"2025-04-23T15:16:41.573Z","dependency_job_id":null,"html_url":"https://github.com/HasData/playwright-scraping","commit_stats":null,"previous_names":["hasdata/playwright-scraping"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/HasData/playwright-scraping","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HasData%2Fplaywright-scraping","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/HasData%2Fplaywright-scraping/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HasData%2Fplaywright-scraping/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HasData%2Fplaywright-scraping/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HasData","download_url":"https://codeload.github.com/HasData/playwright-scraping/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HasData%2Fplaywright-scraping/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264207147,"owners_count":23572738,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","playwright","python","scraping","scraping-node","scraping-python"],"created_at":"2025-04-23T15:16:40.471Z","updated_at":"2025-07-08T06:07:05.655Z","avatar_url":"https://github.com/HasData.png","language":"JavaScript","readme":"![Python](https://img.shields.io/badge/python-3.7+-blue)\n![Node.js](https://img.shields.io/badge/node.js-18+-green)\n![Playwright](https://img.shields.io/badge/playwright-1.43.0-blueviolet)\n\n\n# Playwright Web Scraping Examples (Python \u0026 Node.js)\n[![HasData_banner](banner.png)](https://hasdata.com/)\n\nThis repository contains practical web scraping examples using **[Playwright](https://playwright.dev/)** in both **Python** and **Node.js**. It’s organized to take you from the basics to advanced techniques.\n\n\n## Table of Contents\n\n1. [Requirements](#requirements)\n2. 
[Project Structure](#project-structure)\n3. [Web Scraping With Playwright](#scripts)\n   - [Basics](#basics)\n   - [Scraping](#scraping)\n   - [Selectors](#selectors)\n   - [Interactions](#interactions)\n   - [Save Data](#save-data)\n   - [Auth](#auth)\n   - [Browser](#browser)\n   - [Errors](#errors)\n\n## Requirements\n\n**Python 3.7+ or Node.js 18+**\n\nPlaywright installed:\n\n### Python\n```\npip install playwright\nplaywright install\n```\n### Node.js\n```\nnpm install playwright\nnpx playwright install\n```\n\n## Project Structure\nThe project is organized into two main folders: one for Python scripts and one for Node.js. Both follow the same folder structure and implement the same functionality, just in different languages (Python scripts in Python/, Node.js scripts in NodeJS/).\n```\n.\n├── Python/\n│   ├── basics/\n│   │   ├── launch_browser.py\n│   │   ├── headless_vs_headful.py\n│   │   └── open_multiple_tabs.py\n│   ├── scraping/\n│   │   ├── extract_text_title.py\n│   │   ├── extract_links.py\n│   │   ├── extract_images.py\n│   │   ├── scrape_shadow_dom.py\n│   │   ├── wait_for_element.py\n│   │   ├── scrape_products_amazon.py\n│   │   └── scrape_woocommerce.py\n│   ├── selectors/\n│   │   ├── select_by_css.py\n│   │   ├── select_by_xpath.py\n│   │   ├── select_by_role.py\n│   │   └── select_by_text.py\n│   ├── interactions/\n│   │   ├── click_button.py\n│   │   ├── fill_form.py\n│   │   ├── select_dropdown.py\n│   │   ├── hover_element.py\n│   │   ├── click_pagination.py\n│   │   └── infinite_scroll.py\n│   ├── save data/\n│   │   ├── save_json.py\n│   │   ├── save_csv.py\n│   │   ├── save_pdf.py\n│   │   ├── download_files.py\n│   │   └── screenshot_element.py\n│   ├── auth/\n│   │   ├── basic_auth.py\n│   │   ├── save_cookies.py\n│   │   └── reuse_cookies.py\n│   ├── browser/\n│   │   ├── set_user_agent.py\n│   │   ├── use_proxy.py\n│   │   └── emulate_device.py\n│   ├── errors/\n│   │   └── retry_failed_requests.py\n│   └── debug/\n│       ├── record_video.py\n│       ├── 
record_trace.py\n│       ├── pause_script.py\n│       └── debug_console.py\n├── NodeJS/\n│   └── [Same folder structure with .js equivalents]\n```\n\nEach script demonstrates a specific feature of Playwright and can be run independently.\n\n## Basics\n\nThis section contains basic scripts showing how to launch a browser with Playwright, switch between headless and headful modes, and open multiple tabs in the same browser.\n\nThe table below lists the main commands for that. \n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| Launch browser | `browser = await playwright.chromium.launch()` | `const browser = await chromium.launch();` |\n| Headful mode (visible browser) | `launch(headless=False)` | `launch({ headless: false })` |\n| Open multiple tabs | `context.new_page()` | `context.newPage()` |\n\nYou can check out the full scripts in the project folder.\n\n## Scraping\n\nThis section contains scripts for extracting text, links, images, and working with complex elements like Shadow DOM or delayed content.\n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| Get page title | `title = await page.title()` | `const title = await page.title();` |\n| Extract links | `page.eval_on_selector_all(\"a\", \"...\")` | `page.$$eval('a', ...)` |\n| Get image URLs | `img.get_attribute(\"src\")` | `img.getAttribute(\"src\")` |\n| Scrape Shadow DOM | `locator = page.locator('css=shadow-root-selector')` | `page.locator('css=shadow-root-selector')` |\n| Wait for element | `await page.wait_for_selector(\".class\")` | `await page.waitForSelector('.class')` |\n\n\nThis part of the project includes two ready-to-use scrapers implemented in both Python and Node.js:\n\n- `scrape_products_amazon.py` / `scrape_products_amazon.js`\n- `scrape_woocommerce.py` / `scrape_woocommerce.js`\n\nIf you want to learn how to build similar scrapers step by step, check out the detailed guides:\n\n- [How to Scrape 
Amazon](https://hasdata.com/blog/scraping-amazon-product-data-using-python)  \n- [How to Scrape WooCommerce](https://hasdata.com/blog/woocommerce-scraping)\n\nAlternatively, you can use the no-code scrapers and APIs to quickly extract structured data from Amazon:\n\n- [Amazon Search Results](https://hasdata.com/scrapers/amazon-search-results)  \n- [Amazon Product Info](https://hasdata.com/scrapers/amazon-product)  \n- [Amazon Reviews](https://hasdata.com/scrapers/amazon-reviews)  \n- [Amazon Bestsellers](https://hasdata.com/scrapers/amazon-bestsellers)  \n- [Amazon Customer FAQs](https://hasdata.com/scrapers/amazon-customer-faqs)  \n- [Amazon Price Tracker](https://hasdata.com/scrapers/amazon-price)\n\n\n## Selectors\n\nThis section demonstrates different ways to select elements on a page using CSS selectors, XPath, roles, and text content.\n\nThe table below lists the main commands for that.\n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| CSS selector | `page.locator(\"div \u003e span\")` | `page.locator(\"div \u003e span\")` |\n| XPath selector | `page.locator('//h1')` | `page.locator('//h1')` |\n| Select by role | `page.get_by_role(\"button\")` | `page.getByRole('button')` |\n| Select by text | `page.get_by_text(\"Login\")` | `page.getByText('Login')` |\n\n##  Interactions\n\nThis section covers scripts that simulate user actions like clicking buttons, filling out forms, selecting from dropdowns, hovering, and handling pagination or infinite scrolling.\nThe table below lists the main commands for that.\n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| Click button | `await page.click(\"button\")` | `await page.click('button')` |\n| Fill input | `await page.fill(\"#email\", \"test@example.com\")` | `await page.fill('#email', 'test@example.com')` |\n| Select dropdown | `await page.select_option(\"select\", \"value\")` | `await page.selectOption('select', 'value')` |\n| Hover element | `await 
page.hover(\".menu\")` | `await page.hover('.menu')` |\n| Pagination | `await page.click(\"text=Next\")` | `await page.click('text=Next')` |\n| Infinite scroll | `await page.evaluate(\"window.scrollBy(...)\")` | `await page.evaluate(() =\u003e window.scrollBy(...))` |\n\nAvoid hardcoded delays — they’re unreliable and make your scraper brittle.\n\nDon’t do this:\n#### Python\n```python\nawait asyncio.sleep(5)\n```\n#### Node.js\n```js\nawait new Promise(r =\u003e setTimeout(r, 5000));\n```\nDo this instead:\n#### Python\n```python\nawait page.wait_for_selector(\".product-thumb\")\n```\n#### Node.js\n```js\nawait page.waitForSelector(\".product-thumb\");\n```\nWaiting for the actual element is always better than guessing how long the page needs to load.\n\n\n## Save Data\n\nThis section includes examples for saving scraped data in various formats like JSON, CSV, or PDF, and for downloading files or capturing screenshots.\nThe table below lists the main commands for that.\n\n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| Save JSON | `json.dump(data, open(\"file.json\", \"w\"))` | `fs.writeFileSync('file.json', JSON.stringify(data))` |\n| Save CSV | `csv.writer(open(\"file.csv\", \"w\")).writerows(data)` | `fs.writeFileSync('file.csv', csvString)` |\n| Download files | `await page.click(\"a[download]\")` | `await page.click('a[download]')` |\n| Screenshot element | `await locator.screenshot(path=\"element.png\")` | `await locator.screenshot({ path: 'element.png' })` |\n| Save PDF | `await page.pdf(path=\"output.pdf\")` | `await page.pdf({ path: 'output.pdf' })` |\n\n## Auth\n\nThis section provides scripts for handling basic authentication, managing cookies, and reusing them across sessions.\nThe table below lists the main commands for that.\n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| Basic auth | `context = browser.new_context(http_credentials={...})` | `browser.newContext({ httpCredentials: {...} })` |\n| 
Save cookies | `context.cookies()` → save to file | `context.cookies()` → save to file |\n| Load cookies | `context.add_cookies(cookies)` | `context.addCookies(cookies)` |\n\n## Browser\n\nThis section focuses on controlling browser behavior with custom user agents, proxies, and device emulation.\nThe table below lists the main commands for that.\n\n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| Set user agent | `browser.new_context(user_agent=\"...\")` | `browser.newContext({ userAgent: \"...\" })` |\n| Use proxy | `launch(proxy={\"server\": \"http://...\"})` | `launch({ proxy: { server: \"http://...\" } })` |\n| Emulate device | `playwright.devices[\"iPhone 12\"]` | `devices['iPhone 12']` |\n\n\nIf you want to check available devices:\n#### Python\n\n```python\nprint(p.devices.keys())\n```\nThis will output something like:\n\n```python\ndict_keys(['iPhone 12', 'Pixel 5', 'Galaxy S9+', ...])\n```\n#### Node.js\n\n```js\nconsole.log(Object.keys(devices));\n```\n\nExample output:\n\n```js\n[\n  'Blackberry PlayBook',\n  'iPhone 12',\n  'Galaxy S9+',\n  'Pixel 5',\n  ...\n]\n```\n\n## Errors\n\nThis section contains examples for retrying failed requests and handling timeouts or unexpected responses.\nThe table below lists the main commands for that.\n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| Retry logic | `for i in range(retries): try/except` | `for (let i = 0; i \u003c retries; i++) try/catch` |\n\n## Debug\n\nThis section provides tools for debugging: recording videos and traces, pausing scripts, and inspecting with console logs.\nThe table below lists the main commands for that.\n\n| Description | Python | Node.js |\n|------------|--------|---------|\n| Record video | `record_video_dir=\"videos/\"` | `recordVideo: { dir: 'videos/' }` |\n| Record trace | `context.tracing.start()` / `context.tracing.stop(path=\"trace.zip\")` | `context.tracing.start()` / `context.tracing.stop({ path: 'trace.zip' })` |\n| Pause script | `await page.pause()` | `await 
page.pause()` |\n| Console logs | `page.on(\"console\", ...)` | `page.on('console', ...)` |\n\n\nYou can check out the full scripts in the project folder.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhasdata%2Fplaywright-scraping","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhasdata%2Fplaywright-scraping","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhasdata%2Fplaywright-scraping/lists"}