{"id":22347265,"url":"https://github.com/get-set-fetch/scraper","last_synced_at":"2025-07-30T04:33:10.029Z","repository":{"id":37293313,"uuid":"318795524","full_name":"get-set-fetch/scraper","owner":"get-set-fetch","description":"Nodejs web scraper. Contains a command line, docker container, terraform module and ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, JSdom.","archived":false,"fork":false,"pushed_at":"2023-03-13T10:40:11.000Z","size":2367,"stargazers_count":104,"open_issues_count":12,"forks_count":16,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-08-08T15:46:43.999Z","etag":null,"topics":["cloud","nodejs","scraper","scraping","web"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/get-set-fetch.png","metadata":{"files":{"readme":"README.md","changelog":"changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-05T13:28:50.000Z","updated_at":"2024-08-06T09:10:44.000Z","dependencies_parsed_at":"2022-07-15T21:16:45.543Z","dependency_job_id":null,"html_url":"https://github.com/get-set-fetch/scraper","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/get-set-fetch%2Fscraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/get-set-fetch%2Fscraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/get-set-fetch%2Fscraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/get-set
-fetch%2Fscraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/get-set-fetch","download_url":"https://codeload.github.com/get-set-fetch/scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228088846,"owners_count":17867481,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud","nodejs","scraper","scraping","web"],"created_at":"2024-12-04T10:08:49.267Z","updated_at":"2024-12-04T10:08:49.863Z","avatar_url":"https://github.com/get-set-fetch.png","language":"TypeScript","readme":"\u003cimg src=\"https://get-set-fetch.github.io/get-set-fetch/logo.png\"\u003e\n\n[![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/get-set-fetch/scraper/blob/main/LICENSE)\n[![Audit Status](https://github.com/get-set-fetch/scraper/workflows/audit/badge.svg)](https://github.com/get-set-fetch/scraper/actions?query=workflow%3Aaudit)\n[![Build Status](https://github.com/get-set-fetch/scraper/workflows/test/badge.svg)](https://github.com/get-set-fetch/scraper/actions?query=workflow%3Atest)\n[![Coverage Status](https://coveralls.io/repos/github/get-set-fetch/scraper/badge.svg?branch=main)](https://coveralls.io/github/get-set-fetch/scraper?branch=main)\n\n![](https://get-set-fetch.github.io/documentation/site/assets/img/cli-demo.svg)\n\n# Node.js web scraper\n\nget-set, Fetch! is a plugin-based Node.js web scraper. It scrapes, stores and exports data. 
\\\nAt its core, an ordered list of plugins is executed against each URL to be scraped.\n\nSupported databases: SQLite, MySQL, PostgreSQL. \\\nSupported browser clients: Puppeteer, Playwright. \\\nSupported DOM-like clients: Cheerio, JSdom.\n\n#### Use it in your own JavaScript/TypeScript code\n```\nimport { Scraper, ScrapeEvent, Project, CsvExporter } from '@get-set-fetch/scraper';\n\nconst scraper = new Scraper(ScrapeConfig.storage, ScrapeConfig.client);\nscraper.on(ScrapeEvent.ProjectScraped, async (project: Project) =\u003e {\n  const exporter = new CsvExporter({ filepath: 'languages.csv' });\n  await exporter.export(project);\n});\n\nscraper.scrape(ScrapeConfig.project, ScrapeConfig.concurrency);\n```\nNote: the package is exported both as CommonJS and ES Module.\n\n#### Use it from the command line\n```\ngsfscrape \\\n--config scrape-config.json \\\n--loglevel info --logdestination scrape.log \\\n--save \\\n--overwrite \\\n--export project.csv\n```\n\n#### Run it with Docker\n```\ndocker run \\\n-v \u003chost_dir\u003e/scraper/docker/data:/home/gsfuser/scraper/data getsetfetch:latest \\\n--version \\\n--config data/scrape-config.json \\\n--save \\\n--overwrite \\\n--scrape \\\n--loglevel info \\\n--logdestination data/scrape.log \\\n--export data/export.csv\n```\nNote: you have to build the image manually from the './docker' directory.\n\n#### Run it in cloud with Terraform and Ansible\n```\nmodule \"benchmark_1000k_1project_multiple_scrapers_csv_urls\" {\n  source = \"../../node_modules/@get-set-fetch/scraper/cloud/terraform\"\n\n  region                 = \"fra1\"\n  public_key_name        = \"get-set-fetch\"\n  public_key_file        = var.public_key_file\n  private_key_file       = var.private_key_file\n  ansible_inventory_file = \"../ansible/inventory/hosts.cfg\"\n\n  pg = {\n    name                  = \"pg\"\n    image                 = \"ubuntu-20-04-x64\"\n    size                  = \"s-4vcpu-8gb\"\n    ansible_playbook_file = \"../ansible/pg-setup.yml\"\n  }\n\n  
scraper = {\n    count                 = 4\n    name                  = \"scraper\"\n    image                 = \"ubuntu-20-04-x64\"\n    size                  = \"s-1vcpu-1gb\"\n    ansible_playbook_file = \"../ansible/scraper-setup.yml\"\n  }\n}\n```\nNote: at the moment only the DigitalOcean Terraform provider is supported.\nSee [datasets](datasets/) for some examples.\n\n### Benchmarks\nFor quick, small projects under 10K URLs, storing the queue and scraped content in SQLite is fine. For anything larger use PostgreSQL. You will be able to start/stop/resume the scraping process across multiple scraper instances, each with its own IP and/or dedicated proxies. \n\nUsing a PostgreSQL database and 4 scraper instances, it takes 9 minutes to scrape 1 million URLs. That's roughly 0.5 ms per scraped URL. The scrapers use synthetic data; there is no external traffic, so results are not influenced by web server response times or upload/download speeds. See [benchmarks](https://github.com/get-set-fetch/benchmarks) for more info.\n\n![](https://get-set-fetch.github.io/benchmarks/charts/v0.10.0-total-exec-time-1e6-saved-entries.svg)\n\n### Getting Started\n\nWhat follows is a brief \"Getting Started\" guide using SQLite as storage and Puppeteer as the browser client. For in-depth documentation visit [getsetfetch.org](https://www.getsetfetch.org). See [changelog](changelog.md) for past release notes and [development](development.md) for technical tidbits.\n\n#### Install the scraper\n```\n$ npm install @get-set-fetch/scraper\n```\n\n#### Install peer dependencies\n```\n$ npm install knex @vscode/sqlite3 puppeteer\n```\nSupported storage options and browser clients are defined as peer dependencies. 
Manually install the ones you select.\n\n#### Init storage\n```js\nconst { KnexConnection } = require('@get-set-fetch/scraper');\nconst connConfig = {\n  client: 'sqlite3',\n  useNullAsDefault: true,\n  connection: {\n    filename: ':memory:'\n  }\n}\nconst conn = new KnexConnection(connConfig);\n```\nSee [Storage](https://www.getsetfetch.org/node/storage.html) for the full configuration options of the supported SQLite, MySQL and PostgreSQL clients.\n\n#### Init browser client\n```js\nconst { PuppeteerClient } = require('@get-set-fetch/scraper');\nconst launchOpts = {\n  headless: true,\n}\nconst client = new PuppeteerClient(launchOpts);\n```\n\n#### Init scraper\n```js\nconst { Scraper } = require('@get-set-fetch/scraper');\nconst scraper = new Scraper(conn, client);\n```\n\n#### Define project options\n```js\nconst projectOpts = {\n  name: \"myScrapeProject\",\n  pipeline: 'browser-static-content',\n  pluginOpts: [\n    {\n      name: 'ExtractUrlsPlugin',\n      maxDepth: 3,\n      selectorPairs: [\n        {\n          urlSelector: '#searchResults ~ .pagination \u003e a.ChoosePage:nth-child(2)',\n        },\n        {\n          urlSelector: 'h3.booktitle a.results',\n        },\n        {\n          urlSelector: 'a.coverLook \u003e img.cover',\n        },\n      ],\n    },\n    {\n      name: 'ExtractHtmlContentPlugin',\n      selectorPairs: [\n        {\n          contentSelector: 'h1.work-title',\n          label: 'title',\n        },\n        {\n          contentSelector: 'h2.edition-byline a',\n          label: 'author',\n        },\n        {\n          contentSelector: 'ul.readers-stats \u003e li.avg-ratings \u003e span[itemProp=\"ratingValue\"]',\n          label: 'rating value',\n        },\n        {\n          contentSelector: 'ul.readers-stats \u003e li \u003e span[itemProp=\"reviewCount\"]',\n          label: 'review count',\n        },\n      ],\n    },\n  ],\n  resources: [\n    {\n      url: 'https://openlibrary.org/authors/OL34221A/Isaac_Asimov?page=1'\n    }\n  
]\n};\n```\nYou can define a project in multiple ways. The above example is the most direct one.\n\nYou define one or more starting URLs, a predefined pipeline containing a series of scrape plugins with default options, and any plugin options you want to override. See [pipelines](https://www.getsetfetch.org/node/pipelines.html) and [plugins](https://www.getsetfetch.org/node/plugins.html) for all available options.\n\nExtractUrlsPlugin.maxDepth defines the maximum depth of resources to be scraped. The starting resource has depth 0. Resources discovered from it have depth 1 and so on. A value of -1 disables this check.\n\nExtractUrlsPlugin.selectorPairs defines CSS selectors for discovering new resources. The urlSelector property selects the links, while the optional titleSelector can be used for renaming binary resources like images or PDFs. In order, the defined selectorPairs extract pagination URLs, book detail URLs and image cover URLs.\n\nExtractHtmlContentPlugin.selectorPairs scrapes content via CSS selectors. Optional labels can be used for specifying columns when exporting results as CSV.\n\n#### Define concurrency options\n```js\nconst concurrencyOpts = {\n  project: {\n    delay: 1000\n  },\n  domain: {\n    delay: 5000\n  }\n}\n```\nA minimum delay of 5000 ms will be enforced between scraping consecutive resources from the same domain. At project level, across all domains, any two resources will be scraped with a minimum 1000 ms delay between requests. See [concurrency options](https://www.getsetfetch.org/node/scrape.html#concurrency-options) for all available options.\n\n#### Start scraping\n```js\nscraper.scrape(projectOpts, concurrencyOpts);\n```\nThe entire process is asynchronous. 
Listen to the emitted [scrape events](https://www.getsetfetch.org/node/scrape.html#scrape-events) to monitor progress.\n\n#### Export results\n```js\nconst { ScrapeEvent, CsvExporter, ZipExporter } = require('@get-set-fetch/scraper');\n\nscraper.on(ScrapeEvent.ProjectScraped, async (project) =\u003e {\n  const csvExporter = new CsvExporter({ filepath: 'books.csv' });\n  await csvExporter.export(project);\n\n  const zipExporter = new ZipExporter({ filepath: 'book-covers.zip' });\n  await zipExporter.export(project);\n})\n```\nWait for scraping to complete by listening to the `ProjectScraped` event.\n\nExport scraped HTML content as CSV. Export scraped images in a zip archive. See [Export](https://www.getsetfetch.org/node/export.html) for all supported parameters.\n\n\n#### Browser Extension\nThis project is based on lessons learned developing [get-set-fetch-extension](https://github.com/get-set-fetch/extension), a scraping browser extension for Chrome, Firefox and Edge.\n\nBoth projects share the same storage, pipeline and plugin concepts but unfortunately no code. I'm planning to fix this in the future so that code from the scraper can be reused in the extension. ","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fget-set-fetch%2Fscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fget-set-fetch%2Fscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fget-set-fetch%2Fscraper/lists"}