{"id":18301288,"url":"https://github.com/spider-rs/spider-nodejs","last_synced_at":"2025-03-31T11:01:31.410Z","repository":{"id":209314827,"uuid":"723721413","full_name":"spider-rs/spider-nodejs","owner":"spider-rs","description":"Spider ported to Node.js","archived":false,"fork":false,"pushed_at":"2025-01-28T18:09:51.000Z","size":4426,"stargazers_count":42,"open_issues_count":0,"forks_count":7,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-23T11:02:16.853Z","etag":null,"topics":["crawler","distributed-systems","headless-chrome","indexer","nodejs","scraper","spider","typescript"],"latest_commit_sha":null,"homepage":"https://spider-rs.github.io/spider-nodejs/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spider-rs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-26T15:12:01.000Z","updated_at":"2025-03-07T11:36:30.000Z","dependencies_parsed_at":"2023-12-15T02:58:16.065Z","dependency_job_id":"7348bbd4-1fe5-4462-b3b9-d7c4e3e19807","html_url":"https://github.com/spider-rs/spider-nodejs","commit_stats":{"total_commits":200,"total_committers":2,"mean_commits":100.0,"dds":"0.0050000000000000044","last_synced_commit":"1a5e8ab7cb1991194bd60305121b2c055d3b8b4e"},"previous_names":["spider-rs/spider-nodejs"],"tags_count":48,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spider-rs%2Fspider-nodejs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spider-rs%2Fspider-nodejs/tags","releases_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spider-rs%2Fspider-nodejs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spider-rs%2Fspider-nodejs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spider-rs","download_url":"https://codeload.github.com/spider-rs/spider-nodejs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246457967,"owners_count":20780676,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","distributed-systems","headless-chrome","indexer","nodejs","scraper","spider","typescript"],"created_at":"2024-11-05T15:15:03.082Z","updated_at":"2025-03-31T11:01:31.374Z","avatar_url":"https://github.com/spider-rs.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spider-rs\n\nThe [spider](https://github.com/spider-rs/spider) project ported to Node.js\n\n## Getting Started\n\n1. 
`npm i @spider-rs/spider-rs --save`\n\n```ts\nimport { Website, pageTitle } from '@spider-rs/spider-rs'\n\nconst website = new Website('https://rsseau.fr')\n  .withHeaders({\n    authorization: 'somerandomjwt',\n  })\n  .withBudget({\n    '*': 20, // limit max request 20 pages for the website\n    '/docs': 10, // limit only 10 pages on the `/docs` paths\n  })\n  .withBlacklistUrl(['/resume']) // regex or pattern matching to ignore paths\n  .build()\n\n// optional: page event handler\nconst onPageEvent = (_err, page) =\u003e {\n  const title = pageTitle(page) // comment out to increase performance if title not needed\n  console.info(`Title of ${page.url} is '${title}'`)\n  website.pushData({\n    status: page.statusCode,\n    html: page.content,\n    url: page.url,\n    title,\n  })\n}\n\nawait website.crawl(onPageEvent)\nawait website.exportJsonlData('./storage/rsseau.jsonl')\nconsole.log(website.getLinks())\n```\n\nCollect the resources for a website.\n\n```ts\nimport { Website } from '@spider-rs/spider-rs'\n\nconst website = new Website('https://rsseau.fr')\n  .withBudget({\n    '*': 20,\n    '/docs': 10,\n  })\n  // you can use regex or string matches to ignore paths\n  .withBlacklistUrl(['/resume'])\n  .build()\n\nawait website.scrape()\nconsole.log(website.getPages())\n```\n\nRun the crawls in the background on another thread.\n\n```ts\nimport { Website } from '@spider-rs/spider-rs'\n\nconst website = new Website('https://rsseau.fr')\n\nconst onPageEvent = (_err, page) =\u003e {\n  console.log(page)\n}\n\nawait website.crawl(onPageEvent, true)\n// runs immediately\n```\n\nUse headless Chrome rendering for crawls.\n\n```ts\nimport { Website } from '@spider-rs/spider-rs'\n\nconst website = new Website('https://rsseau.fr').withChromeIntercept(true, true)\n\nconst onPageEvent = (_err, page) =\u003e {\n  console.log(page)\n}\n\n// the third param determines headless chrome usage.\nawait website.crawl(onPageEvent, false, 
true)\nconsole.log(website.getLinks())\n```\n\nSchedule cron jobs as follows.\n\n```ts\nimport { Website } from '@spider-rs/spider-rs'\n\nconst website = new Website('https://choosealicense.com').withCron('1/5 * * * * *')\n// sleep function to test cron\nconst stopCron = (time: number, handle) =\u003e {\n  return new Promise((resolve) =\u003e {\n    setTimeout(() =\u003e {\n      resolve(handle.stop())\n    }, time)\n  })\n}\n\nconst links = []\n\nconst onPageEvent = (err, value) =\u003e {\n  links.push(value)\n}\n\nconst handle = await website.runCron(onPageEvent)\n\n// stop the cron in 4 seconds\nawait stopCron(4000, handle)\n```\n\nUse the crawl shortcut to get the page content and URL.\n\n```ts\nimport { crawl } from '@spider-rs/spider-rs'\n\nconst { links, pages } = await crawl('https://rsseau.fr')\nconsole.log(pages)\n```\n\n## Benchmarks\n\nView the [benchmarks](./bench/README.md) to see a breakdown between libs and platforms.\n\nTest URL: `https://espn.com`\n\n| `libraries`                  | `pages`   | `speed` |\n| :--------------------------- | :-------- | :------ |\n| **`spider(rust): crawl`**    | `150,387` | `1m`    |\n| **`spider(nodejs): crawl`**  | `150,387` | `153s`  |\n| **`spider(python): crawl`**  | `150,387` | `186s`  |\n| **`scrapy(python): crawl`**  | `49,598`  | `1h`    |\n| **`crawlee(nodejs): crawl`** | `18,779`  | `30m`   |\n\nThe benchmarks above were run on a Mac M1; spider on Linux ARM machines performs about 2-10x faster.\n\n## Development\n\nInstall the napi CLI with `npm i @napi-rs/cli --global`.\n\n1. `yarn build:test`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspider-rs%2Fspider-nodejs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspider-rs%2Fspider-nodejs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspider-rs%2Fspider-nodejs/lists"}