https://github.com/spider-rs/spider-nodejs

Spider ported to Node.js
https://github.com/spider-rs/spider-nodejs

crawler distributed-systems headless-chrome indexer nodejs scraper spider typescript

Last synced: 6 months ago
JSON representation

Spider ported to Node.js

Host: GitHub
URL: https://github.com/spider-rs/spider-nodejs
Owner: spider-rs
License: mit
Created: 2023-11-26T15:12:01.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-01-28T18:09:51.000Z (8 months ago)
Last Synced: 2025-03-23T11:02:16.853Z (7 months ago)
Topics: crawler, distributed-systems, headless-chrome, indexer, nodejs, scraper, spider, typescript
Language: Rust
Homepage: https://spider-rs.github.io/spider-nodejs/
Size: 4.22 MB
Stars: 42
Watchers: 2
Forks: 7
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # spider-rs

The [spider](https://github.com/spider-rs/spider) project ported to Node.js

## Getting Started

1. `npm i @spider-rs/spider-rs --save`

```ts

import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

  .withHeaders({

    authorization: 'somerandomjwt',

  })

  .withBudget({

    '*': 20, // limit max request 20 pages for the website

    '/docs': 10, // limit only 10 pages on the `/docs` paths

  })

  .withBlacklistUrl(['/resume']) // regex or pattern matching to ignore paths

  .build()

// optional: page event handler

const onPageEvent = (_err, page) => {

  const title = pageTitle(page) // comment out to increase performance if title not needed

  console.info(`Title of ${page.url} is '${title}'`)

  website.pushData({

    status: page.statusCode,

    html: page.content,

    url: page.url,

    title,

  })

}

await website.crawl(onPageEvent)

await website.exportJsonlData('./storage/rsseau.jsonl')

console.log(website.getLinks())

```

Collect the resources for a website.

```ts

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

  .withBudget({

    '*': 20,

    '/docs': 10,

  })

  // you can use regex or string matches to ignore paths

  .withBlacklistUrl(['/resume'])

  .build()

await website.scrape()

console.log(website.getPages())

```

Run the crawls in the background on another thread.

```ts

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (_err, page) => {

  console.log(page)

}

await website.crawl(onPageEvent, true)

// runs immediately

```

Use headless Chrome rendering for crawls.

```ts

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr').withChromeIntercept(true, true)

const onPageEvent = (_err, page) => {

  console.log(page)

}

// the third param determines headless chrome usage.

await website.crawl(onPageEvent, false, true)

console.log(website.getLinks())

```

Cron jobs can be done with the following.

```ts

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *')

// sleep function to test cron

const stopCron = (time: number, handle) => {

  return new Promise((resolve) => {

    setTimeout(() => {

      resolve(handle.stop())

    }, time)

  })

}

const links = []

const onPageEvent = (err, value) => {

  links.push(value)

}

const handle = await website.runCron(onPageEvent)

// stop the cron in 4 seconds

await stopCron(4000, handle)

```

Use the crawl shortcut to get the page content and url.

```ts

import { crawl } from '@spider-rs/spider-rs'

const { links, pages } = await crawl('https://rsseau.fr')

console.log(pages)

```

## Benchmarks

View the [benchmarks](./bench/README.md) to see a breakdown between libs and platforms.

Test url: `https://espn.com`

| `libraries`                  | `pages`   | `speed` |

| :--------------------------- | :-------- | :------ |

| **`spider(rust): crawl`**    | `150,387` | `1m`    |

| **`spider(nodejs): crawl`**  | `150,387` | `153s`  |

| **`spider(python): crawl`**  | `150,387` | `186s`  |

| **`scrapy(python): crawl`**  | `49,598`  | `1h`    |

| **`crawlee(nodejs): crawl`** | `18,779`  | `30m`   |

The benches above were ran on a mac m1, spider on linux arm machines performs about 2-10x faster.

## Development

Install the napi cli `npm i @napi-rs/cli --global`.

1. `yarn build:test`

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/spider-rs/spider-nodejs

Awesome Lists containing this project

README