https://github.com/themaximalist/scrape.js

Web Scraping Library for Node.js

Host: GitHub
URL: https://github.com/themaximalist/scrape.js
Owner: themaximalist
License: mit
Created: 2023-05-06T23:51:29.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2024-02-23T08:37:14.000Z (over 2 years ago)
Last Synced: 2025-06-03T08:32:13.129Z (about 1 year ago)
Topics: scraping, web, web-scraping
Language: CSS
Homepage: https://scrapejs.themaximalist.com/
Size: 140 KB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

## Scrape.js

`Scrape.js` is an easy-to-use web scraping library for Node.js.

```javascript
const data = await scrape("https://example.com");
// { url, html }
```

**Features**

* Fast

* Scrape nearly any website

* Headless JavaScript scraping

* Auto proxy rotation

* ...it just works

* MIT License

## Install

Install `Scrape.js` from npm:

```bash
npm install @themaximalist/scrape.js
```

## Config

`Scrape.js` uses [Zen Rows](https://www.zenrows.com/) for proxy rotation. To use it, obtain a Zen Rows API key and set the `ZENROWS_API_KEY` environment variable:

```bash
ZENROWS_API_KEY=abcxyz123
```

`Scrape.js` can be used without proxies, but is less effective.
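
Because proxy rotation depends on the Zen Rows key, one option is to turn the `proxy` option on only when the key is present. The following is a minimal sketch under that assumption: `proxyOptionsFromEnv` is a hypothetical helper, not part of `Scrape.js`, and it relies only on the `proxy` option documented in this README.

```javascript
// Hypothetical helper (not part of Scrape.js): enable the `proxy` option
// only when a Zen Rows API key is configured in the environment.
function proxyOptionsFromEnv(env = process.env) {
    const key = env.ZENROWS_API_KEY;
    const hasKey = typeof key === "string" && key.trim() !== "";
    return { proxy: hasKey };
}

// Usage (inside an async function):
//   const data = await scrape("https://example.com", proxyOptionsFromEnv());
```

Passing the environment in as a parameter keeps the helper easy to test without touching `process.env`.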

## Usage

Using `Scrape.js` is as simple as calling a function with a website URL.

```javascript
const scrape = require("@themaximalist/scrape.js");

// Call from inside an async function: top-level await is not available in CommonJS.
await scrape("http://example.com"); // { url, html }
```

You can specify additional options to `scrape()` for more control:

```javascript
const data = await scrape("https://example.com", {
    headless: true,
    proxy: true
});
// { url, html }
```

## API

The `Scrape.js` API is a simple function you call with your URL, with an optional config object.

```javascript
await scrape(
    url, // URL to scrape
    {
        headless: true, // Use JavaScript headless scraping
        proxy: true, // Use proxy rotation
        method: "GET", // HTTP Request method
        timeout: 3000, // Scrape timeout in ms
        userAgent: "Mozilla/5.0...", // User Agent
    }
);
```

**URL (required)**

* **`url`**: URL to scrape

**Options**

* **`headless`**: Enable JavaScript. Default is `true`.

* **`proxy`**: Use proxy with request. Default is `true`.

* **`method`**: HTTP request method, usually `GET` or `POST`. Default is `GET`.

* **`timeout`**: Max request time in ms. Default is `3500`.

* **`userAgent`**: User agent for request. Default is `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36`.
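
To make the defaults above concrete, here is a minimal sketch of how caller-supplied options could combine with them. The default values are taken from the list above; `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names, and how `Scrape.js` merges options internally is an assumption.

```javascript
// Defaults as listed in this README. How Scrape.js applies them
// internally is an assumption; this only illustrates the override rule.
const DEFAULT_OPTIONS = Object.freeze({
    headless: true,
    proxy: true,
    method: "GET",
    timeout: 3500,
    userAgent:
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
});

// Hypothetical helper: caller-supplied options override the documented defaults.
function resolveOptions(options = {}) {
    return { ...DEFAULT_OPTIONS, ...options };
}

const resolved = resolveOptions({ headless: false, timeout: 10000 });
console.log(resolved.headless, resolved.proxy, resolved.method, resolved.timeout);
// false true GET 10000
```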

**Response**

`Scrape.js` returns an `object` containing the final `url` and `html` content.

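```javascript
const { url, html } = await scrape("https://example.com");
console.log(url); // https://example.com/
console.log(html); // HTML content of the page
```

## Debug

To view debug logs, set the `DEBUG` environment variable when running your script:

```bash
DEBUG=scrape.js* node src/get_website_html.js
# debug logs
```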

## Examples

View the [tests](https://github.com/themaximal1st/scrape.js/tree/main/test) for examples of how to use `Scrape.js`.
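
As a further illustration, the sketch below shows one thing you might do with the returned `html`: pull out the page title. `extractTitle` is a hypothetical helper, not part of `Scrape.js`, and the `data` object is a made-up stand-in for a real `scrape()` result so the snippet runs without network access.

```javascript
// Hypothetical helper: pull the <title> text out of an HTML string.
// A regex is enough for a sketch; prefer a real HTML parser for robust use.
function extractTitle(html) {
    const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
    return match ? match[1].trim() : null;
}

// Stand-in for the object returned by scrape(); the HTML here is made up.
const data = {
    url: "https://example.com/",
    html: "<html><head><title> Example Domain </title></head><body></body></html>",
};

console.log(extractTitle(data.html)); // Example Domain
```

With the real library, the same helper would be applied to the `html` field of the object that `scrape()` resolves to.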

## Projects

`Scrape.js` is currently used in the following projects:

-   [News Score](https://newsscore.com): score the news, rewrite the headlines

## License

MIT

## Author

Created by [The Maximalist](https://twitter.com/themaximal1st), see our [open-source projects](https://themaximalist.com/products).
