https://github.com/apify/super-scraper

Generic REST API for scraping websites. Drop-in replacement for ScrapingBee, ScrapingAnt, and ScraperAPI services. And it is open-source!

api apify cheerio javascript nodejs playwright scraping typescript web-scraping

Host: GitHub
URL: https://github.com/apify/super-scraper
Owner: apify
Created: 2024-03-18T11:30:12.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2024-11-29T08:40:19.000Z (10 months ago)
Last Synced: 2025-04-11T22:11:27.176Z (6 months ago)
Topics: api, apify, cheerio, javascript, nodejs, playwright, scraping, typescript, web-scraping
Language: TypeScript
Homepage: https://apify.com/apify/super-scraper-api
Size: 101 KB
Stars: 25
Watchers: 7
Forks: 7
Open Issues: 0
Metadata Files:
- Readme: README.md

# SuperScraper API

SuperScraper API is an Actor that provides a REST API for scraping websites.
Just pass the URL of a web page and get back the fully rendered HTML content.
SuperScraper API is compatible with [ScrapingBee](https://www.scrapingbee.com/),
[ScrapingAnt](https://scrapingant.com/),
and [ScraperAPI](https://scraperapi.com/) interfaces.

Main features:
- Extract HTML from arbitrary URLs with a headless browser for dynamic content rendering.
- Circumvent blocking using datacenter or residential proxies, as well as browser fingerprinting.
- Seamlessly scale to a large number of web pages as needed.
- Capture screenshots of the web pages.

Note that SuperScraper API uses the new experimental Actor Standby mode, so it's not started the traditional way from Apify Console.
Instead, it's invoked via the HTTP REST API provided directly by the Actor. See the examples below.

## Usage examples

To run these examples, you need an Apify API token,
which you can find under [Settings > Integrations](https://console.apify.com/account/integrations) in Apify Console.

You can create an Apify account free of charge.

### Node.js

```ts
import axios from 'axios';

const resp = await axios.get('https://super-scraper-api.apify.actor/', {
    params: {
        url: 'https://apify.com/store',
        wait_for: '.ActorStoreItem-title',
        json_response: true,
        screenshot: true,
    },
    headers: {
        // Replace the placeholder with your own Apify API token.
        Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>',
    },
});

console.log(resp.data);
```

### curl

```shell
curl -X GET \
  'https://super-scraper-api.apify.actor/?url=https://apify.com/store&wait_for=.ActorStoreItem-title&screenshot=true&json_response=true' \
  --header 'Authorization: Bearer <YOUR_APIFY_API_TOKEN>'
```

## Authentication

The best way to authenticate is to pass your Apify API token using the `Authorization` HTTP header.
Alternatively, you can pass the API token via the `token` query parameter to authenticate the requests, which is more convenient for testing in a web browser.

### Node.js

```ts
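import axios from 'axios';

const resp = await axios.get('https://super-scraper-api.apify.actor/', {
    params: {
        url: 'https://apify.com/store',
        // Replace the placeholder with your own Apify API token.
        token: '<YOUR_APIFY_API_TOKEN>',
    },
});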
```

### curl

```shell
curl -X GET 'https://super-scraper-api.apify.actor/?url=https://apify.com/store&wait_for=.ActorStoreItem-title&json_response=true&token=<YOUR_APIFY_API_TOKEN>'
```

## Pricing

When using SuperScraper API, you're charged based on your actual usage of the Apify platform's computing, storage, and networking resources.

Cost depends on the target sites, your settings and API parameters, the load of your requests, and random network and target site conditions.

The best way to see your price is to conduct a real-world test.

The table below shows example costs on a free account (pricing is lower on higher plans), measured in a test of 30 one-by-one requests plus 50 batched requests:

| parameters | cost estimate |
| ------------- |-----------------------------------|
| no `render_js` + basic proxy | $1/1000 requests |
| no `render_js` + premium (residential) proxy | $2/1000 requests |
| `render_js` + basic proxy | $4/1000 requests |
| `render_js` + premium (residential) proxy | $5/1000 requests |
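
To turn the table above into a rough budget, multiply the per-1000 rate by the expected number of requests. The following is a minimal sketch, assuming the example rates hold for your workload (they come from one test, not from a price list). The names `RATES_PER_1000` and `estimateCostUsd` are illustrative and not part of the API.

```ts
// Example rates in USD per 1000 requests, copied from the table above.
// They were measured in a single test on a free account and will vary in practice.
const RATES_PER_1000 = {
    'no-js:basic': 1,
    'no-js:premium': 2,
    'js:basic': 4,
    'js:premium': 5,
} as const;

interface CostOptions {
    renderJs: boolean;
    premiumProxy: boolean;
}

// Hypothetical helper: a linear estimate of the cost of a number of requests.
function estimateCostUsd(requests: number, { renderJs, premiumProxy }: CostOptions): number {
    const key = `${renderJs ? 'js' : 'no-js'}:${premiumProxy ? 'premium' : 'basic'}` as keyof typeof RATES_PER_1000;
    return (requests / 1000) * RATES_PER_1000[key];
}

console.log(estimateCostUsd(2500, { renderJs: true, premiumProxy: false })); // 10
```

Treat the result as an order-of-magnitude figure only; a real-world test remains the reliable way to find your price.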

## API parameters

### ScrapingBee API parameters

SuperScraper API supports most of the API parameters of [ScrapingBee](https://www.scrapingbee.com/documentation/):

| parameter | description |
| -------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `url` | URL of the webpage to be scraped. **This parameter is required.** |
| `json_response` | Return a verbose JSON response with additional details about the webpage. Can be either `true` or `false`, default is `false`. |
| `extract_rules` | A stringified JSON containing custom rules for extracting data from the webpage. |
| `render_js` | Indicates that the webpage should be scraped using a headless browser, with dynamic content rendered. Can be `true` or `false`, default is `true`. This is equivalent to ScrapingAnt's `browser`. |
| `screenshot` | Get a screenshot of the browser's current viewport. If `json_response` is set to `true`, the screenshot will be returned in Base64 encoding. Can be `true` or `false`, default is `false`. |
| `screenshot_full_page` | Get a screenshot of the full page. If `json_response` is set to `true`, the screenshot will be returned in Base64 encoding. Can be `true` or `false`, default is `false`. |
| `screenshot_selector` | Get a screenshot of the element specified by the selector. If `json_response` is set to `true`, the screenshot will be returned in Base64 encoding. Must be a non-empty string. |
| `js_scenario` | JavaScript instructions that will be executed after loading the webpage. |
| `wait` | Specify a duration that the browser will wait after loading the page, in milliseconds. |
| `wait_for` | Specify a CSS selector of an element for which the browser will wait after loading the page. |
| `wait_browser` | Specify a browser event to wait for. Can be either `load`, `domcontentloaded`, or `networkidle`. |
| `block_resources` | Specify that you want to block images and CSS. Can be `true` or `false`, default is `true`. |
| `window_width` | Specify the width of the browser's viewport, in pixels. |
| `window_height` | Specify the height of the browser's viewport, in pixels. |
| `cookies` | Custom cookies to use to fetch the web pages. This is useful for fetching webpages behind a login. The cookies must be specified in a string format: `cookie_name_1=cookie_value1;cookie_name_2=cookie_value_2`. |
| `own_proxy` | A custom proxy to be used for scraping, in the format `<protocol><username>:<password>@<host>:<port>`. |
| `premium_proxy` | Use residential proxies to fetch the web content, in order to reduce the probability of being blocked. Can be either `true` or `false`, default is `false`. |
| `stealth_proxy` | Works the same as `premium_proxy`. |
| `country_code` | Use IP addresses that are geolocated in the specified country by specifying its [2-letter ISO code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements). When using a code other than `US`, `premium_proxy` must be set to `true`. This is equivalent to setting ScrapingAnt's `proxy_country`. |
| `custom_google` | Use this option if you want to scrape Google-related websites (such as Google Search or Google Shopping). Can be `true` or `false`, default is `false`. |
| `return_page_source` | Return HTML of the webpage from the response before any dynamic JavaScript rendering. Can be `true` or `false`, default is `false`. |
| `transparent_status_code` | By default, if the target webpage responds with an HTTP status code other than 200-299 or 404, the API will return HTTP status code 500. Set this parameter to `true` to disable this behavior and return the status code of the actual response. |
| `timeout` | Set maximum timeout for the response from this Actor, in milliseconds. The default is 140 000 ms. |
| `forward_headers` | If set to `true`, HTTP headers starting with prefix `Spb-` or `Ant-` will be forwarded to the target webpage alongside headers generated by us (the prefix will be trimmed). |
| `forward_headers_pure` | If set to `true`, only headers starting with prefix `Spb-` or `Ant-` will be forwarded to the target webpage (prefix will be trimmed), without any other HTTP headers from our side. |
| `device` | Can be either `desktop` (default) or `mobile`. |

ScrapingBee's API parameters `block_ads` and `session_id` are currently not supported.
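
The parameters above are passed as ordinary query parameters. The sketch below builds a request URL from a typed subset of them and checks the documented rule that a `country_code` other than `US` requires `premium_proxy`. The helper `buildScrapeUrl` and the `ScrapeParams` type are hypothetical names used for illustration, and the case-insensitive comparison of the country code is an assumption.

```ts
// Subset of the ScrapingBee-style parameters listed above.
interface ScrapeParams {
    url: string;
    render_js?: boolean;
    premium_proxy?: boolean;
    country_code?: string;
    wait_for?: string;
    json_response?: boolean;
}

const BASE_URL = 'https://super-scraper-api.apify.actor/';

// Hypothetical helper: validates the documented country_code rule and
// serializes the parameters into a request URL.
function buildScrapeUrl(params: ScrapeParams): string {
    const { country_code: country, premium_proxy: premium } = params;
    if (country && country.toUpperCase() !== 'US' && !premium) {
        throw new Error('country_code other than US requires premium_proxy=true');
    }
    const target = new URL(BASE_URL);
    for (const [key, value] of Object.entries(params)) {
        if (value !== undefined) target.searchParams.set(key, String(value));
    }
    return target.toString();
}

const requestUrl = buildScrapeUrl({
    url: 'https://apify.com/store',
    render_js: false,
    premium_proxy: true,
    country_code: 'DE',
});
console.log(requestUrl);

// Sending the request (not executed here), with your own token:
// const resp = await fetch(requestUrl, { headers: { Authorization: `Bearer ${token}` } });
```

Using `URL` and `searchParams` takes care of percent-encoding the target URL, which is easy to get wrong when concatenating strings by hand.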

### ScrapingAnt API parameters

SuperScraper API supports most of the API parameters of [ScrapingAnt](https://docs.scrapingant.com/request-response-format#available-parameters).

ScrapingAnt's API parameter `x-api-key` is not supported.

Note that HTTP headers in a request to this Actor beginning with prefix `Ant-` will be forwarded (without the prefix) to the target webpage alongside headers generated by the Actor.
This behavior can be changed using ScrapingBee's `forward_headers` or `forward_headers_pure` parameters.
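
The prefix handling described above can be modeled locally. The sketch below is an illustration of the documented behavior, not the Actor's actual code: it keeps only headers that start with `Spb-` or `Ant-` and trims the prefix, which is what the target webpage would receive from your side. The function name `forwardedHeaders` is hypothetical.

```ts
// Prefixes named in the documentation for header forwarding (compared in lower case).
const FORWARD_PREFIXES = ['spb-', 'ant-'];

// Illustrative model: keep only prefixed headers and strip the prefix.
function forwardedHeaders(incoming: Record<string, string>): Record<string, string> {
    const result: Record<string, string> = {};
    for (const [name, value] of Object.entries(incoming)) {
        const lower = name.toLowerCase(); // HTTP header names are case-insensitive
        const prefix = FORWARD_PREFIXES.find((p) => lower.startsWith(p));
        if (prefix) result[name.slice(prefix.length)] = value;
    }
    return result;
}

// Headers you would send to the Actor.
const requestHeaders = {
    Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>',
    'Ant-Accept-Language': 'de-DE',
    'Ant-Referer': 'https://www.example.com/',
};

console.log(forwardedHeaders(requestHeaders));
// { 'Accept-Language': 'de-DE', Referer: 'https://www.example.com/' }
```

The `Authorization` header carries no prefix, so in this model it is not passed on to the target webpage.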

### ScraperAPI API parameters

SuperScraper API supports most of the API parameters of [ScraperAPI](https://docs.scraperapi.com/making-requests/customizing-requests).

ScraperAPI's API parameters `session_number` and `autoparse` are currently not supported, and they are ignored.

### Custom extraction rules

Using ScrapingBee's `extract_rules` parameter, you can specify a set of rules to extract specific data from the target web pages. You can create an extraction rule in one of two ways: with shortened options, or with full options.

#### Shortened options

- the value for the given key serves as the `selector`
- use `@` to access an attribute of the selected element

##### Example:

```json
{
    "title": "h1",
    "link": "a@href"
}
```
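
One way to read the shortened form is as shorthand for a rule with a `selector` and an optional attribute `output`. The sketch below shows that mapping under the assumption that the last `@` in the string separates the selector from the attribute name. `expandShorthand` and `ExpandedRule` are hypothetical names, and the Actor's own parsing may differ.

```ts
interface ExpandedRule {
    selector: string;
    output?: string; // for example '@href'; when omitted, the element's text is used
}

// Hypothetical helper: splits 'a@href' into a selector and an attribute output.
function expandShorthand(rule: string): ExpandedRule {
    const at = rule.lastIndexOf('@');
    if (at <= 0) return { selector: rule }; // no attribute part
    return { selector: rule.slice(0, at), output: rule.slice(at) };
}

console.log(expandShorthand('h1'));     // { selector: 'h1' }
console.log(expandShorthand('a@href')); // { selector: 'a', output: '@href' }
```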

#### Full options

- `selector` is required
- `type` can be either `item` (default) or `list`
- `output` indicates what the result for the element(s) will look like. It can be:
    - `text` (default option when `output` is omitted) - text of the element
    - `html` - HTML of the element
    - attribute name (starts with `@`, for example `@href`)
    - object with other extract rules for the given item (key + shortened or full options)
    - `table_json` or `table_array` to scrape a table in a JSON or array format
- `clean` - relevant when `output` is `text`, specifies whether whitespace should be trimmed from the text of the element (can be `true` or `false`, default `true`)

##### Example:

```json
{
    "custom key for links": {
        "selector": "a",
        "type": "list",
        "output": {
            "linkName": {
                "selector": "a",
                "clean": "false"
            },
            "href": {
                "selector": "a",
                "output": "@href"
            }
        }
    }
}
```
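
If you build rules in TypeScript, a type for them can catch misspelled option names before a request is sent. The types below are a sketch derived from the option list above, not types exported by the Actor. The names `Rule`, `FullRule`, `Output`, and `ExtractRules` are assumptions, and `clean` is assumed to be a boolean.

```ts
// Possible values of `output`, following the option list above.
type Output =
    | 'text'
    | 'html'
    | 'table_json'
    | 'table_array'
    | `@${string}`               // attribute name, for example '@href'
    | { [key: string]: Rule };   // nested rules for the given item

interface FullRule {
    selector: string;
    type?: 'item' | 'list';
    output?: Output;
    clean?: boolean;
}

// A rule is either the shortened form (a string) or the full form.
type Rule = string | FullRule;
type ExtractRules = Record<string, Rule>;

const rules: ExtractRules = {
    heading: 'h1',
    links: {
        selector: 'a',
        type: 'list',
        output: {
            text: 'a',
            href: { selector: 'a', output: '@href' },
        },
    },
};

// The API expects the rules as a stringified JSON in `extract_rules`.
const extractRulesParam = JSON.stringify(rules);
console.log(extractRulesParam);
```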

#### Example

This example extracts all links from [Apify Blog](https://blog.apify.com/) along with their titles.

```ts
import axios from 'axios';

const extractRules = {
    title: 'h1',
    allLinks: {
        selector: 'a',
        type: 'list',
        output: {
            title: 'a',
            link: 'a@href',
        },
    },
};

const resp = await axios.get('https://super-scraper-api.apify.actor/', {
    params: {
        url: 'https://blog.apify.com/',
        // The rules must be passed as a stringified JSON.
        extract_rules: JSON.stringify(extractRules),
        // verbose: true,
    },
    headers: {
        // Replace the placeholder with your own Apify API token.
        Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>',
    },
});

console.log(resp.data);
```

The results look like this:

```json
{
"title": "Apify Blog",
"allLinks": [
{
"title": "Data for generative AI & LLM",
"link": "https://apify.com/data-for-generative-ai"
},
{
"title": "Product matching AI",
"link": "https://apify.com/product-matching-ai"
},
{
"title": "Universal web scrapers",
"link": "https://apify.com/store/scrapers/universal-web-scrapers"
}
]
}
```

### Custom JavaScript code

Use ScrapingBee's `js_scenario` parameter to specify instructions that will be executed in order, one by one, after the page is opened.

Set `json_response` to `true` to get a full report of the executed instructions. The results of `evaluate` instructions will be added to the `evaluate_results` field.

Example of clicking a button:

```ts
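import axios from 'axios';

const instructions = {
    instructions: [
        { click: '#button' },
    ],
};

const resp = await axios.get('https://super-scraper-api.apify.actor/', {
    params: {
        url: 'https://www.example.com',
        // The scenario must be passed as a stringified JSON.
        js_scenario: JSON.stringify(instructions),
    },
    headers: {
        // Replace the placeholder with your own Apify API token.
        Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>',
    },
});

console.log(resp.data);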
```

#### Strict mode

If one instruction fails, then the subsequent instructions will not be executed. To disable this behavior, you can optionally set `strict` to `false` (by default it's `true`):

```json
{
    "instructions": [
        { "click": "#button1" },
        { "click": "#button2" }
    ],
    "strict": false
}
```
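
The difference between the two modes can be modeled locally. The sketch below is an illustrative model of the semantics described above, not the Actor's code: steps are plain synchronous functions, whereas real instructions run asynchronously in a browser. `runSteps`, `Step`, and `StepReport` are hypothetical names.

```ts
type Step = () => void;

interface StepReport {
    index: number;
    success: boolean;
}

// Illustrative model: run steps in order and record whether each one succeeded.
function runSteps(steps: Step[], strict = true): StepReport[] {
    const report: StepReport[] = [];
    for (let index = 0; index < steps.length; index++) {
        try {
            steps[index]();
            report.push({ index, success: true });
        } catch (err) {
            report.push({ index, success: false });
            if (strict) break; // strict mode: stop at the first failure
        }
    }
    return report;
}

const steps: Step[] = [
    () => {},                                        // succeeds
    () => { throw new Error('element not found'); }, // fails
    () => {},                                        // runs only when strict is false
];

console.log(runSteps(steps, true).length);  // 2: the third step never runs
console.log(runSteps(steps, false).length); // 3: all steps are attempted
```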

#### Supported instructions

##### `wait`

- wait for some time specified in ms
- example: `{"wait": 10000}`

##### `wait_for`

- wait for an element specified by the selector
- example: `{"wait_for": "#element"}`

##### `click`

- click on an element specified by the selector
- example: `{"click": "#button"}`

##### `wait_for_and_click`

- combination of the previous two instructions
- example: `{"wait_for_and_click": "#button"}`

##### `scroll_x` and `scroll_y`

- scroll a specified number of pixels horizontally or vertically
- example: `{"scroll_y": 1000}` or `{"scroll_x": 1000}`

##### `fill`

- specify a selector of the input element and the value you want to fill
- example: `{"fill": ["input_1", "value_1"]}`

##### `evaluate`

- evaluate custom JavaScript on the webpage
- text/number/object results will be saved in the `evaluate_results` field
- example: `{"evaluate":"document.querySelectorAll('a').length"}`
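
The instructions above can be combined into one scenario. The sketch below types them as a union so that the compiler rejects unknown instruction names, then serializes the scenario into query parameters. The `Instruction` and `JsScenario` types and the selectors `#search` and `#submit` are assumptions made for illustration.

```ts
// One variant per supported instruction listed above.
type Instruction =
    | { wait: number }
    | { wait_for: string }
    | { click: string }
    | { wait_for_and_click: string }
    | { scroll_x: number }
    | { scroll_y: number }
    | { fill: [selector: string, value: string] }
    | { evaluate: string };

interface JsScenario {
    instructions: Instruction[];
    strict?: boolean;
}

const scenario: JsScenario = {
    instructions: [
        { wait_for: '#search' },
        { fill: ['#search', 'web scraping'] },
        { click: '#submit' },
        { wait: 1000 },
        { scroll_y: 1000 },
        { evaluate: "document.querySelectorAll('a').length" },
    ],
    strict: false,
};

// json_response=true so that the result of `evaluate` is reported
// in the `evaluate_results` field of the response.
const params = new URLSearchParams({
    url: 'https://www.example.com',
    js_scenario: JSON.stringify(scenario),
    json_response: 'true',
});

console.log(`https://super-scraper-api.apify.actor/?${params.toString()}`);
```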
