https://github.com/apify/actor-crawler-cheerio

DEPRECATED: An actor that crawls websites and parses HTML pages using Cheerio library. Supports recursive crawling as well as URL lists.

# DEPRECATED: Apify Crawler Cheerio

Visit https://github.com/apifytech/actor-scraper/tree/master/cheerio-scraper for the current version.

- [How it works](#how-it-works)
- [Input](#input)
  * [Page function](#page-function)
- [Output](#output)
  * [Dataset](#dataset)
## How it works

Crawler Cheerio is a ready-made solution for crawling the web using plain HTTP requests to retrieve HTML pages
and then parsing and inspecting the HTML using the [Cheerio](https://www.npmjs.com/package/cheerio) NPM package.

Cheerio is a server-side version of the popular [jQuery](https://jquery.com) library. It does not run in the
browser; instead, it constructs a DOM out of an HTML string and provides the user with an API to work with that
DOM.

Crawler Cheerio is ideal for scraping websites that do not rely on client-side JavaScript to serve their content.
It can be as much as 20 times faster than using a full browser solution such as Puppeteer.

## Input
Input is provided via the pre-configured form. See the tooltips for more info on the available options.

### Page function
The page function lets you control the crawler's operation, manipulate the received HTML,
and extract data as needed. It is invoked with a `context` object containing the following properties:

```js
const context = {
    actorId, // ID of this actor.
    runId, // ID of the individual actor run.
    request, // Apify.Request object.
    response, // http.IncomingMessage object (Node.js server response).
    html, // The scraped HTML string.
    $, // Cheerio, with the HTML already loaded and ready to use.
    customData, // Value of the 'Custom data' Crawler option.
    requestList, // Reference to the run's default Apify.RequestList.
    requestQueue, // Reference to the run's default Apify.RequestQueue.
    dataset, // Reference to the run's default Apify.Dataset.
    keyValueStore, // Reference to the run's default Apify.KeyValueStore.
    input, // Unaltered original input as parsed from the UI.
    client, // Reference to an instance of Apify.client.
    log, // Reference to Apify.utils.log.

    // Utility functions that simplify some common tasks.
    // See https://www.apify.com/docs/crawler#pageFunction for docs.
    skipLinks,
    skipOutput,
    enqueuePage,
}
```
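
As an illustration of how the `context` properties might be used, here is a minimal sketch of a page function that reads the page title. The function body is hypothetical, and it assumes that the object returned from the page function is what ends up in the dataset, as the output example in this README suggests.

```js
// Hypothetical page function: extracts the text of the <title> element.
// Assumption: the returned object is stored as the result for this page.
function pageFunction(context) {
    const { $, request, log } = context;

    // `$` is Cheerio with the page HTML already loaded.
    const title = $('title').text().trim();

    log.info(`Scraped ${request.url}`);

    return { title };
}
```

Only `$`, `request` and `log` are used here; the remaining properties are available in the same way through destructuring.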

## Output

Output is a dataset containing extracted data for each scraped page.

### Dataset
For each of the scraped URLs, the dataset contains an object with results and some metadata.
If you were scraping the HTML `<title>` of [IANA](https://www.iana.org/), it would look like this:

```json
{
    "title": "Internet Assigned Numbers Authority",
    "#error": false,
    "#debug": {
        "url": "https://www.iana.org/",
        "method": "GET",
        "retryCount": 0,
        "errorMessages": null,
        "requestId": "e2Hd517QWfF4tVh"
    }
}
```

The metadata are prefixed with a `#`. Soon you will be able to exclude the metadata
from the results by providing an API flag.
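
Because the metadata keys share the `#` prefix, they can also be dropped on the client side after downloading the dataset items. The helper below is a minimal sketch of that idea; `stripMetadata` is a hypothetical name and is not part of the actor or the Apify SDK.

```js
// Hypothetical helper: returns a copy of a dataset item
// without the top-level keys that start with '#'.
function stripMetadata(item) {
    return Object.fromEntries(
        Object.entries(item).filter(([key]) => !key.startsWith('#')),
    );
}

// Example item shaped like the dataset output shown above.
const item = {
    title: 'Internet Assigned Numbers Authority',
    '#error': false,
    '#debug': { url: 'https://www.iana.org/', method: 'GET' },
};

const cleaned = stripMetadata(item);
```

The original item is left unmodified, so the metadata remains available for debugging failed requests.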