https://github.com/apify/actor-crawler-cheerio

DEPRECATED: An actor that crawls websites and parses HTML pages using Cheerio library. Supports recursive crawling as well as URL lists.

Host: GitHub
URL: https://github.com/apify/actor-crawler-cheerio
Owner: apify
License: apache-2.0
Created: 2018-06-07T13:48:23.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2022-07-25T12:39:13.000Z (almost 4 years ago)
Last Synced: 2025-02-16T16:19:52.991Z (over 1 year ago)
Language: JavaScript
Homepage: https://www.apify.com/apify/crawler-cheerio
Size: 136 KB
Stars: 1
Watchers: 4
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# DEPRECATED: Apify Crawler Cheerio

Visit https://github.com/apifytech/actor-scraper/tree/master/cheerio-scraper for the current version.

- [How it works](#how-it-works)
- [Input](#input)
  * [Page function](#page-function)
- [Output](#output)
  * [Dataset](#dataset)

## How it works

Crawler Cheerio is a ready-made solution for crawling the web using plain HTTP requests to retrieve HTML pages
and then parsing and inspecting the HTML using the [Cheerio](https://www.npmjs.com/package/cheerio) NPM package.

Cheerio is a server-side version of the popular [jQuery](https://jquery.com) library. It does not run in the
browser; instead, it constructs a DOM from an HTML string and provides an API for working with that
DOM.

Crawler Cheerio is ideal for scraping websites that do not rely on client-side JavaScript to serve their content.
It can be as much as 20 times faster than using a full browser solution such as Puppeteer.

## Input
Input is provided via the pre-configured form. See the tooltips for more info on the available options.

### Page function
The page function lets you control the Crawler's operation, manipulate the received HTML,
and extract data as needed. It is invoked with a `context` object containing the following properties:

```js
const context = {
    actorId, // ID of this actor.
    runId, // ID of the individual actor run.
    request, // Apify.Request object.
    response, // http.IncomingMessage object (Node.js server response).
    html, // The scraped HTML string.
    $, // Cheerio, with the HTML already loaded and ready to use.
    customData, // Value of the 'Custom data' Crawler option.
    requestList, // Reference to the run's default Apify.RequestList.
    requestQueue, // Reference to the run's default Apify.RequestQueue.
    dataset, // Reference to the run's default Apify.Dataset.
    keyValueStore, // Reference to the run's default Apify.KeyValueStore.
    input, // Unaltered original input as parsed from the UI.
    client, // Reference to an instance of the Apify.client.
    log, // Reference to Apify.utils.log.

    // Utility functions that simplify some common tasks.
    // See https://www.apify.com/docs/crawler#pageFunction for docs.
    skipLinks,
    skipOutput,
    enqueuePage,
};
```
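
As an illustration of how the `context` properties fit together, here is a minimal page function sketch. It assumes that the Crawler calls the function once per page and stores the returned object as that page's result; the name `pageFunction` and the returned `title` field are chosen for this example only. It uses `$` to read the page title and `log` to report progress.

```js
// A minimal sketch, not the actor's reference implementation.
// Assumption: the returned object is saved to the dataset for this page.
function pageFunction(context) {
    // Pick only the properties this example needs.
    const { request, $, log } = context;

    // `$` is Cheerio with the page's HTML already loaded,
    // so jQuery-style selectors work directly.
    const title = $('title').text().trim();

    log.info(`Scraped ${request.url}`);

    return { title };
}
```

Because the function only reads from `context`, it can be exercised outside the actor by passing in a hand-made object with the same property names.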

## Output

Output is a dataset containing extracted data for each scraped page.

### Dataset
For each of the scraped URLs, the dataset contains an object with results and some metadata.
If you were scraping the HTML `<title>` of [IANA](https://www.iana.org/), it would look like this:

```json
{
    "title": "Internet Assigned Numbers Authority",
    "#error": false,
    "#debug": {
        "url": "https://www.iana.org/",
        "method": "GET",
        "retryCount": 0,
        "errorMessages": null,
        "requestId": "e2Hd517QWfF4tVh"
    }
}
```
```

The metadata are prefixed with a `#`. Soon you will be able to exclude the metadata
from the results by providing an API flag.
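
Until such a flag is available, the `#` prefix makes it straightforward to separate metadata from results on the client side. The helper below is a sketch for that purpose; `splitRecord` is a hypothetical name and is not part of the actor or the Apify SDK.

```js
// Hypothetical helper: splits one dataset record into extracted data
// and metadata, relying only on the '#' key prefix described above.
function splitRecord(record) {
    const data = {};
    const metadata = {};
    for (const [key, value] of Object.entries(record)) {
        if (key.startsWith('#')) {
            metadata[key] = value;
        } else {
            data[key] = value;
        }
    }
    return { data, metadata };
}

// Usage with a record shaped like the IANA example.
const { data, metadata } = splitRecord({
    title: 'Internet Assigned Numbers Authority',
    '#error': false,
    '#debug': { url: 'https://www.iana.org/', method: 'GET' },
});
console.log(data); // only the non-prefixed fields
console.log(Object.keys(metadata)); // the '#'-prefixed keys
```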
