https://github.com/apify/actor-crawler-cheerio
DEPRECATED: An actor that crawls websites and parses HTML pages using Cheerio library. Supports recursive crawling as well as URL lists.
- Host: GitHub
- URL: https://github.com/apify/actor-crawler-cheerio
- Owner: apify
- License: apache-2.0
- Created: 2018-06-07T13:48:23.000Z
- Default Branch: master
- Last Pushed: 2022-07-25T12:39:13.000Z
- Last Synced: 2025-02-16T16:19:52.991Z
- Language: JavaScript
- Homepage: https://www.apify.com/apify/crawler-cheerio
- Size: 136 KB
- Stars: 1
- Watchers: 4
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DEPRECATED: Apify Crawler Cheerio
Visit https://github.com/apifytech/actor-scraper/tree/master/cheerio-scraper for the current version.
- [How it works](#how-it-works)
- [Input](#input)
  * [Page function](#page-function)
- [Output](#output)
  * [Dataset](#dataset)
## How it works
Crawler Cheerio is a ready-made solution for crawling the web using plain HTTP requests to retrieve HTML pages
and then parsing and inspecting the HTML using the [Cheerio](https://www.npmjs.com/package/cheerio) NPM package.
Cheerio is a server-side version of the popular [jQuery](https://jquery.com) library. It does not run in the
browser; instead, it constructs a DOM from an HTML string and provides an API for working with that
DOM.
Crawler Cheerio is ideal for scraping websites that do not rely on client-side JavaScript to serve their content.
It can be as much as 20 times faster than using a full browser solution such as Puppeteer.
## Input
Input is provided via the pre-configured form. See the tooltips for more info on the available options.
### Page function
The page function lets you control the crawler's operation, manipulate the received HTML,
and extract data as needed. It is invoked with a `context` object containing the following properties:
```js
const context = {
    actorId, // ID of this actor.
    runId, // ID of the individual actor run.
    request, // Apify.Request object.
    response, // http.IncomingMessage object (Node.js server response).
    html, // The scraped HTML string.
    $, // Cheerio, with the HTML already loaded and ready to use.
    customData, // Value of the 'Custom data' Crawler option.
    requestList, // Reference to the run's default Apify.RequestList.
    requestQueue, // Reference to the run's default Apify.RequestQueue.
    dataset, // Reference to the run's default Apify.Dataset.
    keyValueStore, // Reference to the run's default Apify.KeyValueStore.
    input, // Unaltered original input as parsed from the UI.
    client, // Reference to an instance of the Apify.client.
    log, // Reference to Apify.utils.log.

    // Utility functions that simplify some common tasks.
    // See https://www.apify.com/docs/crawler#pageFunction for docs.
    skipLinks,
    skipOutput,
    enqueuePage,
}
```
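As an illustration of how these properties might be used, here is a minimal sketch of a page function that reads the page title through `$`. It assumes that the object returned from the page function is what gets stored as the page's result in the dataset. This README does not spell that out, so treat it as an assumption rather than the actor's documented contract.

```js
// A sketch only: assumes the returned object becomes the page's result
// in the dataset, which this README does not explicitly confirm.
function pageFunction(context) {
    const { $, request, log } = context;

    // `$` is Cheerio with the page's HTML already loaded,
    // so jQuery-style selectors work directly.
    const title = $('title').text().trim();

    log.info(`Scraped ${request.url}`);

    return { title };
}
```

Only `$`, `request` and `log` are used here; the remaining `context` properties are available in the same way.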
## Output
Output is a dataset containing extracted data for each scraped page.
### Dataset
For each of the scraped URLs, the dataset contains an object with results and some metadata.
If you were scraping the HTML `<title>` of [IANA](https://www.iana.org/) it would look like this:
```json
{
    "title": "Internet Assigned Numbers Authority",
    "#error": false,
    "#debug": {
        "url": "https://www.iana.org/",
        "method": "GET",
        "retryCount": 0,
        "errorMessages": null,
        "requestId": "e2Hd517QWfF4tVh"
    }
}
```
The metadata are prefixed with a `#`. Soon you will be able to exclude the metadata
from the results by providing an API flag.
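Because the metadata keys all start with `#`, they can also be filtered out on the client side after downloading the results. The sketch below shows one way to do that; `stripMetadata` is a hypothetical helper name and not part of the actor or the Apify SDK.

```js
// Hypothetical helper: returns a copy of a dataset item without the
// '#'-prefixed metadata fields. The original item is left untouched.
function stripMetadata(item) {
    return Object.fromEntries(
        Object.entries(item).filter(([key]) => !key.startsWith('#')),
    );
}

// Sample item shaped like the dataset example above.
const item = {
    title: 'Internet Assigned Numbers Authority',
    '#error': false,
    '#debug': { url: 'https://www.iana.org/', method: 'GET', retryCount: 0 },
};

// Logs an object containing only the `title` field.
console.log(stripMetadata(item));
```

The filter only inspects top-level keys, which matches the example above, where all metadata sits at the top level of the item.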