
[![scrape-it](https://i.imgur.com/j3Z0rbN.png)](#)

# scrape-it

[![Support me on Patreon][badge_patreon]][patreon] [![Buy me a book][badge_amazon]][amazon] [![PayPal][badge_paypal_donate]][paypal-donations] [![Ask me anything](https://img.shields.io/badge/ask%20me-anything-1abc9c.svg)](https://github.com/IonicaBizau/ama) [![Travis](https://img.shields.io/travis/IonicaBizau/scrape-it.svg)](https://travis-ci.org/IonicaBizau/scrape-it/) [![Version](https://img.shields.io/npm/v/scrape-it.svg)](https://www.npmjs.com/package/scrape-it) [![Downloads](https://img.shields.io/npm/dt/scrape-it.svg)](https://www.npmjs.com/package/scrape-it) [![Get help on Codementor](https://cdn.codementor.io/badges/get_help_github.svg)](https://www.codementor.io/@johnnyb?utm_source=github&utm_medium=button&utm_term=johnnyb&utm_campaign=github)

> A Node.js scraper for humans.

----


Sponsored with :heart: by:


[![scrapeless services](https://assets.scrapeless.com/prod/posts/scrapeless-web-scraping-toolkit/770dae4cc5b004b6d262480625a47225.png)](https://www.scrapeless.com/en?utm_source=scrape-it)

[Scrapeless](http://scrapeless.com/?utm_source=scrape-it) – Easy web scraping toolkit for businesses and developers

⚡ [Scraping Browser](https://www.scrapeless.com/en/product/scraping-browser?utm_source=scrape-it):

1. Web browsing capabilities for AI agents and applications:
   - Collect data at scale for agents without being blocked
   - Simulate user behavior using advanced browser tools
   - Build agent applications with real-time and historical web data
2. Unlock any scale with unlimited parallel jobs
3. High-performance web unlocking built directly into the browser
4. Compatible with Puppeteer and Playwright

⚡ [Deep SerpApi](https://www.scrapeless.com/en/product/deep-serp-api?utm_source=scrape-it): One-click access to Google search data, supporting 15+ SERP scenarios (academic, Google Store, Maps, and more) at $0.10 per thousand queries with ~0.2 s responses. Scrapeless has also officially launched an [MCP Server](https://github.com/scrapeless-ai/scrapeless-mcp-server), which helps large language models fetch the latest data and keep results accurate.

⚡ [Scraping API](https://www.scrapeless.com/en/product/scraping-api?utm_source=scrape-it): Easily obtain public content from platforms such as TikTok, Shopee, Amazon, and Walmart. Ready-to-use structured data covers 8+ verticals, including e-commerce and social media, and you are billed only for successful calls.

⚡ [Universal Scraping API](https://www.scrapeless.com/en/product/universal-scraping-api?utm_source=scrape-it): Intelligent IP rotation combined with real user fingerprints, with success rates of up to 99%. No more worrying about network blocks and scraping obstacles.

⚠️ Exclusive for open-source projects: submit your repo link to apply for 100,000 free Deep SerpApi queries!

📌 [Try it now](https://app.scrapeless.com/passport/login?utm_source=scrape-it) | [Documentation](https://docs.scrapeless.com/en/scraping-browser/quickstart/introduction/?utm_source=scrape-it)

## :cloud: Installation

```sh
# Using npm
npm install --save scrape-it

# Using yarn
yarn add scrape-it
```

:bulb: **ProTip**: You can install the [CLI version of this module](https://github.com/IonicaBizau/scrape-it-cli) by running `npm install --global scrape-it-cli` (or `yarn global add scrape-it-cli`).

## FAQ

Here are some frequent questions and their answers.

### 1. How to scrape AJAX pages?

`scrape-it` ships with only a simple request module for fetching pages. That means you cannot scrape AJAX-rendered pages with it directly, but in general you will run into one of these scenarios:

1. **The AJAX response is in JSON format.** In this case, you can make the request directly, without needing a scraping library.
2. **The AJAX response gives you HTML back.** Instead of calling the main website (e.g. example.com), pass `scrape-it` the AJAX URL (e.g. `example.com/api/that-endpoint`) and you will be able to parse the response (see the sketch below).
3. **The AJAX request is so complicated that you don't want to reverse-engineer it.** In this case, use a headless browser (e.g. Google Chrome, Electron, PhantomJS) to load the content, then pass the rendered HTML to the `.scrapeHTML` method from `scrape-it`.
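
For scenario 2, here is a minimal sketch; the endpoint URL and selectors are hypothetical:

```js
const scrapeIt = require("scrape-it")

// Hypothetical AJAX endpoint that returns an HTML fragment
scrapeIt("https://example.com/api/that-endpoint", {
    results: {
        listItem: ".result"      // hypothetical selector
      , data: {
            title: "h3"
        }
    }
}).then(({ data }) => {
    console.log(data.results)
})
```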

### 2. Crawling

There is no built-in way to crawl pages with `scrape-it`. For simple scenarios, you can scrape the list of URLs from the initial page and then, using Promises, scrape each page in turn (see the sketch below). Alternatively, you can use a dedicated crawler to download the website and then use the `.scrapeHTML` method to scrape the local files.
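
A minimal crawling sketch along these lines, assuming a hypothetical index page whose article links point to detail pages (all URLs and selectors are illustrative):

```js
const scrapeIt = require("scrape-it")

// 1. Scrape the article links from the index page (hypothetical selectors)
scrapeIt("https://example.com", {
    links: {
        listItem: "a.article-link"
      , data: {
            // Omitting the selector reads the attribute of the list item itself
            url: { attr: "href" }
        }
    }
}).then(({ data }) =>
    // 2. Scrape every linked page in parallel
    Promise.all(data.links.map(({ url }) =>
        scrapeIt(`https://example.com${url}`, { title: "h1" })
    ))
).then(pages => {
    console.log(pages.map(({ data }) => data.title))
})
```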

### 3. Local files

Use the `.scrapeHTML` method to parse the HTML read from local files using `fs.readFile`.
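
A minimal sketch, assuming a local `page.html` file; it loads the markup with `cheerio` (which `scrape-it` builds on), and the selector is hypothetical:

```js
const fs = require("fs")
const cheerio = require("cheerio")
const scrapeIt = require("scrape-it")

fs.readFile("page.html", "utf8", (err, html) => {
    if (err) { throw err }

    // Load the local HTML into Cheerio, then scrape it offline
    const $ = cheerio.load(html)
    const data = scrapeIt.scrapeHTML($, {
        title: ".header h1"      // hypothetical selector
    })
    console.log(data)
})
```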

## :clipboard: Example

```js
const scrapeIt = require("scrape-it")

// Promise interface
scrapeIt("https://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(({ data, status }) => {
    console.log(`Status Code: ${status}`)
    console.log(data)
});

// Async-Await
(async () => {
    const { data } = await scrapeIt("https://ionicabizau.net", {
        // Fetch the articles
        articles: {
            listItem: ".article"
          , data: {

                // Get the article date and convert it into a Date object
                createdAt: {
                    selector: ".date"
                  , convert: x => new Date(x)
                }

                // Get the title
              , title: "a.article-title"

                // Nested list
              , tags: {
                    listItem: ".tags > span"
                }

                // Get the content
              , content: {
                    selector: ".article-content"
                  , how: "html"
                }

                // Get attribute value of root listItem by omitting the selector
              , classes: {
                    attr: "class"
                }
            }
        }

        // Fetch the blog pages
      , pages: {
            listItem: "li.page"
          , name: "pages"
          , data: {
                title: "a"
              , url: {
                    selector: "a"
                  , attr: "href"
                }
            }
        }

        // Fetch some other data from the page
      , title: ".header h1"
      , desc: ".header h2"
      , avatar: {
            selector: ".header img"
          , attr: "src"
        }
    })
    console.log(data)
    // { articles:
    //    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
    //        title: 'Pi Day, Raspberry Pi and Command Line',
    //        tags: [Object],
    //        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n',
    //        classes: [Object] },
    //      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
    //        title: 'How I ported Memory Blocks to modern web',
    //        tags: [Object],
    //        content: '<p>Playing computer games is a lot of fun. ...',
    //        classes: [Object] },
    //      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
    //        title: 'How to convert JSON to Markdown using json2md',
    //        tags: [Object],
    //        content: '<p>I love and ...',
    //        classes: [Object] } ],
    //   pages:
    //    [ { title: 'Blog', url: '/' },
    //      { title: 'About', url: '/about' },
    //      { title: 'FAQ', url: '/faq' },
    //      { title: 'Training', url: '/training' },
    //      { title: 'Contact', url: '/contact' } ],
    //   title: 'Ionică Bizău',
    //   desc: 'Web Developer, Linux geek and Musician',
    //   avatar: '/images/logo.png' }
})()
```

## :memo: Documentation

### `scrapeIt(url, opts, cb)`
A scraping module for humans.

#### Params

- **String|Object** `url`: The page URL or request options.
- **Object** `opts`: The options passed to `scrapeHTML` method.
- **Function** `cb`: The callback function.

#### Return
- **Promise** A promise object resolving with:
   - `data` (Object): The scraped data.
   - `$` (Function): The Cheerio function. This may be handy to do some other manipulation on the DOM, if needed.
   - `response` (Object): The response object.
   - `body` (String): The raw body as a string.
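
For instance, a minimal sketch that uses the resolved `$` and `body` alongside `data` (the selector is hypothetical):

```js
const scrapeIt = require("scrape-it")

scrapeIt("https://ionicabizau.net", {
    title: ".header h1"
}).then(({ data, $, body }) => {
    console.log(data.title)          // scraped data
    console.log($("title").text())   // extra DOM queries via the Cheerio function
    console.log(body.length)         // size of the raw HTML body
})
```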

### `scrapeIt.scrapeHTML($, opts)`
Scrapes the data in the provided element.

For the format of the selector, please refer to the [Selectors section of the Cheerio library](https://github.com/cheeriojs/cheerio#-selector-context-root-)

#### Params

- **Cheerio** `$`: The input element.
- **Object** `opts`: An object containing the scraping information.
  If you want to scrape a list, you have to use the `listItem` selector:

   - `listItem` (String): The list item selector.
   - `data` (Object): The fields to include in the list objects:
      - *any field name* (Object|String): The selector or an object containing:
         - `selector` (String): The selector.
         - `convert` (Function): An optional function to change the value.
         - `how` (Function|String): A function or function name to access the value.
         - `attr` (String): If provided, the value will be taken based on the attribute name.
         - `trimValue` (Boolean): If `false`, the value will *not* be trimmed (default: `true`).
         - `closest` (String): If provided, returns the first ancestor of the given element.
         - `eq` (Number): If provided, it will select the *nth* element.
         - `texteq` (Number): If provided, it will select the *nth* direct text child. Deep text child selection is not possible yet. Overwrites the `how` key.
         - `listItem` (Object): An object keeping the recursive schema of the `listItem` object. This can be used to create nested lists (see the nested-list sketch after the example below).

**Example**:
```js
{
    articles: {
        listItem: ".article"
      , data: {
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
          , title: "a.article-title"
          , tags: {
                listItem: ".tags > span"
            }
          , content: {
                selector: ".article-content"
              , how: "html"
            }
          , traverseOtherNode: {
                selector: ".upperNode"
              , closest: "div"
              , convert: x => x.length
            }
        }
    }
}
```
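
Because the inner `listItem` keeps the recursive schema, you can nest lists. A minimal sketch with hypothetical selectors:

```js
{
    sections: {
        listItem: ".section"
      , data: {
            heading: "h2"
            // Each section carries its own nested list of links
          , links: {
                listItem: "a"
              , data: {
                    // Omitting the selector reads the attribute of the list item itself
                    url: { attr: "href" }
                }
            }
        }
    }
}
```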

If you want to collect specific data from the page, just use the same
schema used for the `data` field.

**Example**:
```js
{
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}
```

#### Return
- **Object** The scraped data.

## :question: Get Help

There are a few ways to get help:

1. Please [post questions on Stack Overflow](https://stackoverflow.com/questions/ask). You can open issues with questions, as long as you add a link to your Stack Overflow question.
2. For bug reports and feature requests, open issues. :bug:
3. For direct and quick help, you can [use Codementor](https://www.codementor.io/johnnyb). :rocket:

## :yum: How to contribute
Have an idea? Found a bug? See [how to contribute][contributing].

## :sparkling_heart: Support my projects
I open-source almost everything I can, and I try to reply to everyone needing help using these projects. Obviously, this takes time. You can integrate and use these projects in your applications *for free*! You can even change the source code and redistribute it (even resell it).

However, if you get some profit from this or just want to encourage me to continue creating stuff, there are a few ways you can do it:

- Starring and sharing the projects you like :rocket:
- [![Buy me a book][badge_amazon]][amazon]—I love books! I will remember you after years if you buy me one. :grin: :book:
- [![PayPal][badge_paypal]][paypal-donations]—You can make one-time donations via PayPal. I'll probably buy a ~~coffee~~ tea. :tea:
- [![Support me on Patreon][badge_patreon]][patreon]—Set up a recurring monthly donation and you will get interesting news about what I'm doing (things that I don't share with everyone).
- **Bitcoin**—You can send me bitcoins at this address (or by scanning the code below): `1P9BRsmazNQcuyTxEqveUsnf5CERdq35V6`

![](https://i.imgur.com/z6OQI95.png)

Thanks! :heart:

## :dizzy: Where is this library used?
If you are using this library in one of your projects, add it in this list. :sparkles:

- `3abn`
- `@alexjorgef/bandcamp-scraper`
- `@ben-wormald/bandcamp-scraper`
- `@bogochunas/package-shopify-crawler`
- `@lukekarrys/ebp`
- `@markab.io/node-api`
- `@thetrg/gibson`
- `@tryghost/mg-webscraper`
- `@web-master/node-web-scraper`
- `@zougui/furaffinity`
- `airport-cluj`
- `apixpress`
- `bandcamp-scraper`
- `beervana-scraper`
- `bible-scraper`
- `blankningsregistret`
- `blockchain-notifier`
- `brave-search-scraper`
- `camaleon`
- `carirs`
- `cevo-lookup`
- `cnn-market`
- `codementor`
- `codinglove-scraper`
- `covidau`
- `degusta-scrapper`
- `dncli`
- `egg-crawler`
- `fa.js`
- `flamescraper`
- `fmgo-marketdata`
- `gatsby-source-bandcamp`
- `growapi`
- `helyesiras`
- `jishon`
- `jobs-fetcher`
- `leximaven`
- `macoolka-net-scrape`
- `macoolka-network`
- `mersul-microbuzelor`
- `mersul-trenurilor`
- `mit-ocw-scraper`
- `mix-dl`
- `node-red-contrib-getdata-website`
- `node-red-contrib-scrape-it`
- `nurlresolver`
- `paklek-cli`
- `parn`
- `picarto-lib`
- `rayko-tools`
- `rs-api`
- `sahibinden`
- `sahibindenServer`
- `salesforcerelease-parser`
- `scrape-it-cli`
- `scrape-vinmonopolet`
- `scrapemyferry`
- `scrapos-worker`
- `sgdq-collector`
- `simple-ai-alpha`
- `spon-market`
- `startpage-quick-search`
- `steam-workshop-scraper`
- `trump-cabinet-picks`
- `u-pull-it-ne-parts-finder`
- `ubersetzung`
- `ui-studentsearch`
- `university-news-notifier`
- `uniwue-lernplaetze-scraper`
- `vandalen.rhyme.js`
- `wikitools`
- `yu-ncov-scrape-dxy`

## :scroll: License

[MIT][license] © [Ionică Bizău][website]

[license]: /LICENSE
[website]: https://ionicabizau.net
[contributing]: /CONTRIBUTING.md
[docs]: /DOCUMENTATION.md
[badge_patreon]: https://ionicabizau.github.io/badges/patreon.svg
[badge_amazon]: https://ionicabizau.github.io/badges/amazon.svg
[badge_paypal]: https://ionicabizau.github.io/badges/paypal.svg
[badge_paypal_donate]: https://ionicabizau.github.io/badges/paypal_donate.svg
[patreon]: https://www.patreon.com/ionicabizau
[amazon]: http://amzn.eu/hRo9sIZ
[paypal-donations]: https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=RVXDDLKKLQRJW