Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/danielnieto/scrapman

Retrieve real (JavaScript-executed) HTML code from a URL, ultra fast, with support for loading multiple pages in parallel

electron javascript javascript-tools scrap scraper scraping scraping-websites

Last synced: 4 months ago


README

# Scrapman

> *Ski-bi dibby dib yo da dub dub*
>
> *Yo da dub dub*
>
> *Ski-bi dibby dib yo da dub dub*
>
> *Yo da dub dub*


***I'm the Scrapman!***

### THE FASTEST SCRAPER EVER\*... AND IT SUPPORTS PARALLEL REQUESTS (\*arguably)

Scrapman is a blazingly fast **real (JavaScript-executed)** HTML scraper, built from the ground up to support parallel fetches. With it you can get the HTML code for 50+ URLs in roughly 30 seconds.

On Node.js you can easily use `request` to fetch the HTML of a page, but what if the page you are trying to load is *NOT* static HTML and instead has dynamic content added with JavaScript? What do you do then? Well, you use ***The Scrapman***.
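For contrast, here is a minimal sketch of the static approach mentioned above, using the `request` package (the URL is illustrative). `request` hands you the HTML exactly as the server sent it, before any client-side JavaScript has run, which is the limitation Scrapman works around.

```javascript
// Static fetch with the `request` package: `body` is the HTML as served,
// BEFORE any client-side JavaScript has run, so script-injected content
// will be missing from it.
var request = require("request");

request("https://www.website.com/page/1", function (error, response, body) {
    if (error) {
        return console.error(error);
    }
    console.log(body); // raw, un-rendered HTML
});
```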

It uses [Electron](http://electron.atom.io) to dynamically load web pages into several `<webview>` elements within a single Chromium instance. This is why it fetches the HTML exactly as you would see it if you inspected the page with DevTools.

This is **NOT** a browser automation tool (yet); it's a Node module that gives you the processed HTML from a URL, focusing on multiple parallel operations and speed.

## USAGE

1.- Install it

`npm install scrapman -S`

2.- Require it

`var scrapman = require("scrapman");`

3.- Use it (as many times as you need)

Single URL request

```javascript
scrapman.load("http://google.com", function(results){
//results contains the HTML obtained from the url
console.log(results);
});
```
Parallel URL requests

```javascript
// Yes, you can call it inside a loop.
for (var i = 1; i <= 50; i++) {
    scrapman.load("https://www.website.com/page/" + i, function (results) {
        console.log(results);
    });
}
```
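The documented callback receives only `results`, so when firing many loads in parallel it can help to keep track of which HTML belongs to which URL yourself. Below is a minimal sketch built only on the documented `scrapman.load(url, callback)` API; the `pages`, `collected`, and `pending` names are illustrative, not part of Scrapman.

```javascript
// Collect parallel results and match each one back to its URL.
var scrapman = require("scrapman");

var pages = [];
for (var i = 1; i <= 50; i++) {
    pages.push("https://www.website.com/page/" + i);
}

var collected = {};
var pending = pages.length;

pages.forEach(function (url) {
    scrapman.load(url, function (results) {
        collected[url] = results; // remember which HTML belongs to which URL
        pending -= 1;
        if (pending === 0) {
            // collected now maps every URL to its rendered HTML
            console.log("All " + pages.length + " pages loaded");
        }
    });
});
```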

## API

### - scrapman.load(url, callback)

#### url
Type: `String`

The URL from which the HTML code is going to be obtained.

#### callback(results)
Type: `Function`

The callback function to be executed when the loading is done. The loaded HTML will be in the `results` parameter.

### - scrapman.configure(config)

#### config
The configuration object can set the following values (a usage sketch follows the list):

* `maxConcurrentOperations`: Integer - How many URLs can be loaded at the same time. Default: 50

* `wait`: Integer - The number of milliseconds to wait after a page has completely loaded before returning its HTML. Default: 0
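A minimal sketch of how the two documented options might be set; the values shown are illustrative, not recommendations.

```javascript
var scrapman = require("scrapman");

scrapman.configure({
    maxConcurrentOperations: 20, // load at most 20 URLs at the same time
    wait: 500                    // wait 500 ms after each page finishes loading before returning its HTML
});

// Subsequent scrapman.load() calls use this configuration.
scrapman.load("http://google.com", function (results) {
    console.log(results.length + " characters of rendered HTML");
});
```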

## Questions
Feel free to open issues to ask questions about using this package. PRs are very welcome and encouraged.

**SPANISH SPOKEN HERE (SE HABLA ESPAÑOL)**

## License

MIT © Daniel Nieto