Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/danielnieto/scrapman
Retrieve real (with Javascript executed) HTML code from an URL, ultra fast and supports multiple parallel loading of webs
- Host: GitHub
- URL: https://github.com/danielnieto/scrapman
- Owner: danielnieto
- License: mit
- Created: 2016-10-14T03:23:02.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-04-13T18:54:33.000Z (almost 7 years ago)
- Last Synced: 2024-10-18T10:34:40.394Z (4 months ago)
- Topics: electron, javascript, javascript-tools, scrap, scraper, scraping, scraping-websites
- Language: JavaScript
- Homepage:
- Size: 35.2 KB
- Stars: 22
- Watchers: 4
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Scrapman
>*Ski-bi dibby dib yo da dub dub*
>*Yo da dub dub*
>*Ski-bi dibby dib yo da dub dub*
>*Yo da dub dub*
>***I'm the Scrapman!***

### THE FASTEST SCRAPER EVER\*... AND IT SUPPORTS PARALLEL REQUESTS (\*arguably)
Scrapman is a blazingly fast **real (with JavaScript executed)** HTML scraper, built from the ground up to support parallel fetches: it can retrieve the rendered HTML for 50+ URLs in roughly 30 seconds.
In Node.js you can easily use `request` to fetch the HTML of a page, but what if the page you are trying to load is *not* a static HTML page, and its content is added dynamically with JavaScript? What do you do then? Well, you use ***The Scrapman***.
It uses [Electron](http://electron.atom.io) to dynamically load web pages into several `<webview>` elements within a single Chromium instance. This is why it fetches the HTML exactly as you would see it if you inspected the page with DevTools.
This is **NOT** a browser automation tool (yet); it's a Node module that gives you the processed HTML from a URL, focused on multiple parallel operations and speed.
## USAGE
1.- Install it
`npm install scrapman -S`
2.- Require it
`var scrapman = require("scrapman");`
3.- Use it (as many times as you need)
Single URL request
```javascript
scrapman.load("http://google.com", function (results) {
    // results contains the HTML obtained from the URL
    console.log(results);
});
```
Parallel URL requests

```javascript
// yes, you can use it within a loop
for (var i = 1; i <= 50; i++) {
    scrapman.load("https://www.website.com/page/" + i, function (results) {
        console.log(results);
    });
}
```

## API
### - scrapman.load(url, callback)

#### url

Type: `String`

The URL from which the HTML code will be obtained.

#### callback(results)

Type: `Function`

The callback function executed when loading is done. The loaded HTML is passed in the `results` parameter.
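Because each `load` callback fires whenever its page finishes, parallel results can arrive out of order. A common pattern is to store each result by its index and act once all are in. The sketch below illustrates that pattern; `fakeLoad` is a hypothetical stand-in for `scrapman.load` so the snippet runs without Electron installed.

```javascript
// Stand-in for scrapman.load: completes after a random delay,
// so callbacks fire out of order, just like real parallel fetches.
function fakeLoad(url, callback) {
  setTimeout(function () {
    callback("<html>" + url + "</html>");
  }, Math.random() * 20);
}

var urls = [
  "https://example.com/page/1",
  "https://example.com/page/2",
  "https://example.com/page/3"
];
var results = new Array(urls.length);
var pending = urls.length;

urls.forEach(function (url, i) {
  fakeLoad(url, function (html) {
    results[i] = html;          // store by index, not by arrival order
    if (--pending === 0) {
      console.log(results);     // all pages done, in original URL order
    }
  });
});
```

With the real module you would replace `fakeLoad` with `scrapman.load`; the indexing trick is what keeps the output ordered.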
### - scrapman.configure(config)
#### config
The configuration object can set the following values:

* `maxConcurrentOperations`: Integer - How many URLs can be loaded at the same time (the intensity of processing). Default: 50
* `wait`: Integer - The number of milliseconds to wait before returning the HTML of a webpage after it has completely loaded. Default: 0
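The idea behind `maxConcurrentOperations` can be pictured as a task queue that never runs more than a fixed number of jobs at once. The sketch below illustrates the concept only; `makeLimiter` is a hypothetical helper, not Scrapman's actual internals.

```javascript
// A minimal concurrency limiter: at most `limit` tasks run at a time;
// the rest wait in a FIFO queue until a slot frees up.
function makeLimiter(limit) {
  var active = 0;
  var queue = [];

  function next() {
    if (active >= limit || queue.length === 0) return;
    active++;
    var task = queue.shift();
    task(function done() {
      active--;   // free the slot, then start the next queued task
      next();
    });
  }

  return function enqueue(task) {
    queue.push(task);
    next();
  };
}

// Usage: queue 5 "page loads", but only 2 ever run simultaneously.
var run = makeLimiter(2);
var order = [];
for (var i = 1; i <= 5; i++) {
  (function (n) {
    run(function (done) {
      order.push(n);            // record when each task actually starts
      setTimeout(done, 10);     // simulate a 10 ms page load
    });
  })(i);
}
```

Raising the limit increases throughput at the cost of more simultaneous Chromium page loads, which is the trade-off `maxConcurrentOperations` exposes.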
## Questions
Feel free to open issues to ask questions about using this package. PRs are very welcome and encouraged.

**Spanish spoken** (*se habla español*)
## License
MIT © Daniel Nieto