Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/danielnieto/scrapman
Retrieve real (with Javascript executed) HTML code from an URL, ultra fast and supports multiple parallel loading of webs
- Host: GitHub
- URL: https://github.com/danielnieto/scrapman
- Owner: danielnieto
- License: mit
- Created: 2016-10-14T03:23:02.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-04-13T18:54:33.000Z (almost 7 years ago)
- Last Synced: 2024-10-18T10:34:40.394Z (4 months ago)
- Topics: electron, javascript, javascript-tools, scrap, scraper, scraping, scraping-websites
- Language: JavaScript
- Homepage:
- Size: 35.2 KB
- Stars: 22
- Watchers: 4
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Scrapman
>*Ski-bi dibby dib yo da dub dub*
>*Yo da dub dub*
>*Ski-bi dibby dib yo da dub dub*
>*Yo da dub dub*
>***I'm the Scrapman!***

### THE FASTEST SCRAPER EVER\*... AND IT SUPPORTS PARALLEL REQUESTS (\*arguably)
Scrapman is a blazingly fast **real (with JavaScript executed)** HTML scraper, built from the ground up to support parallel fetches: it can retrieve the rendered HTML for 50+ URLs in roughly 30 seconds.
In Node.js you can easily use `request` to fetch the HTML of a page, but what if the page you are trying to load is *not* a static HTML page, and its content is added dynamically with JavaScript? What do you do then? Well, you use ***The Scrapman***.
It uses [Electron](http://electron.atom.io) to dynamically load web pages into several `<webview>` elements within a single Chromium instance. This is why it fetches the HTML exactly as you would see it if you inspected the page with DevTools.
This is **NOT** a browser automation tool (yet); it's a Node module that gives you the processed HTML from a URL, focused on multiple parallel operations and speed.
## USAGE
1.- Install it
`npm install scrapman -S`
2.- Require it
`var scrapman = require("scrapman");`
3.- Use it (as many times as you need)
Single URL request
```javascript
scrapman.load("http://google.com", function (results) {
    // results contains the HTML obtained from the URL
    console.log(results);
});
```
Parallel URL requests

```javascript
// yes, you can use it within a loop
for (var i = 1; i <= 50; i++) {
    scrapman.load("https://www.website.com/page/" + i, function (results) {
        console.log(results);
    });
}
```

## API
### - scrapman.load(url, callback)

#### url

Type: `String`

The URL from which the HTML code will be obtained.

#### callback(results)

Type: `Function`

The callback function executed when loading is done. The loaded HTML is passed in the `results` parameter.
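Because each `load` callback fires whenever its page finishes, parallel results can arrive out of order. A common pattern is to store each result by its index and act once all are in. The sketch below illustrates that pattern; `fakeLoad` is a hypothetical stand-in for `scrapman.load` so the snippet runs without Electron installed.

```javascript
// Stand-in for scrapman.load: completes after a random delay,
// so callbacks fire out of order, just like real parallel fetches.
function fakeLoad(url, callback) {
  setTimeout(function () {
    callback("<html>" + url + "</html>");
  }, Math.random() * 20);
}

var urls = [
  "https://example.com/page/1",
  "https://example.com/page/2",
  "https://example.com/page/3"
];
var results = new Array(urls.length);
var pending = urls.length;

urls.forEach(function (url, i) {
  fakeLoad(url, function (html) {
    results[i] = html;          // store by index, not by arrival order
    if (--pending === 0) {
      console.log(results);     // all pages done, in original URL order
    }
  });
});
```

With the real module you would replace `fakeLoad` with `scrapman.load`; the indexing trick is what keeps the output ordered.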
### - scrapman.configure(config)
#### config
The configuration object can set the following values:

* `maxConcurrentOperations`: Integer - How many URLs can be loaded at the same time (the intensity of processing). Default: 50
* `wait`: Integer - The number of milliseconds to wait before returning the HTML of a webpage after it has completely loaded. Default: 0
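The idea behind `maxConcurrentOperations` can be pictured as a task queue that never runs more than a fixed number of jobs at once. The sketch below illustrates the concept only; `makeLimiter` is a hypothetical helper, not Scrapman's actual internals.

```javascript
// A minimal concurrency limiter: at most `limit` tasks run at a time;
// the rest wait in a FIFO queue until a slot frees up.
function makeLimiter(limit) {
  var active = 0;
  var queue = [];

  function next() {
    if (active >= limit || queue.length === 0) return;
    active++;
    var task = queue.shift();
    task(function done() {
      active--;   // free the slot, then start the next queued task
      next();
    });
  }

  return function enqueue(task) {
    queue.push(task);
    next();
  };
}

// Usage: queue 5 "page loads", but only 2 ever run simultaneously.
var run = makeLimiter(2);
var order = [];
for (var i = 1; i <= 5; i++) {
  (function (n) {
    run(function (done) {
      order.push(n);            // record when each task actually starts
      setTimeout(done, 10);     // simulate a 10 ms page load
    });
  })(i);
}
```

Raising the limit increases throughput at the cost of more simultaneous Chromium page loads, which is the trade-off `maxConcurrentOperations` exposes.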
## Questions
Feel free to open issues to ask questions about using this package. PRs are very welcome and encouraged.

**Spanish spoken** (*se habla español*)
## License
MIT © Daniel Nieto