https://github.com/marcomontalbano/html-miner
A powerful miner that will scrape html pages for you. ` HTML Scraper ´
coverage html-scraper istanbul mocha nodejs npm-package nyc scraper
- Host: GitHub
- URL: https://github.com/marcomontalbano/html-miner
- Owner: marcomontalbano
- License: mit
- Created: 2017-08-12T20:38:43.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-04-28T18:30:26.000Z (7 months ago)
- Last Synced: 2024-10-05T23:14:53.870Z (about 1 month ago)
- Topics: coverage, html-scraper, istanbul, mocha, nodejs, npm-package, nyc, scraper
- Language: JavaScript
- Homepage: https://marcomontalbano.github.io/html-miner
- Size: 2.62 MB
- Stars: 4
- Watchers: 4
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
HTML Miner
==========

[![Npm](https://img.shields.io/npm/v/html-miner.svg)](https://www.npmjs.com/package/html-miner)
[![Build Status](https://travis-ci.org/marcomontalbano/html-miner.svg?branch=master)](https://travis-ci.org/marcomontalbano/html-miner)
[![Coverage Status](https://coveralls.io/repos/github/marcomontalbano/html-miner/badge.svg?branch=master)](https://coveralls.io/github/marcomontalbano/html-miner?branch=master)
[![Code Climate](https://codeclimate.com/github/marcomontalbano/html-miner/badges/gpa.svg)](https://codeclimate.com/github/marcomontalbano/html-miner)
[![Issue Count](https://codeclimate.com/github/marcomontalbano/html-miner/badges/issue_count.svg)](https://codeclimate.com/github/marcomontalbano/html-miner/issues)

A powerful miner that will scrape html pages for you.
## Install
[![NPM](https://nodei.co/npm/html-miner.svg)](https://nodei.co/npm/html-miner/)
```sh
# using npm
npm i --save html-miner

# using yarn
yarn add html-miner
```

## Example
I collected common use cases in a dedicated [EXAMPLE.md](./EXAMPLE.md). Feel free to start from the **Usage** section or jump directly to the **Example** page.
If you want to experiment, an [online playground](https://marcomontalbano.github.io/html-miner) is also available.
:green_book: Enjoy your reading
## Usage
### Arguments
`html-miner` accepts two arguments: `html` and `selector`.
```js
const htmlMiner = require('html-miner');

// htmlMiner(html, selector);
```

#### HTML
_html_ is a string containing HTML code.
```js
let html = '<div class="title">Hello <span>Marco</span>!</div>';
```

#### SELECTOR
_selector_ can be one of the following:
`STRING`
```js
htmlMiner(html, '.title');
//=> Hello Marco!
```

If the selector matches more than one element, the result is an array:

```js
let htmlWithDivs = '<div>Element 1</div><div>Element 2</div>';
htmlMiner(htmlWithDivs, 'div');
//=> ['Element 1', 'Element 2']
```

`FUNCTION`

See the [Function in detail](#function-in-detail) section.
```js
htmlMiner(html, () => 'Hello everyone!');
//=> Hello everyone!

htmlMiner(html, function () {
    return 'Hello everyone!';
});
//=> Hello everyone!
```

`ARRAY`
```js
htmlMiner(html, ['.title', 'span']);
//=> ['Hello Marco!', 'Marco']
```

`OBJECT`
```js
htmlMiner(html, {
    title: '.title',
    who: 'span'
});
//=> {
//     title: 'Hello Marco!',
//     who: 'Marco'
//   }
```

You can combine `array` and `object` with each other, or with strings and functions.
```js
htmlMiner(html, {
    title: '.title',
    who: '.title span',
    upper: (arg) => { return arg.scopeData.who.toUpperCase(); }
});
//=> {
//     title: 'Hello Marco!',
//     who: 'Marco',
//     upper: 'MARCO'
//   }
```

### Function in detail
A `function` selector receives a single argument, an `object` containing:

- `$`: a jQuery-like function bound to the document (the `html` argument). Use it to query and fetch elements from the HTML.
```js
htmlMiner(html, arg => arg.$('.title').text());
//=> Hello Marco!
```

- `$scope`: useful when combined with `_each_` or `_container_` (see the [Special keys](#special-keys) section).
```js
htmlMiner(html, {
    title: '.title',
    spanList: {
        _each_: 'span',
        value: (arg) => {
            // The scope is the current span, so "arg.$scope.find('.title')" matches nothing.
            return arg.$scope.text();
        }
    }
});
//=> {
//     title: 'Hello Marco!',
//     spanList: [{
//         value: 'Marco'
//     }]
//   }
```

- `globalData`: an object that contains all **previously** fetched data.
```js
htmlMiner(html, {
    title: '.title',
    spanList: {
        _each_: '.title span',
        pageTitle: function(arg) {
            // "arg.globalData.who" is undefined here because "who" is defined later.
            return arg.globalData.title;
        }
    },
    who: '.title span'
});
//=> {
//     title: 'Hello Marco!',
//     spanList: [{
//         pageTitle: 'Hello Marco!'
//     }],
//     who: 'Marco'
//   }
```

- `scopeData`: similar to `globalData`, but contains only the data of the current scope. Useful when combined with `_each_` (see the [Special keys](#special-keys) section).
```js
htmlMiner(html, {
    title: '.title',
    upper: (arg) => { return arg.scopeData.title.toUpperCase(); },
    sublist: {
        who: '.title span',
        upper: (arg) => {
            // "arg.scopeData.title" is undefined because "title" is out of scope.
            return arg.scopeData.who.toUpperCase();
        },
    }
});
//=> {
//     title: 'Hello Marco!',
//     upper: 'HELLO MARCO!',
//     sublist: {
//         who: 'Marco',
//         upper: 'MARCO'
//     }
//   }
```

### Special keys
When the selector is an `object`, you can use _special keys_:

- `_each_`: creates a list of items. HTML Miner iterates over the elements matched by its value and parses the sibling keys for each of them.
```js
{
    articles: {
        _each_: '.articles .article',
        title: 'h2',
        content: 'p',
    }
}
```

- `_eachId_`: useful when combined with `_each_`. Instead of an Array, it creates an Object whose keys are the results of the `_eachId_` function.
```js
{
    articles: {
        _each_: '.articles .article',
        _eachId_: function(arg) {
            return arg.$scope.data('id');
        },
        title: 'h2',
        content: 'p',
    }
}
```

- `_container_`: uses the parsed value as a container. HTML Miner parses the sibling keys, searching for them inside the _container_.
```js
{
    footer: {
        _container_: 'footer',
        copyright: (arg) => { return arg.$scope.text().trim(); },
        company: 'span' // find only 'span' inside 'footer'.
    }
}
```

For more details, see the following [example](#lets-try-this-out).
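
As a rough illustration of how `_each_` alone and `_each_` combined with `_eachId_` differ in the shape of what they return, here is a minimal sketch in plain JavaScript. It models only the result shape and does not use or reproduce html-miner's internals: `collect`, `items`, and `idOf` are hypothetical names, and plain objects stand in for the matched elements.

```js
// Hypothetical model of the result shape only; not html-miner's implementation.
// `items` stands in for the elements matched by `_each_`,
// `idOf` stands in for the optional `_eachId_` function.
function collect(items, idOf) {
    if (typeof idOf !== 'function') {
        // `_each_` alone: one entry per matched element, in document order.
        return items.map((item) => ({ title: item.title }));
    }

    // `_each_` + `_eachId_`: entries keyed by the id computed for each element.
    const result = {};
    for (const item of items) {
        result[idOf(item)] = { title: item.title };
    }
    return result;
}

const items = [
    { id: 'a001', title: 'Heading 1' },
    { id: 'a002', title: 'Heading 2' },
];

const asArray = collect(items);
const asObject = collect(items, (item) => item.id);

console.log(asArray);
console.log(asObject);
```

The array form preserves order and suits rendering lists; the keyed form suits direct lookup by id.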
## Let's try this out
Consider the following HTML snippet, from which we will fetch some information.
```html
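<h1>Hello, <span>world</span>!</h1>
<div class="articles">
    <div class="article" data-id="a001">
        <h2>Heading 1</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
    </div>
    <div class="article" data-id="a002">
        <h2>Heading 2</h2>
        <p>Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.</p>
    </div>
    <div class="article" data-id="a003">
        <h2>Heading 3</h2>
        <p>Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.</p>
    </div>
</div>
<footer>© <span>Company</span> 2017</footer>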
```
```js
const htmlMiner = require('html-miner');

let json = htmlMiner(html, {
    title: 'h1',
    who: 'h1 span',
    h2: 'h2',
    articlesArray: {
        _each_: '.articles .article',
        title: 'h2',
        content: 'p',
    },
    articlesObject: {
        _each_: '.articles .article',
        _eachId_: function(arg) {
            return arg.$scope.data('id');
        },
        title: 'h2',
        content: 'p',
    },
    footer: {
        _container_: 'footer',
        copyright: (arg) => { return arg.$scope.text().trim(); },
        company: 'span',
        year: (arg) => { return arg.scopeData.copyright.match(/[0-9]+/)[0]; },
    },
    greet: () => { return 'Hi!'; }
});

console.log( json );
//=> {
//     title: 'Hello, world!',
//     who: 'world',
//     h2: ['Heading 1', 'Heading 2', 'Heading 3'],
//     articlesArray: [
//         {
//             title: 'Heading 1',
//             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
//         },
//         {
//             title: 'Heading 2',
//             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
//         },
//         {
//             title: 'Heading 3',
//             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
//         }
//     ],
//     articlesObject: {
//         'a001': {
//             title: 'Heading 1',
//             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
//         },
//         'a002': {
//             title: 'Heading 2',
//             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
//         },
//         'a003': {
//             title: 'Heading 3',
//             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
//         }
//     },
//     footer: {
//         copyright: '© Company 2017',
//         company: 'Company',
//         year: '2017'
//     },
//     greet: 'Hi!'
//   }
```
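
The `year` function above indexes the result of `match` directly. `String.prototype.match` returns `null` when nothing matches, so that expression throws a `TypeError` if the copyright text contains no digits. A minimal sketch of a more defensive helper follows; `firstNumber` is a hypothetical name introduced here, not part of html-miner.

```js
// Hypothetical helper (not part of html-miner): returns the first run of
// digits found in `text` as a string, or null when there is none.
function firstNumber(text) {
    const match = String(text).match(/[0-9]+/);
    return match ? match[0] : null;
}

console.log(firstNumber('© Company 2017')); //=> 2017
console.log(firstNumber('© Company'));      //=> null
```

It could then be used in the selector as `year: (arg) => firstNumber(arg.scopeData.copyright)`.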
You can find other examples in the `/examples` folder.
```sh
# you can test examples with nodejs
node examples/demo.js
node examples/site.js
```

## Development
```sh
npm install
npm test

# start the playground locally
npm start
```