https://github.com/marcomontalbano/html-miner

HTML Miner
==========

[![Npm](https://img.shields.io/npm/v/html-miner.svg)](https://www.npmjs.com/package/html-miner)
[![Build Status](https://travis-ci.org/marcomontalbano/html-miner.svg?branch=master)](https://travis-ci.org/marcomontalbano/html-miner)
[![Coverage Status](https://coveralls.io/repos/github/marcomontalbano/html-miner/badge.svg?branch=master)](https://coveralls.io/github/marcomontalbano/html-miner?branch=master)
[![Code Climate](https://codeclimate.com/github/marcomontalbano/html-miner/badges/gpa.svg)](https://codeclimate.com/github/marcomontalbano/html-miner)
[![Issue Count](https://codeclimate.com/github/marcomontalbano/html-miner/badges/issue_count.svg)](https://codeclimate.com/github/marcomontalbano/html-miner/issues)

A powerful miner that will scrape html pages for you.

## Install

[![NPM](https://nodei.co/npm/html-miner.svg)](https://nodei.co/npm/html-miner/)

```sh
# using npm
npm i --save html-miner

# using yarn
yarn add html-miner
```

## Example

I have collected common use cases in a dedicated [EXAMPLE.md](./EXAMPLE.md). Feel free to start from the **Usage** section or jump directly to the **Example** page.

If you want to experiment, an [online playground](https://marcomontalbano.github.io/html-miner) is also available.

:green_book: Enjoy your reading

## Usage

### Arguments

`html-miner` accepts two arguments: `html` and `selector`.

```js
const htmlMiner = require('html-miner');

// htmlMiner(html, selector);
```

#### HTML

_html_ is a string containing HTML code.

```js
let html = '<div class="title">Hello <span>Marco</span>!</div>';
```
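Because `html` is only a string, it can come from a file, a test fixture, or an HTTP response. The sketch below assumes Node 18+ (where `fetch` is available globally). `mineUrl` is a hypothetical helper name, not part of html-miner, and the miner function is passed in as a parameter so the sketch does not depend on how the package is loaded.

```js
// Hypothetical helper: downloads a page and hands its HTML to a miner function.
// `miner` is expected to have the shape (html, selector) => result, like html-miner.
async function mineUrl(url, selector, miner, fetchImpl = globalThis.fetch) {
    const response = await fetchImpl(url);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    const html = await response.text();
    return miner(html, selector);
}

// Offline demonstration with stand-ins, so nothing is actually downloaded.
const fakeFetch = async () => ({ ok: true, status: 200, text: async () => '<p>Hi</p>' });
const echoMiner = (html, selector) => ({ html, selector });
mineUrl('https://example.com', 'p', echoMiner, fakeFetch).then(console.log);
```

With the real package, the call would presumably look like `mineUrl(url, '.title', require('html-miner'))`.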

#### SELECTOR

_selector_ could be:

`STRING`

```js
htmlMiner(html, '.title');
//=> Hello Marco!
```

If the selector matches more than one element, the result is an array:

```js
let htmlWithDivs = '<div>Element 1</div><div>Element 2</div>';
htmlMiner(htmlWithDivs, 'div');
//=> ['Element 1', 'Element 2']
```
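Since a string selector can yield either a single string or an array, calling code may want one consistent shape. The helper below is a minimal sketch; `toArray` is a hypothetical name, and treating `undefined`/`null` as an empty list is an assumption of this sketch, not documented html-miner behaviour.

```js
// Hypothetical helper: always returns an array, whatever shape the result has.
function toArray(value) {
    // Assumption: treat "nothing" as an empty list.
    if (value === undefined || value === null) return [];
    return Array.isArray(value) ? value : [value];
}

console.log(toArray('Hello Marco!'));             //=> [ 'Hello Marco!' ]
console.log(toArray(['Element 1', 'Element 2'])); //=> [ 'Element 1', 'Element 2' ]
```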

`FUNCTION`

See the [function in detail](#function-in-detail) section.

```js
htmlMiner(html, () => 'Hello everyone!');
//=> Hello everyone!

htmlMiner(html, function () {
    return 'Hello everyone!';
});
//=> Hello everyone!
```

`ARRAY`

```js
htmlMiner(html, ['.title', 'span']);
//=> ['Hello Marco!', 'Marco']
```

`OBJECT`

```js
htmlMiner(html, {
    title: '.title',
    who: 'span'
});
//=> {
//     title: 'Hello Marco!',
//     who: 'Marco'
//   }
```

You can combine `array` and `object` with each other, or with strings and functions.

```js
htmlMiner(html, {
    title: '.title',
    who: '.title span',
    upper: (arg) => { return arg.scopeData.who.toUpperCase(); }
});
//=> {
//     title: 'Hello Marco!',
//     who: 'Marco',
//     upper: 'MARCO'
//   }
```
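As a mental model for the four selector shapes, here is a minimal sketch of a resolver that dispatches on the selector's type. It is not html-miner's actual implementation: `resolve` and `fakeDocument` are hypothetical names, and a plain object stands in for real DOM querying so the example runs without any HTML parsing.

```js
// Hypothetical stand-in for DOM querying: maps a selector string to extracted text.
const fakeDocument = { '.title': 'Hello Marco!', 'span': 'Marco' };

// Illustrative resolver (NOT html-miner's source): dispatches on the selector's type.
function resolve(selector) {
    if (typeof selector === 'string') return fakeDocument[selector];
    if (typeof selector === 'function') return selector();
    if (Array.isArray(selector)) return selector.map((item) => resolve(item));
    // Anything else is treated as a plain object whose values are selectors themselves.
    const resolved = {};
    for (const [key, value] of Object.entries(selector)) {
        resolved[key] = resolve(value);
    }
    return resolved;
}

const mixed = resolve({
    title: '.title',
    parts: ['.title', 'span'],
    nested: { who: 'span', greet: () => 'Hi!' },
});
console.log(JSON.stringify(mixed));
//=> {"title":"Hello Marco!","parts":["Hello Marco!","Marco"],"nested":{"who":"Marco","greet":"Hi!"}}
```

The recursion is what lets arrays and objects nest to any depth: each value is resolved with the same rules as the top-level selector.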

### Function in detail

A `function` receives a single argument, an `object` containing:

- `$`: a jQuery-like function pointing to the document (the `html` argument). You can use it to query and fetch elements from the HTML.

```js
htmlMiner(html, arg => arg.$('.title').text());
//=> Hello Marco!
```

- `$scope`: useful when combined with `_each_` or `_container_` (see the [special keys](#special-keys) section).

```js
htmlMiner(html, {
    title: '.title',
    spanList: {
        _each_: 'span',
        value: (arg) => {
            // "arg.$scope.find('.title')" matches nothing: ".title" is outside the current "span" scope.
            return arg.$scope.text();
        }
    }
});
//=> {
//     title: 'Hello Marco!',
//     spanList: [{
//         value: 'Marco'
//     }]
//   }
```

- `globalData`: an object that contains all **previously** fetched data.

```js
htmlMiner(html, {
    title: '.title',
    spanList: {
        _each_: '.title span',
        pageTitle: function(arg) {
            // "arg.globalData.who" is undefined here because "who" is defined later.
            return arg.globalData.title;
        }
    },
    who: '.title span'
});
//=> {
//     title: 'Hello Marco!',
//     spanList: [{
//         pageTitle: 'Hello Marco!'
//     }],
//     who: 'Marco'
//   }
```

- `scopeData`: similar to `globalData`, but contains only the data fetched in the current scope. Useful when combined with `_each_` (see the [special keys](#special-keys) section).

```js
htmlMiner(html, {
    title: '.title',
    upper: (arg) => { return arg.scopeData.title.toUpperCase(); },
    sublist: {
        who: '.title span',
        upper: (arg) => {
            // "arg.scopeData.title" is undefined because "title" is out of scope.
            return arg.scopeData.who.toUpperCase();
        },
    }
});
//=> {
//     title: 'Hello Marco!',
//     upper: 'HELLO MARCO!',
//     sublist: {
//         who: 'Marco',
//         upper: 'MARCO'
//     }
//   }
```
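The visibility rules for `globalData` and `scopeData` described above can be modelled in plain JavaScript. This is an illustrative sketch only, not html-miner's source: `evaluate` is a hypothetical function, and string values stand in for text that would already have been extracted from the page.

```js
// Illustrative model only (NOT html-miner's source). String values stand in for
// already-extracted text, so no HTML parsing is involved.
function evaluate(spec, globalData) {
    const scopeData = {};
    // At the top level the global object and the scope object are one and the same.
    const root = globalData || scopeData;
    for (const [key, value] of Object.entries(spec)) {
        if (typeof value === 'function') {
            scopeData[key] = value({ globalData: root, scopeData });
        } else if (typeof value === 'object' && value !== null) {
            scopeData[key] = evaluate(value, root); // a nested object opens a new scope
        } else {
            scopeData[key] = value;
        }
    }
    return scopeData;
}

const modelled = evaluate({
    title: 'Hello Marco!',
    sublist: {
        who: 'Marco',
        fromGlobal: (arg) => arg.globalData.title, // earlier top-level key: visible
        fromScope: (arg) => arg.scopeData.title,   // "title" is out of scope: undefined
        notYet: (arg) => arg.globalData.later,     // defined later: undefined
    },
    later: 'defined later',
});
console.log(modelled.sublist);
```

Keys are visited in definition order, which is why a function can only read values declared before it.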

### Special keys

When the selector is an `object`, you can use _special keys_:

- `_each_`: creates a list of items. HTML Miner iterates over the elements matched by the value and parses the sibling keys for each one.

```js
{
    articles: {
        _each_: '.articles .article',
        title: 'h2',
        content: 'p',
    }
}
```

- `_eachId_`: useful when combined with `_each_`. Instead of creating an Array, it creates an Object whose keys are the values returned by the `_eachId_` function.

```js
{
    articles: {
        _each_: '.articles .article',
        _eachId_: function(arg) {
            return arg.$scope.data('id');
        },
        title: 'h2',
        content: 'p',
    }
}
```

- `_container_`: uses the parsed value as a container. HTML Miner parses the sibling keys, searching for them inside the _container_.

```js
{
    footer: {
        _container_: 'footer',
        copyright: (arg) => { return arg.$scope.text().trim(); },
        company: 'span' // find only 'span' inside 'footer'.
    }
}
```
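The difference between `_each_` alone and `_each_` combined with `_eachId_` is only the shape of the result: a list versus an object keyed by id. The plain-JavaScript sketch below illustrates that relationship without using html-miner; `keyBy` is a hypothetical helper and the `ids` array stands in for the values an `_eachId_` function would return.

```js
// Hypothetical helper: turns a list into an object keyed by `getId(item, index)`.
function keyBy(list, getId) {
    const keyed = {};
    list.forEach((item, index) => {
        keyed[getId(item, index)] = item;
    });
    return keyed;
}

// The shape an `_each_` selector produces: one entry per matched element.
const asArray = [
    { title: 'Heading 1', content: 'First paragraph.' },
    { title: 'Heading 2', content: 'Second paragraph.' },
];
// Stand-in ids; with html-miner these would come from the `_eachId_` function.
const ids = ['a001', 'a002'];
const asObject = keyBy(asArray, (item, index) => ids[index]);
console.log(Object.keys(asObject)); //=> [ 'a001', 'a002' ]
```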

For more details see the following [example](#lets-try-this-out).

## Let's try this out

Consider the following HTML snippet, from which we will fetch some information.

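```html
<h1>Hello, <span>world</span>!</h1>
<div class="articles">
    <div class="article" data-id="a001">
        <h2>Heading 1</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
    </div>
    <div class="article" data-id="a002">
        <h2>Heading 2</h2>
        <p>Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.</p>
    </div>
    <div class="article" data-id="a003">
        <h2>Heading 3</h2>
        <p>Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.</p>
    </div>
</div>
<footer>
    © <span>Company</span> 2017
</footer>
```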

```js
const htmlMiner = require('html-miner');

let json = htmlMiner(html, {
    title: 'h1',
    who: 'h1 span',
    h2: 'h2',
    articlesArray: {
        _each_: '.articles .article',
        title: 'h2',
        content: 'p',
    },
    articlesObject: {
        _each_: '.articles .article',
        _eachId_: function(arg) {
            return arg.$scope.data('id');
        },
        title: 'h2',
        content: 'p',
    },
    footer: {
        _container_: 'footer',
        copyright: (arg) => { return arg.$scope.text().trim(); },
        company: 'span',
        year: (arg) => { return arg.scopeData.copyright.match(/[0-9]+/)[0]; },
    },
    greet: () => { return 'Hi!'; }
});

console.log( json );

//=> {
//     title: 'Hello, world!',
//     who: 'world',
//     h2: ['Heading 1', 'Heading 2', 'Heading 3'],
//     articlesArray: [
//         {
//             title: 'Heading 1',
//             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
//         },
//         {
//             title: 'Heading 2',
//             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
//         },
//         {
//             title: 'Heading 3',
//             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
//         }
//     ],
//     articlesObject: {
//         'a001': {
//             title: 'Heading 1',
//             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
//         },
//         'a002': {
//             title: 'Heading 2',
//             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
//         },
//         'a003': {
//             title: 'Heading 3',
//             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
//         }
//     },
//     footer: {
//         copyright: '© Company 2017',
//         company: 'Company',
//         year: '2017'
//     },
//     greet: 'Hi!'
//   }

```
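One fragile spot in the `year` function above: `String.prototype.match` returns `null` when the text contains no digits, so indexing the result with `[0]` would throw a `TypeError`. A more defensive variant might look like the sketch below; `extractYear` is a hypothetical helper name, not part of html-miner.

```js
// Hypothetical helper: returns the first run of digits in `text`, or null if there is none.
function extractYear(text) {
    const found = String(text).match(/[0-9]+/);
    return found ? found[0] : null;
}

console.log(extractYear('© Company 2017')); //=> 2017
console.log(extractYear('© Company'));      //=> null
```

It could then be used inside the selector as `year: (arg) => extractYear(arg.scopeData.copyright)`.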

You can find other examples in the `/examples` folder.

```sh
# you can test examples with nodejs
node examples/demo.js
node examples/site.js
```

## Development

```sh
npm install
npm test

# start the playground locally
npm start
```