{"id":15398430,"url":"https://github.com/marcomontalbano/html-miner","last_synced_at":"2025-04-16T01:23:29.412Z","repository":{"id":23902427,"uuid":"100136689","full_name":"marcomontalbano/html-miner","owner":"marcomontalbano","description":"A powerful miner that will scrape html pages for you. ` HTML Scraper ´","archived":false,"fork":false,"pushed_at":"2024-04-28T18:30:26.000Z","size":2751,"stargazers_count":4,"open_issues_count":2,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-13T00:46:53.024Z","etag":null,"topics":["coverage","html-scraper","istanbul","mocha","nodejs","npm-package","nyc","scraper"],"latest_commit_sha":null,"homepage":"https://marcomontalbano.github.io/html-miner","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/marcomontalbano.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-12T20:38:43.000Z","updated_at":"2024-05-12T08:31:03.000Z","dependencies_parsed_at":"2024-04-28T19:33:21.899Z","dependency_job_id":"ed499bb5-5252-4562-b5e8-1643dfb2bd19","html_url":"https://github.com/marcomontalbano/html-miner","commit_stats":{"total_commits":104,"total_committers":2,"mean_commits":52.0,"dds":"0.038461538461538436","last_synced_commit":"8683b4f3dd22bd6e456ce181cac470263a750419"},"previous_names":[],"tags_count":24,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcomontalbano%2Fhtml-miner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcomon
talbano%2Fhtml-miner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcomontalbano%2Fhtml-miner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcomontalbano%2Fhtml-miner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/marcomontalbano","download_url":"https://codeload.github.com/marcomontalbano/html-miner/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249179714,"owners_count":21225587,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["coverage","html-scraper","istanbul","mocha","nodejs","npm-package","nyc","scraper"],"created_at":"2024-10-01T15:43:43.464Z","updated_at":"2025-04-16T01:23:29.394Z","avatar_url":"https://github.com/marcomontalbano.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"HTML Miner\n==========\n\n[![Npm](https://img.shields.io/npm/v/html-miner.svg)](https://www.npmjs.com/package/html-miner)\n[![Build Status](https://travis-ci.org/marcomontalbano/html-miner.svg?branch=master)](https://travis-ci.org/marcomontalbano/html-miner)\n[![Coverage Status](https://coveralls.io/repos/github/marcomontalbano/html-miner/badge.svg?branch=master)](https://coveralls.io/github/marcomontalbano/html-miner?branch=master)\n[![Code Climate](https://codeclimate.com/github/marcomontalbano/html-miner/badges/gpa.svg)](https://codeclimate.com/github/marcomontalbano/html-miner)\n[![Issue 
Count](https://codeclimate.com/github/marcomontalbano/html-miner/badges/issue_count.svg)](https://codeclimate.com/github/marcomontalbano/html-miner/issues)\n\nA powerful miner that will scrape html pages for you.\n\n## Install\n\n[![NPM](https://nodei.co/npm/html-miner.svg)](https://nodei.co/npm/html-miner/)\n\n```sh\n# using npm\nnpm i --save html-miner\n\n# using yarn\nyarn add html-miner\n```\n\n## Example\n\nI decided to collect common use cases inside a dedicated [EXAMPLE.md](./EXAMPLE.md). Feel free to start from **Usage** section or jump directly to **Example** page.\n\nIf you want to experiment, an [online playground](https://marcomontalbano.github.io/html-miner) is also available.\n\n\n:green_book: Enjoy your reading\n\n## Usage\n\n### Arguments\n\n`html-miner` accepts two arguments: `html` and `selector`.\n\n```js\nconst htmlMiner = require('html-miner');\n\n// htmlMiner(html, selector);\n```\n\n#### HTML\n\n_html_ is a string and contains `html` code.\n\n```js\nlet html = '\u003cdiv class=\"title\"\u003eHello \u003cspan\u003eMarco\u003c/span\u003e!\u003c/div\u003e';\n```\n\n#### SELECTOR\n\n_selector_ could be:\n\n`STRING`\n\n```js\nhtmlMiner(html, '.title');\n//=\u003e Hello Marco!\n```\n\nIf the selector extracts more elements, the result is an array:\n\n```js\nlet htmlWithDivs = '\u003cdiv\u003eElement 1\u003c/div\u003e\u003cdiv\u003eElement 2\u003c/div\u003e';\nhtmlMiner(htmlWithDivs, 'div');\n//=\u003e ['Element 1', 'Element 2']\n```\n\n`FUNCTION`\n\nRead [function in detail](#function-in-detail) paragraph.\n\n```js\nhtmlMiner(html, () =\u003e 'Hello everyone!');\n//=\u003e Hello everyone!\n\nhtmlMiner(html, function () {\n    return 'Hello everyone!'\n});\n//=\u003e Hello everyone!\n```\n\n`ARRAY`\n\n```js\nhtmlMiner(html, ['.title', 'span']);\n//=\u003e ['Hello Marco!', 'Marco']\n```\n\n`OBJECT`\n\n```js\nhtmlMiner(html, {\n    title: '.title',\n    who: 'span'\n});\n//=\u003e {\n//     title: 'Hello Marco!',\n//     who: 'Marco'\n//   
}\n```\n\nYou can combine `array` and `object` with each other or with strings and functions.\n\n```js\nhtmlMiner(html, {\n    title: '.title',\n    who: '.title span',\n    upper: (arg) =\u003e { return arg.scopeData.who.toUpperCase(); }\n});\n//=\u003e {\n//     title: 'Hello Marco!',\n//     who: 'Marco',\n//     upper: 'MARCO'\n//   }\n```\n\n\n### Function in detail\n\nA `function` accepts only one argument that is an `object` containing:\n\n- `$`: is a jQuery-like function pointing to the document (the `html` argument). You can use it to query and fetch elements from the html.\n\n    ```js\n    htmlMiner(html, arg =\u003e arg.$('.title').text());\n    //=\u003e Hello Marco!\n    ```\n\n- `$scope`: useful when combined with `_each_` or `_container_` (read [special keys](#special-keys) paragraph).\n\n    ```js\n    htmlMiner(html, {\n        title: '.title',\n        spanList: {\n            _each_: 'span',\n            value: (arg) =\u003e {\n                // \"arg.$scope.find('.title')\" doesn't exist.\n                return arg.$scope.text();\n            }\n        }\n    });\n    //=\u003e {\n    //     title: 'Hello Marco!',\n    //     spanList: [{\n    //         value: 'Marco'\n    //     }]\n    //   }\n    ```\n\n- `globalData`: is an object that contains all **previously** fetched data.\n\n    ```js\n    htmlMiner(html, {\n        title: '.title',\n        spanList: {\n            _each_: '.title span',\n            pageTitle: function(arg) {\n                // \"arg.globalData.who\" is undefined because \"who\" is defined later.\n                return arg.globalData.title;\n            }\n        },\n        who: '.title span'\n    });\n    //=\u003e {\n    //     title: 'Hello Marco!',\n    //     spanList: [{\n    //         pageTitle: 'Hello Marco!'\n    //     }],\n    //     who: 'Marco'\n    //   }\n    ```\n\n- `scopeData`: similar to `globalData`, but only contains scope data. 
Useful when combined with `_each_` (read [special keys](#special-keys) paragraph).\n\n    ```js\n    htmlMiner(html, {\n        title: '.title',\n        upper: (arg) =\u003e { return arg.scopeData.title.toUpperCase(); },\n        sublist: {\n            who: '.title span',\n            upper: (arg) =\u003e {\n                // \"arg.scopeData.title\" is undefined because \"title\" is out of scope.\n                return arg.scopeData.who.toUpperCase();\n            },\n        }\n    });\n    //=\u003e {\n    //     title: 'Hello Marco!',\n    //     upper: 'HELLO MARCO!',\n    //     sublist: {\n    //         who: 'Marco',\n    //         upper: 'MARCO'\n    //     }\n    //   }\n    ```\n\n\n### Special keys\n\nWhen selector is an `object`, you can use _special keys_: \n\n- `_each_`: creates a list of items. HTML Miner iterates over the matched elements and parses the sibling keys for each one.\n\n    ```js\n    {\n        articles: {\n            _each_: '.articles .article',\n            title: 'h2',\n            content: 'p',\n        }\n    }\n    ```\n\n- `_eachId_`: useful when combined with `_each_`. Instead of creating an Array, it creates an Object whose keys are the results of the `_eachId_` function.\n\n    ```js\n    {\n        articles: {\n            _each_: '.articles .article',\n            _eachId_: function(arg) {\n                return arg.$scope.data('id');\n            },\n            title: 'h2',\n            content: 'p',\n        }\n    }\n    ```\n\n- `_container_`: uses the parsed value as a container. 
HTML Miner will parse the sibling keys, searching for them inside the _container_.\n\n    ```js\n    {\n        footer: {\n            _container_: 'footer',\n            copyright: (arg) =\u003e { return arg.$scope.text().trim(); },\n            company: 'span' // find only 'span' inside 'footer'.\n        }\n    }\n    ```\n\nFor more details see the following [example](#lets-try-this-out).\n\n\n## Let's try this out\n\nConsider the following html snippet: we will try and fetch some information.\n\n```html\n\u003ch1\u003eHello, \u003cspan\u003eworld\u003c/span\u003e!\u003c/h1\u003e\n\u003cdiv class=\"articles\"\u003e\n    \u003cdiv class=\"article\" data-id=\"a001\"\u003e\n        \u003ch2\u003eHeading 1\u003c/h2\u003e\n        \u003cp\u003eLorem ipsum dolor sit amet, consectetur adipiscing elit.\u003c/p\u003e\n    \u003c/div\u003e\n    \u003cdiv class=\"article\" data-id=\"a002\"\u003e\n        \u003ch2\u003eHeading 2\u003c/h2\u003e\n        \u003cp\u003eDonec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.\u003c/p\u003e\n    \u003c/div\u003e\n    \u003cdiv class=\"article\" data-id=\"a003\"\u003e\n        \u003ch2\u003eHeading 3\u003c/h2\u003e\n        \u003cp\u003eSuspendisse viverra convallis risus, vitae molestie est tincidunt eget.\u003c/p\u003e\n    \u003c/div\u003e\n\u003c/div\u003e\n\u003cfooter\u003e\n    \u003cp\u003e\u0026copy; \u003cspan\u003eCompany\u003c/span\u003e 2017\u003c/p\u003e\n\u003c/footer\u003e\n```\n\n```js\nconst htmlMiner = require('html-miner');\n\nlet json = htmlMiner(html, {\n    title: 'h1',\n    who: 'h1 span',\n    h2: 'h2',\n    articlesArray: {\n        _each_: '.articles .article',\n        title: 'h2',\n        content: 'p',\n    },\n    articlesObject: {\n        _each_: '.articles .article',\n        _eachId_: function(arg) {\n            return arg.$scope.data('id');\n        },\n        title: 'h2',\n        content: 'p',\n    },\n    footer: {\n        _container_: 'footer',\n        copyright: (arg) =\u003e { 
return arg.$scope.text().trim(); },\n        company: 'span',\n        year: (arg) =\u003e { return arg.scopeData.copyright.match(/[0-9]+/)[0]; },\n    },\n    greet: () =\u003e { return 'Hi!'; }\n});\n\nconsole.log( json );\n\n//=\u003e {\n//     title: 'Hello, world!',\n//     who: 'world',\n//     h2: ['Heading 1', 'Heading 2', 'Heading 3'],\n//     articlesArray: [\n//         {\n//             title: 'Heading 1',\n//             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',\n//         },\n//         {\n//             title: 'Heading 2',\n//             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',\n//         },\n//         {\n//             title: 'Heading 3',\n//             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',\n//         }\n//     ],\n//     articlesObject: {\n//         'a001': {\n//             title: 'Heading 1',\n//             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',\n//         },\n//         'a002': {\n//             title: 'Heading 2',\n//             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',\n//         },\n//         'a003': {\n//             title: 'Heading 3',\n//             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',\n//         }\n//     },\n//     footer: {\n//         copyright: '© Company 2017',\n//         company: 'Company',\n//         year: '2017'\n//     },\n//     greet: 'Hi!'\n//   }\n\n```\n\nYou can find other examples under the folder `/examples`\n```sh\n# you can test examples with nodejs\nnode examples/demo.js\nnode examples/site.js\n```\n\n\n## Development\n\n```sh\nnpm install\nnpm test\n\n# start the playground locally\nnpm 
start\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcomontalbano%2Fhtml-miner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarcomontalbano%2Fhtml-miner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcomontalbano%2Fhtml-miner/lists"}