{"id":21065051,"url":"https://github.com/prolificinteractive/node-html-to-json","last_synced_at":"2025-04-09T23:20:07.910Z","repository":{"id":57267661,"uuid":"27499027","full_name":"prolificinteractive/node-html-to-json","owner":"prolificinteractive","description":"Parses HTML strings into objects using flexible, composable filters.","archived":false,"fork":false,"pushed_at":"2017-06-28T14:40:34.000Z","size":144,"stargazers_count":121,"open_issues_count":7,"forks_count":14,"subscribers_count":34,"default_branch":"master","last_synced_at":"2025-04-02T22:08:52.401Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prolificinteractive.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-12-03T17:32:48.000Z","updated_at":"2025-01-07T06:16:42.000Z","dependencies_parsed_at":"2022-09-02T05:40:59.566Z","dependency_job_id":null,"html_url":"https://github.com/prolificinteractive/node-html-to-json","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prolificinteractive%2Fnode-html-to-json","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prolificinteractive%2Fnode-html-to-json/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prolificinteractive%2Fnode-html-to-json/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prolificinteractive%2Fnode-html-to-json/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prolifi
cinteractive","download_url":"https://codeload.github.com/prolificinteractive/node-html-to-json/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248126061,"owners_count":21051862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T17:53:18.363Z","updated_at":"2025-04-09T23:20:07.867Z","avatar_url":"https://github.com/prolificinteractive.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"![HTML to JSON](html-to-json.jpg)\n\nParses HTML strings into objects using flexible, composable filters.\n\n## Installation\n\n`npm install html-to-json`\n\n## htmlToJson.parse(html, filter, [callback]) -\u003e promise\n\nThe `parse()` method takes a string of HTML, and a filter, and responds with the filtered data. 
This supports both callbacks and promises.\n\n```javascript\nvar promise = htmlToJson.parse('\u003cdiv\u003econtent\u003c/div\u003e', {\n  'text': function ($doc) {\n    return $doc.find('div').text();\n  }\n}, function (err, result) {\n  console.log(result);\n});\n\npromise.done(function (result) {\n  //Works as well\n});\n```\n\n## htmlToJson.request(requestOptions, filter, [callback]) -\u003e promise\n\nThe `request()` method takes options for a call to the [request](https://github.com/request/request) library and a filter, then returns the filtered response body.\n\n```javascript\nvar promise = htmlToJson.request('http://prolificinteractive.com/team', {\n  'images': ['img', function ($img) {\n    return $img.attr('src');\n  }]\n}, function (err, result) {\n  console.log(result);\n});\n```\n\n## htmlToJson.batch(html, dictionary, [callback]) -\u003e promise\n\nPerforms many parsing operations against one HTML string. This transforms the HTML into a DOM only once instead of for each filter in the dictionary, which can quickly get expensive in terms of processing. This also allows you to break your filters up into more granular components and mix and match them as you please.\n\nThe values in the dictionary can be `htmlToJson.Parser` objects, generated methods from `htmlToJson.createMethod`, or naked filters that you might normally pass into `htmlToJson.parse`. 
For example:\n\n```javascript\nreturn getProlificHomepage().then(function (html) {\n  return htmlToJson.batch(html, {\n    sections: htmlToJson.createParser(['#primary-nav a', {\n      'name': function ($section) {\n        return $section.text();\n      },\n      'link': function ($section) {\n        return $section.attr('href');\n      }\n    }]),\n    offices: htmlToJson.createMethod(['.office', {\n      'location': function ($office) {\n        return $office.find('.location').text();\n      },\n      'phone': function ($office) {\n        return $office.find('.phone').text();\n      }\n    }]),\n    socialInfo: ['#footer .social-link', {\n      'name': function ($link) {\n        return $link.text();\n      },\n      'link': function ($link) {\n        return $link.attr('href');\n      }\n    }]\n  });\n});\n```\n\n## htmlToJson.createMethod(filter) -\u003e function (html, [callback])\n\nGenerates a method that wraps the passed `filter` argument. The generated method takes an HTML string and processes it against that `filter`.\n\n```javascript\nvar parseFoo = htmlToJson.createMethod({\n  'foo': function ($doc) {\n    return $doc.find('#foo').bar();\n  }\n});\n```\n\n## htmlToJson.createParser(filter), new htmlToJson.Parser(filter)\n\nFor the sake of reusability, creates an object with `.parse` and `.request` helper methods, which use the passed filter. 
For example:\n\n```javascript\nvar linkParser = htmlToJson.createParser(['a[href]', {\n  'text': function ($a) {\n    return $a.text();\n  },\n  'href': function ($a) {\n    return $a.attr('href');\n  }\n}]);\n\nlinkParser.request('http://prolificinteractive.com').done(function (links) {\n  //Do stuff with links\n});\n```\n\nis equivalent to:\n\n```javascript\nhtmlToJson.request('http://prolificinteractive.com', ['a[href]', {\n  'text': function ($a) {\n    return $a.text();\n  },\n  'href': function ($a) {\n    return $a.attr('href');\n  }\n}]).done(function (links) {\n  //Do stuff with links\n});\n```\n\nThe former allows you to easily reuse the filter (and make it testable), while the latter is a one-off.\n\n### parser.parse(html, [callback])\n\nParses the passed html argument against the parser's filter.\n\n### parser.method(html, [callback])\n\nReturns a method that wraps `parser.parse()`.\n\n### parser.request(requestOptions, [callback])\n\nMakes a request with the request options, then runs the response body through the parser's filter.\n\n## Filter Types\n\n### Functions\n\nThe return values of functions are mapped against their corresponding keys. 
Function filters are passed [cheerio](https://github.com/cheeriojs/cheerio) objects, which allows you to play with a jQuery-like interface.\n\n```javascript\nhtmlToJson.parse('\u003cdiv id=\"foo\"\u003efoo\u003c/div\u003e', {\n  'foo1': function ($doc, $) {\n    return $doc.find('#foo').text(); //foo\n  }\n}, callback);\n```\n\n### Arrays\n\nArrays of data can be parsed out by either using the .map() method within a filter function or using the shorthand [selector, filter] syntax:\n\n#### .map(selector, filter)\n\nA filter is applied incrementally against each matched element, and the results are returned within an array.\n\n```javascript\nvar html = '\u003cdiv id=\"items\"\u003e\u003cdiv class=\"item\"\u003e1\u003c/div\u003e\u003cdiv class=\"item\"\u003e2\u003c/div\u003e\u003c/div\u003e';\n\nhtmlToJson.parse(html, function () {\n  return this.map('.item', function ($item) {\n    return $item.text();\n  });\n}).done(function (items) {\n  // Items should be: ['1','2']\n}, function (err) {\n  // Handle error\n});\n```\n\n#### [selector, filter, after]\n\nThis is essentially a short-hand alias for `.map()`, making the filter look more like its output:\n\n```javascript\nvar html = '\u003cdiv id=\"items\"\u003e\u003cdiv class=\"item\"\u003e1\u003c/div\u003e\u003cdiv class=\"item\"\u003e2\u003c/div\u003e\u003c/div\u003e';\n\nhtmlToJson\n  .parse(html, ['.item', function ($item) {\n    return $item.text();\n  }])\n  .done(function (items) {\n    // Items should be: ['1','2']\n  }, function (err) {\n    // Handle error\n  });\n```\n\nAs an added convenience you can pass in a 3rd argument into the array filter, which allows you to manipulate the results. 
You can return a promise if you wish to do an asynchronous operation.\n\n```javascript\nvar html = '\u003cdiv id=\"items\"\u003e\u003cdiv class=\"item\"\u003e1\u003c/div\u003e\u003cdiv class=\"item\"\u003e2\u003c/div\u003e\u003c/div\u003e';\n\nhtmlToJson\n  .parse(html, ['.item', function ($item) {\n    return +$item.text();\n  }, function (items) {\n    return _.map(items, function (item) {\n      return item * 3;\n    });\n  }])\n  .done(function (items) {\n    // Items should be: [3,6]\n  }, function (err) {\n    // Handle error\n  });\n```\n\n### Asynchronous filters\n\nFilter functions may also return promises, which get resolved asynchronously.\n\n```javascript\nfunction getProductDetails (id, callback) {\n  return htmlToJson.request({\n    uri: 'http://store.prolificinteractive.com/products/' + id\n  }, {\n    'id': function ($doc) {\n      return $doc.find('#product-details').attr('data-id');\n    },\n    'colors': ['.color', {\n      'id': function ($color) {\n        return $color.attr('data-id');\n      },\n      'hex': function ($color) {\n        return $color.css('background-color');\n      }\n    }]\n  }, callback);\n}\n\nfunction getProducts (callback) {\n  return htmlToJson.request({\n    uri: 'http://store.prolificinteractive.com'\n  }, ['.product', {\n    'id': function ($product) {\n      return $product.attr('data-id');\n    },\n    'image': function ($product) {\n      return $product.find('img').attr('src');\n    },\n    'colors': function ($product) {\n      // This is where we use a promise to get the colors asynchronously\n      return this\n        .get('id')\n        .then(function (id) {\n          return getProductDetails(id).get('colors');\n        });\n    }\n  }], callback);\n}\n```\n\n### Dependencies on other values\n\nFilter functions may call `this.get(propertyName)` to use a value from another key in that filter. 
This returns a promise representing the value rather than the value itself.\n\n```javascript\nfunction getProducts (callback) {\n  return htmlToJson.request('http://store.prolificinteractive.com', ['.product', {\n    'id': function ($product) {\n      return $product.attr('data-id');\n    },\n    'image': function ($product) {\n      return $product.find('img').attr('src');\n    },\n    'colors': function ($product) {\n      // Resolve 'id' then get product details with it\n      return this\n        .get('id')\n        .then(function (id) {\n          return getProductDetails(id).get('colors');\n        });\n    }\n  }], callback);\n}\n```\n\n### Objects\n\nNested objects within a filter are run against the same HTML context as the parent filter.\n\n```javascript\nvar html = '\u003cdiv id=\"foo\"\u003e\u003cdiv id=\"bar\"\u003efoobar\u003c/div\u003e\u003c/div\u003e';\n\nhtmlToJson.parse(html, {\n  'foo': {\n    'bar': function ($doc) {\n      return $doc.find('#bar').text();\n    }\n  }\n});\n```\n\n#### $container modifier\n\nYou may specify a more specific DOM context by setting the $container property on the object filter:\n\n```javascript\nvar html = '\u003cdiv id=\"foo\"\u003e\u003cdiv id=\"bar\"\u003efoobar\u003c/div\u003e\u003c/div\u003e';\n\nhtmlToJson.parse(html, {\n  'foo': {\n    $container: '#foo',\n    'bar': function ($foo) {\n      return $foo.find('#bar').text();\n    }\n  }\n});\n```\n\n### Constants\n\nStrings, numbers, and null values are simply used as the filter's value. This especially comes in handy for incrementally converting from mock data to parsed data.\n\n```javascript\nhtmlToJson.parse('\u003cdiv id=\"nada\"\u003e\u003c/div\u003e', {\n  x: 1,\n  y: 'string value',\n  z: null\n});\n```\n\n## Contributing\n\n### Running Tests\n\nTests are written in mocha and located in the `test` directory. 
Run them with:\n\n`npm test`\n\nThis script also executes `jshint` against the `lib/` and `test/` directories.\n\n### Style\n\nPlease read the existing code in order to learn the conventions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprolificinteractive%2Fnode-html-to-json","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprolificinteractive%2Fnode-html-to-json","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprolificinteractive%2Fnode-html-to-json/lists"}