{"id":15946159,"url":"https://github.com/eces/dom-collector","last_synced_at":"2025-04-03T22:26:16.070Z","repository":{"id":36099324,"uuid":"40400455","full_name":"eces/dom-collector","owner":"eces","description":"A simple DOM crawler based on JSON scheme.","archived":false,"fork":false,"pushed_at":"2017-08-18T05:56:42.000Z","size":14,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-12T10:43:12.070Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"CoffeeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eces.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-08-08T11:13:14.000Z","updated_at":"2017-01-29T05:48:22.000Z","dependencies_parsed_at":"2022-08-26T13:41:25.059Z","dependency_job_id":null,"html_url":"https://github.com/eces/dom-collector","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eces%2Fdom-collector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eces%2Fdom-collector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eces%2Fdom-collector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eces%2Fdom-collector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eces","download_url":"https://codeload.github.com/eces/dom-collector/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247087784,"owners_count"
:20881459,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-07T09:20:29.323Z","updated_at":"2025-04-03T22:26:16.053Z","avatar_url":"https://github.com/eces.png","language":"CoffeeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DOM Collector\n\n[![npm version](https://badge.fury.io/js/dom-collector.svg)](http://badge.fury.io/js/dom-collector)\n\nIt transforms a given URL into key-value JSON according to a rule specification.\n\n\n### Install\n\n`npm install --save dom-collector`\n\n### Features\n\nUnder the hood, it does the following:\n\n- Validates the rule specification you pass.\n\n- Loads the web page with the well-known library [request](https://github.com/request/request).\n\n- Parses and fetches elements with the proven DOM selector library [cheerio](https://github.com/cheeriojs/cheerio); it might be better than jsdom.\n\n- Filters values and fills in the configured default value.\n\n- Places collected values into a JSON object; repeated elements go into a JSON array.\n\n- Returns a thenable [Promise](https://github.com/petkaantonov/bluebird) that resolves asynchronously.\n\n### Example\n\nFor this HTML body\n\n```html\n\u003cul id=\"content-list\"\u003e\n  \u003cli data-id=\"1\"\u003e\n    \u003ca href=\"#\"\u003e aaa \u003c/a\u003e\n  \u003c/li\u003e\n  \u003cli data-id=\"2\"\u003e\n    \u003ca href=\"#\"\u003e bbb \u003c/a\u003e\n  \u003c/li\u003e\n  \u003cli data-id=\"3\"\u003e\n    \u003ca href=\"#\"\u003e\u003c/a\u003e\n  \u003c/li\u003e\n\u003c/ul\u003e\n```\n\nAdd the rule below\n\n```coffee\ncollector = require 'dom-collector'\n\nrule 
=\n  url: 'https://gist.githubusercontent.com/eces/f8d377992a12f64dc353/raw/75fd1607925e12bb82fdc7890514a3899781531d/test-01.html'\n  timeout: 15000\n  encoding: 'utf8'\n  params: []\n  headers: \n    'User-Agent': 'Mozilla/5.0(iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B314 Safari/531.21.10'\n  selector: [\n    {\n      key: 'items[]'\n      value: '#content-list li'\n      type: 'array'\n      default: []\n    }\n    {\n      key: 'items[].label'\n      value: 'a'\n      type: 'string'\n      filter: 'trim'\n      default: 'default'\n    }\n    {\n      key: 'items[].src'\n      value: '[data-id]'\n      type: 'number'\n    }\n  ]\n\ntask = collector.fetch_json rule\ntask.then (result) -\u003e\n  console.log result\n```\n\nThen it yields the result\n\n```json\n{\n  \"items\": [\n    { \"label\": \"aaa\", \"src\": 1 },\n    { \"label\": \"bbb\", \"src\": 2 },\n    { \"label\": \"default\", \"src\": 3 }\n  ]\n}\n```\n\n### Functions\n\n#### `fetch_json(rule: Object)`\n\n\u003e ```\n\u003e require('dom-collector').fetch_json(rule);\n\u003e ```\n\n### Rule(selector) specification\n\n#### Value\n\nThis is the DOM selector used to find the value for a key. It supports querySelector-style and jQuery-like selectors. Where you would write `$('#content')`, this value should be `#content`.\n\n#### Key\n\nThis key is created in the result JSON. If a key uses the `[]` array notation, it becomes a parent key, and every key prefixed with `parent[]` becomes a child of that parent. 
If the parent key has no entries, its children may not be resolved from the empty array.\n\n#### Type\n\n`string`, `number`, `boolean`\n\nPlease note that the default value is used if type casting fails.\n\n#### Default\n\nThe default value replaces the value if no element is found, and also\n\n  - when type is `string` and the string length is zero.\n  - when type is `number` and the value fails `isFinite`: NaN, Infinity, undefined.\n\n#### Match\n\nThis regular expression is evaluated, and the first value is returned.\n\n`100` can be extracted from `\u003cli onclick=\"contentView(100, 3);\"\u003e\u003c/li\u003e` with the matcher below:\n\n```coffee\nmatch: \"contentView\\\\(([0-9]+)\\\\,\"\n```\n\n\n#### Filter\n\nReference: eces/dom-collector/src/filter.coffee\n\n##### strip_filesize\n\n`70.5M` to `70500`\n \n##### strip_comma\n\n`1,000,000` to `1000000`\n\n##### trim\n\n`\"\\r\\n hello. \"` to `\"hello.\"`\n\n##### string\n\n`value` to `String(value)`\n\n##### number\n\n`value` to `Number(value)`\n\n##### boolean\n\n`value` to `Boolean(value)`\n\n##### Using custom function\n\nThe value is transformed directly by the given function, which must handle any value, including `null` and `undefined`.\n\n```\nfilter: (v) -\u003e '(' + String(v).trim() + ')'\n```\n\nPlease be aware of unintended boolean conversion; see [MDN - Boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Boolean).\n\n\u003e The value passed as the first parameter is converted to a boolean value, if necessary. If value is omitted or is 0, -0, null, false, NaN, undefined, or the empty string (\"\"), the object has an initial value of false. 
All other values, including any object or the string \"false\", create an object with an initial value of true.\n\n\u003e Do not confuse the primitive Boolean values true and false with the true and false values of the Boolean object.\n\n\u003e Any object whose value is not undefined or null, including a Boolean object whose value is false, evaluates to true when passed to a conditional statement.\n\n### Development\n\n`grunt build`\n`grunt test`\n\n### Contribution\n\nWelcome\n\n\n### License\n\nUnder MIT License.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feces%2Fdom-collector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feces%2Fdom-collector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feces%2Fdom-collector/lists"}