{"id":23246328,"url":"https://github.com/tvrcgo/collect","last_synced_at":"2025-04-06T00:21:36.471Z","repository":{"id":145688230,"uuid":"46268145","full_name":"tvrcgo/collect","owner":"tvrcgo","description":"数据采集","archived":false,"fork":false,"pushed_at":"2015-12-31T03:46:53.000Z","size":29,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-12T06:31:22.519Z","etag":null,"topics":["crawler","scraper"],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tvrcgo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-11-16T10:34:29.000Z","updated_at":"2023-02-02T20:40:47.000Z","dependencies_parsed_at":"2023-04-17T09:36:09.330Z","dependency_job_id":null,"html_url":"https://github.com/tvrcgo/collect","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tvrcgo%2Fcollect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tvrcgo%2Fcollect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tvrcgo%2Fcollect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tvrcgo%2Fcollect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tvrcgo","download_url":"https://codeload.github.com/tvrcgo/collect/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247417558,"owners_count":20935669,"icon_url":"https://github.com
/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","scraper"],"created_at":"2024-12-19T07:14:04.497Z","updated_at":"2025-04-06T00:21:36.439Z","avatar_url":"https://github.com/tvrcgo.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# collect\nData collection\n\n## Install\n```sh\nnpm i tvrcgo/collect\n```\n\n## Usage\n\n#### Collecting data\nStream-based processing; chain as many handler functions as you need.\n```js\nvar collect = require('collect');\n\ncollect.src('http://example.com')\n    .use(function(data, next){\n        // the page content is in data.content\n        // process data -\u003e data2\n        next(data2);\n    })\n    .use(function(data2, next){\n        // process data2 -\u003e data3\n        next(data3);\n    })\n```\n\n#### Selecting elements\nSelectors support an index, an attribute, or HTML content; by default the element's text is returned.\n- `[]` index\n- `@` attribute\n- `:html` HTML content\n\n```js\ncollect.src('http://example.com')\n    .query({\n        li: '.list li[1]', // a single element selected by index\n        imgs: '.list li img@src', // src attribute values of multiple img elements\n        cols: ['.list li', 'img@src, a@href, a'], // values of several elements under each li element\n        html: '.list li:html' // HTML content\n    })\n    .use(function(data, next){\n        // data.li\n        // data.imgs\n        // data.cols\n        // data.html\n    })\n```\n\n#### Setting the User-Agent and proxy\n```js\ncollect.src('http://example.com', {\n    userAgent: \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 ...\",\n    proxy: \"http://191.26.14.23:8000\"\n})\n.use(function(data, next){\n    // process data\n    next(data);\n})\n```\n\n#### Collecting content from Ajax pages\n- `javascript` set to true to allow the page to execute JavaScript\n- `delay` once the page has sent or received its last request and no further activity occurs within the `delay` period, Ajax loading is considered complete\n- `timeout` 
page load timeout\n\n```js\ncollect.src('http://example.com', {\n        javascript: true,\n        delay: 1000*5,\n        timeout: 1000*15\n    })\n    .use(function(data, next){\n        // process data\n    })\n```\n\n#### Output\nWrite the processed data to a file\n```js\ncollect.src('http://example.com')\n    .use(function(data, next){\n        // process data\n        next(data);\n    })\n    .dest('body.csv');\n```\n\nPipe to another stream\n```js\ncollect.src('http://example.com')\n    .use(function(data, next){\n        // process data\n        next(data);\n    })\n    .pipe(stream);\n```\n\n## License\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftvrcgo%2Fcollect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftvrcgo%2Fcollect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftvrcgo%2Fcollect/lists"}