{"id":26879703,"url":"https://github.com/dhjz/spider","last_synced_at":"2026-04-28T08:33:14.734Z","repository":{"id":234411133,"uuid":"788841094","full_name":"dhjz/spider","owner":"dhjz","description":"Pure front-end JavaScript page scraper: supports paginated crawling and saves the content as JSON (spider)","archived":false,"fork":false,"pushed_at":"2024-05-06T03:57:24.000Z","size":16,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-05-06T04:40:23.800Z","etag":null,"topics":["json","spider","spider-js","vue-spider"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dhjz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-04-19T07:31:27.000Z","updated_at":"2024-05-06T03:57:27.000Z","dependencies_parsed_at":"2024-04-19T08:40:55.004Z","dependency_job_id":"eeafaa30-ef8a-41a7-97f8-e67805301731","html_url":"https://github.com/dhjz/spider","commit_stats":null,"previous_names":["dhjz/spider"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dhjz%2Fspider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dhjz%2Fspider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dhjz%2Fspider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dhjz%2Fspider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dhjz","download_url":"https://codeload.github.com/dhjz/spider/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https
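://github.com","kind":"github","repositories_count":246473858,"owners_count":20783359,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["json","spider","spider-js","vue-spider"],"created_at":"2025-03-31T13:31:41.571Z","updated_at":"2026-04-28T08:33:09.707Z","avatar_url":"https://github.com/dhjz.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Foreword\nThere are countless scraping tools, but most of them depend on a back end to some degree. To keep things simple, I tried writing a tool that scrapes website data purely in the front end.  \n\n## Approach\n- Scrape the list page and collect all article links\n- Scrape the required content from each article link\n- Export the content in a given format (JSON in this case)\n\n## Tricks\nScraping a website may run into cross-origin restrictions. Here are three ways to work around them:  \n1. Use a proxy. This is fairly simple: with a little back-end knowledge you can quickly build a proxy in Node.js, Go, or Java that forwards the request and returns the HTML.\n2. Launch Chrome with its cross-origin flags (the approach used here). No back end is needed at all; cross-origin requests are made purely from the front end.\n3. Use a free cloud function, so the web page calls the proxy directly.\n```shell\n# Open Chrome with cross-origin flags (.bat)\n\"C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe\" --disable-web-security --user-data-dir=E:\\AllCache\\Chrome --allow-running-insecure-content %1\n```\n## Tech stack\n- vue3\n- axios\n\n## Feature overview\n\n\n## Implementation steps\n### Parameters\n- Scrape URL: the URL to scrape, usually the address of the first list page\n- Article a-tag selector: the a tags on the list page whose article-page URLs should be collected\n- Article body selector: the content to scrape from the article page\n- Article title selector: the title to scrape from the article page\n- Article date selector: the date to scrape from the article page\n- Content to delete, separated by |: content to remove from the body, such as fixed advertising text; regular-expression matching is supported\n- Tags to keep: filters the HTML tags in the scraped content and keeps only the specified ones, e.g. p, img\n- Pagination parameters: pagination is supported, so dozens of pages can be scraped in one run\n- Scrape delay: the delay per page, to avoid being blocked for requesting too frequently\n- Custom article-page fields: other content on the article page can also be scraped, such as author or view count\n### Scraping list pages\n#### Scraping a single list page\n- Given the URL and the article a-tag selector, request the page's HTML and extract all article titles and links  \n- Note that article URLs must be converted to absolute URLs, using the URL API\n- The result is returned after a setTimeout delay, which is configurable\n```javascript\nfunction getListData(options) {\n  const result = { total: 0, list: [] }\n  return new Promise((reso) =\u003e {\n    let url = options.proxyUrl ? 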
(options.proxyUrl + options.url) : options.url\n    axios.get(url, { responseType: 'document' }).then((res) =\u003e {\n      if (res.status != 200 || !res.data) return reso(result)\n      let articleList = Array.from(res.data.querySelectorAll(options.articlelSelector))\n\n      // getAttribute is synchronous, so no async/await is needed here\n      articleList.forEach((el) =\u003e {\n        const href = el.getAttribute('href')\n        result.list.push({\n          // Resolve relative links against the list page URL\n          href: /http[s]?:\\/\\//.test(href) ? href : new URL(href, options.url).toString(),\n          title: el.title ? el.title.trim() : el.innerText.trim()\n        })\n      })\n      result.total = articleList.length\n      console.log('getListData total -\u003e ' + result.total + ', url -\u003e ' + options.url)\n      // Delay the result to throttle the request rate\n      setTimeout(() =\u003e reso(result), options.delay || 10)\n    }).catch(() =\u003e reso(result))\n  })\n}\n```\n#### Scraping paginated list pages\n- Use a for loop with await so that pages are scraped in order\n- Accept a callback that reports scraping progress for preview\n```javascript\nasync function getPageListData(options, callback) {\n  const result = { total: 0, list: [] }\n  if (!options.pageUrl) return result\n  // Reuse a list that has already been scraped\n  if (options.list \u0026\u0026 options.list.length) return { total: options.list.length, list: options.list }\n\n  // Pagination disabled: scrape only the single list page\n  if (options.isPage != 1) {\n    return getListData({ ...options })\n  }\n\n  const pageStart = parseInt(options.pageStart, 10)\n  const pageEnd = parseInt(options.pageEnd, 10)\n  const start = Math.min(pageStart, pageEnd)\n  const end = Math.max(pageStart, pageEnd)\n  // The first iteration (start - 1) scrapes options.url itself; later ones fill {ID} in pageUrl\n  for (let i = start - 1; i \u003c= end; i++) {\n    if (i != start - 1) options.url = options.pageUrl.replace('{ID}', i)\n    const data = await getListData({ ...options })\n    result.total += data.total\n    result.list = result.list.concat(data.list)\n    callback \u0026\u0026 callback(i, end)\n  }\n  return result\n}\n```\n### Scraping article page content\n- Rewrite all href and src links in the body to absolute URLs\n- Remove all script tags, since they are not needed\n- Extract and return the content according to the provided title, date, and custom field parameters\n- Paginated article scraping follows the same idea as `Scraping paginated list pages`\n```javascript\nfunction 
getContData(options) {\n  return new Promise((reso) =\u003e {\n    let url = options.proxyUrl ? (options.proxyUrl + options.href) : options.href\n    axios.get(url, { responseType: 'document' }).then((res) =\u003e {\n      if (res.status != 200 || !res.data) return reso({})\n      let contEl = res.data.querySelector(options.contSelector)\n      if (!contEl) return reso({})\n      // Rewrite href and src attributes to absolute URLs (getAttribute is synchronous)\n      Array.from(contEl.querySelectorAll('*[href]')).forEach(el =\u003e {\n        const href = el.getAttribute('href')\n        el.href = /http[s]?:\\/\\//.test(href) ? href : new URL(href, options.url).toString()\n      })\n      Array.from(contEl.querySelectorAll('*[src]')).forEach(el =\u003e {\n        const src = el.getAttribute('src')\n        el.src = /http[s]?:\\/\\//.test(src) ? src : new URL(src, options.url).toString()\n      })\n      Array.from(contEl.querySelectorAll('script')).forEach(el =\u003e {\n        el.parentNode.removeChild(el) // Remove all script tags\n      })\n      setTimeout(() =\u003e {\n        // Keep plain text only, or keep just the tags listed in options.tags\n        let cont = options.isText == '1' ? pureHtml(contEl.innerHTML)\n          : pureHtml(contEl.innerHTML, options.tags ? options.tags.trim().replaceAll('，',',').split(',') : null)\n        if (options.contDel) {\n          // Each |-separated item is treated as a regular expression to delete\n          options.contDel.split('|').forEach(item =\u003e cont = cont.replace(new RegExp(item, 'ig'), ''))\n        }\n        const title = options.titleSelector ? res.data.querySelector(options.titleSelector)?.innerText : null\n        const date = options.dateSelector ? 
res.data.querySelector(options.dateSelector)?.innerText : null\n        const result = { title, cont, date }\n        // Custom fields: each entry maps a result key to a CSS selector\n        if (options.fields \u0026\u0026 options.fields.length) {\n          options.fields.filter(x =\u003e x.key \u0026\u0026 x.val).forEach(x =\u003e {\n            result[x.key] = res.data.querySelector(x.val)?.innerText || ''\n          })\n        }\n        reso(result)\n      }, options.delay || 10)\n    }).catch(() =\u003e reso(''))\n  })\n}\n```\n### Saving the data to a JSON file\n- Copied the file-saving code commonly found online\n- Supports saving both minified JSON and pretty-printed JSON\n```javascript\n// Code in the scraping flow\ngetPageContData({ ...this.form }, this.updateProcess).then(data =\u003e {\n  console.log(data)\n  saveText(JSON.stringify(data, null, isBeauty ? 2 : 0), 'test.json')\n})\n\n// Utility functions\nwindow.saveAs = function(blob, name) {\n    // Trigger a download through a temporary link to an object URL\n    let link = document.createElement('a')\n    let href = window.URL.createObjectURL(blob)\n    link.href = href\n    link.download = name\n    document.body.appendChild(link)\n    link.click()\n    document.body.removeChild(link)\n    window.URL.revokeObjectURL(href)\n}\n\nwindow.saveText = function(text, name) {\n  window.saveAs(new Blob([text], {type: \"text/plain;charset=utf-8\"}), name || (new Date().getTime() + '.txt'))\n}\n// HTML tag cleanup utility function....\n```\n\n## Afterword\nI built this mainly out of concern that some websites could shut down at any time and take their data with them, so it helps to keep a backup. For learning purposes only.\n\n## Live demo\n[https://dhjz.github.io/spider](https://dhjz.github.io/spider)  \n- Source code: [https://github.com/dhjz/spider](https://github.com/dhjz/spider)\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdhjz%2Fspider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdhjz%2Fspider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdhjz%2Fspider/lists"}