{"id":13746782,"url":"https://github.com/ramsayleung/jd_spider","last_synced_at":"2025-09-28T21:14:15.546Z","repository":{"id":92957104,"uuid":"94893536","full_name":"ramsayleung/jd_spider","owner":"ramsayleung","description":"Two dumb distributed crawlers","archived":false,"fork":false,"pushed_at":"2019-04-08T06:14:08.000Z","size":1619,"stargazers_count":725,"open_issues_count":2,"forks_count":208,"subscribers_count":44,"default_branch":"master","last_synced_at":"2024-08-03T09:05:55.750Z","etag":null,"topics":["docker","graphite","mongodb","python3","scrapy"],"latest_commit_sha":null,"homepage":"https://ramsayleung.github.io/zh/post/2017/jd_spider/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ramsayleung.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-20T13:21:01.000Z","updated_at":"2024-07-18T07:26:51.000Z","dependencies_parsed_at":"2023-04-12T23:16:30.578Z","dependency_job_id":null,"html_url":"https://github.com/ramsayleung/jd_spider","commit_stats":null,"previous_names":["samrayleung/jd_spider"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ramsayleung%2Fjd_spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ramsayleung%2Fjd_spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ramsayleung%2Fjd_spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ramsayleung%2Fjd_spider/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/ramsayleung","download_url":"https://codeload.github.com/ramsayleung/jd_spider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224842448,"owners_count":17378977,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","graphite","mongodb","python3","scrapy"],"created_at":"2024-08-03T06:01:01.342Z","updated_at":"2025-09-28T21:14:10.479Z","avatar_url":"https://github.com/ramsayleung.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\n* Notice\nJD has updated its anti-crawling measures, so the spiders in this repo may no longer be able to fetch any content. I wrote this crawler in my third year of university; more than two years later I am working full time and have neither the time nor the energy to keep updating the counter-anti-crawling strategies, so this project is no longer maintained.\n* Overview\nA distributed JD crawler built with scrapy, scrapy-redis, and graphite, with mongodb as the underlying storage. The distributed\ndesign removes the bandwidth and performance bottlenecks of a single machine and improves crawl throughput. scrapy-redis handles url deduplication\nand scheduling, and because redis is fast and easy to scale, high download throughput is easy to reach: when redis storage or access\nspeed becomes the bottleneck, it can be relieved by increasing the size of the redis cluster and of the crawler cluster.\n* Version support\n  Both Py2 and Py3 are supported. Note that, to stay compatible with Py2, Graphite is disabled by default. To enable it, use Py3 and change the ~ENABLE_GRAPHITE~ field in settings.py, which defaults to False.\n* Crawl strategy\n  Extract the url from each ~\u003ca href\u003e~ tag and crawl it iteratively, restricting urls to the ~xxx.jd.com~\n  domain so that the crawl does not grow without bound. When crawling a product page, every variant of the same product\n  is crawled as well, for example the 32 GB, 64 GB, and 128 GB iPhone.\n* Request deduplication\n  Requests are deduplicated with ~scrapy_redis.dupefilter.RFPDupeFilter~. The logic that enqueues a request is in\n  [[https://github.com/rmax/scrapy-redis/blob/31c022dd145654cb4ea1429f09852a82afa0a01c/src/scrapy_redis/scheduler.py#L153][enqueue_request]],\n  and the actual deduplication check calls\n  [[https://github.com/scrapy/scrapy/blob/acd2b8d43b5ebec7ffd364b6f335427041a0b98d/scrapy/utils/request.py#L19][scrapy.utils.request.request_fingerprint]].\n* Product deduplication\n  Products are deduplicated with Redis: each product's sku-id is stored in Redis, and before a full product record is inserted into\n  Mongodb, Redis is checked to see whether that sku-id already exists.\n* Counter-anti-crawling strategies\n** Disable cookies\n   With cookies disabled, the server cannot use cookies to tell whether the crawler has visited the site before.\n** Pose as a search engine\n   The user-agent can be changed to pose as a search engine crawler:\n   #+BEGIN_SRC \n    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',\n    'Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm)',\n    'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',\n    'DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)',\n    'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',\n    'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)',\n    'ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)',\n   #+END_SRC\n** Rotate the user-agent\n   To improve the odds of getting past anti-crawling measures, several user-agents are defined and one is picked at random for\n   each request. This crawler implements a ~RotateUserAgentMiddleware~ class to do the\n   rotation.\n** Proxy IPs\n   Use proxy IPs to avoid having your IP banned.\n* Crawler status monitoring\n  Crawler stats (request count, downloaded item count, dropped item count, logs) are saved to redis.\n  A stats collector designed for distributed crawling is implemented, and its results are displayed by graphite as live, dynamically updating charts.\n* Concurrent requests and depth control\n  The number of concurrent requests is controlled by ~CONCURRENT_REQUESTS = 32~ in ~settings.py~, and the crawler's\n  recursion depth is controlled by the ~DEPTH_LIMIT=max~ parameter of the ~DepthMiddleware~ class.\n* Dependencies\n  + python 3.5+\n  + scrapy\n  + scrapy-redis\n  + pymongo\n  + graphite (optional)\n* How to run\n  #+BEGIN_SRC  sh\n    git clone  https://github.com/samrayleung/jd_spider.git \n  #+END_SRC\n  Then install the python dependencies:\n  #+BEGIN_SRC sh\n    (sudo) pip install -r requirements.txt\n  #+END_SRC\n** Install Graphite (optional)\n*** Install with docker\n    Install and configure graphite. Note that graphite only runs on Linux and is very tedious to install,\n    so using docker is strongly recommended. I took the [[https://github.com/hopsoft/docker-graphite-statsd][docker-graphite-statsd]]\n    graphite image and made a few changes to its configuration files to fit scrapy. Run the following command to pull and\n    run the image:\n    #+BEGIN_SRC sh\n      sudo docker run -d\\\n\t   --name graphite\\\n\t   --restart=always\\\n\t   -p 80:80\\\n\t   -p 2003-2004:2003-2004\\\n\t   -p 2023-2024:2023-2024\\\n\t   -p 8125:8125/udp\\\n\t   -p 8126:8126\\\n\t   samrayleung/graphite-statsd\n    #+END_SRC\n    Then open the dashboard in a browser:\n    [[http://localhost/dashboard][dashboard]]\n    or log in to the admin interface:\n    [[http://localhost/account/login]]\n    The default credentials are:\n    + username: root\n    + password: root\n*** Manual installation\n    You can of course set up graphite yourself. Once graphite is working, a few settings need to be changed:\n    + In ~/opt/graphite/webapp/content/js/composer_widgets.js~, change the\n      ~interval~ variable in the ~toggleAutoRefresh~ function from 60 to 1.\n    + Add the following to the ~storage-aggregation.conf~ configuration file:\n      #+BEGIN_SRC \n      [scrapy_min]\n     pattern = ^scrapy\\..*_min$\n     xFilesFactor = 0.1\n     aggregationMethod = min\n     [scrapy_max]\n     pattern = ^scrapy\\..*_max$\n     xFilesFactor = 0.1\n     aggregationMethod = max\n     [scrapy_sum]\n     pattern = ^scrapy\\..*_count$\n     xFilesFactor = 0.1\n     aggregationMethod = sum\n      #+END_SRC\n      The ~storage-aggregation.conf~ file is usually located in ~/opt/graphite/conf~.\n** Run\n    Once everything is ready, you can start the crawler.\n    Change into the ~jd~ directory and run:\n    #+BEGIN_SRC sh\n      scrapy crawl jindong\n    #+END_SRC\n** Notes\n   Note that this project contains two spiders. Product information must be crawled before product comments, because\n   the comments can only be crawled once the product information is available.\n** Proxy IPs\n   Product information can be crawled without proxy IPs, but crawling may stop working after a while,\n   so proxy IPs should be added. Save them to a text file in the form http://ip:port, one per line, then\n   set the path in ~settings.py~:\n   #+BEGIN_SRC python\n     PROXY_LIST = 'path/to/proxy_ip.txt'\n   #+END_SRC\n   and uncomment the following settings:\n   #+BEGIN_SRC python\n     RETRY_TIMES = 10\n     RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]\n\n     DOWNLOADER_MIDDLEWARES = {\n\t 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,\n\t 'scrapy_proxies.RandomProxy': 100,\n\t 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,\n     }\n     PROXY_MODE = 0\n   #+END_SRC\n   \n* Screenshots\n** graphite monitoring\n\n   [[./images/jd_comment_graphite1.png]]\n   \n   [[./images/jd_comment_graphite2.png]]\n** Comments\n   [[./images/jd_comment.png]]\n** Comment summary\n   [[./images/jd_comment_summary.png]]\n** Product information\n   [[./images/jd_parameters.png]]\n** Todo\n** Done Improve product deduplication\n   CLOSED: [2018-03-09 Fri 21:16]\n   Issue: fixes [[https://github.com/samrayleung/jd_spider/issues/6][duplicate products crawled]]\n** Todo Improve the crawl strategy\n** Todo Add new parsing strategies\n   Issue: fixes [[https://github.com/samrayleung/jd_spider/issues/10][parse book item error]]\n* ChangeLog\n** 2018-9-30\n    + Added Pipenv support\n    + Added py2 support\n    + Graphite is disabled by default\n    + Changed the spiders back to inheriting from ~RedisSpider~\n    + Fixed packages that Github flagged as potentially vulnerable\n    + JD's anti-crawling measures seem noticeably stronger; after a short test crawl the IP was banned quickly\n    + This is probably the last update; I will not put any more effort into this crawler project\n** 2018-4-4\n   + Made Graphite optional\n* References and acknowledgements\n  + [[https://github.com/noplay/scrapy-graphite]]\n  + [[https://github.com/gnemoug/distribute_crawler]]\n  + https://github.com/hopsoft/docker-graphite-statsd\n  + [[https://github.com/aivarsk/scrapy-proxies]]\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Framsayleung%2Fjd_spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Framsayleung%2Fjd_spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Framsayleung%2Fjd_spider/lists"}