{"id":20113725,"url":"https://github.com/brantou/crawler","last_synced_at":"2025-05-06T12:30:36.746Z","repository":{"id":56780523,"uuid":"98726813","full_name":"brantou/crawler","owner":"brantou","description":"Crawlers, HTTP proxies, and simulated login!","archived":false,"fork":false,"pushed_at":"2017-09-19T13:45:10.000Z","size":110,"stargazers_count":108,"open_issues_count":1,"forks_count":42,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-09T12:21:51.173Z","etag":null,"topics":["crawler","python","scrapy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brantou.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-07-29T10:57:46.000Z","updated_at":"2025-03-31T07:33:01.000Z","dependencies_parsed_at":"2022-08-16T02:50:12.398Z","dependency_job_id":null,"html_url":"https://github.com/brantou/crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brantou%2Fcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brantou%2Fcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brantou%2Fcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brantou%2Fcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brantou","download_url":"https://codeload.github.com/brantou/crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252683379,"owners_count":21788028,
"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","python","scrapy"],"created_at":"2024-11-13T18:25:34.790Z","updated_at":"2025-05-06T12:30:36.364Z","avatar_url":"https://github.com/brantou.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"#+TITLE: Crawler\n\n* Crawlers\n  :PROPERTIES:\n  :ID:       aef07119-226a-4c8a-b5db-bad3bd9372a2\n  :END:\n  Crawlers for internet job-recruitment sites:\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/lagou.py][拉勾 (Lagou)]]\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/zhipin.py][boss直聘 (BOSS Zhipin)]]\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/liepin.py][猎聘 (Liepin)]]\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/neitui.py][内推 (Neitui)]]\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/a100offer.py][100offer]]\n\n  Crawlers for job postings at well-known internet companies:\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/alibaba.py][阿里巴巴 (Alibaba)]]\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/baidu.py][百度 (Baidu)]]\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/meituan.py][美团 (Meituan)]]\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/didi.py][滴滴出行 (Didi Chuxing)]]\n\n  Crawlers for content providers:\n  + [[https://github.com/brantou/crawler/blob/master/jobs/jobs/spiders/zhihu.py][知乎 (Zhihu)]]\n\n* Crawler scaffolding\n  :PROPERTIES:\n  :ID:       81f440f1-d59b-43f6-ad35-049f8fd5a984\n  :END:\n** pipeline\n   :PROPERTIES:\n   :ID:       2a53dd96-b2a6-4ed4-832b-b18a19715587\n   :END:\n  There are currently only two *pipelines*: one stores data in MongoDB, and the other uses a set to detect duplicates; see the [[https://github.com/brantou/crawler/blob/master/jobs/jobs/pipelines.py][source code]].\n\n** middleware\n   :PROPERTIES:\n   :ID:       d6986286-b0b1-4374-b5ba-40ff87f30722\n   :END:\n  There are currently only two *middlewares*: one uses [[https://pypi.python.org/pypi/fake-useragent][fake_useragent]] to generate a random User-Agent, and the other applies a list of HTTP proxies; see the [[https://github.com/brantou/crawler/blob/master/jobs/jobs/middlewares.py][source code]].\n\n* Utilities\n  :PROPERTIES:\n  :ID:       36d63ee1-ce84-47cd-8358-3e2e56e2739d\n  :END:\n** Fetching free proxies\n   :PROPERTIES:\n   :ID:       eea5f4a1-c787-4e69-b444-1d8728f0bf1c\n   :END:\n   Fetches the free proxies published on proxy-listing sites and runs a preliminary check on them; see the [[https://github.com/brantou/crawler/blob/master/utils/free_proxy.py][source code]].\n   The proxy sites currently crawled are:\n   + [[http://www.kxdaili.com/dailiip.html][开心代理 (kxdaili)]]\n   + [[http://www.kxdaili.com/dailiip.html][米扑代理]]\n   + [[http://www.kxdaili.com/dailiip.html][西刺代理]]\n   + [[http://www.ip181.com/daili/1.html][ip181]]\n   + [[http://www.httpdaili.com/mfdl/][httpdaili]]\n   + [[http://www.66ip.cn/index.html][66ip]]\n   + [[http://www.data5u.com/][无忧代理 (data5u)]]\n   + [[http://www.kuaidaili.com/free/][快代理 (kuaidaili)]]\n   + [[http://www.ip002.net/free.html][ip002]]\n\n** Proxy validation\n   :PROPERTIES:\n   :ID:       a64313fa-985b-41e1-8f3a-33a37d99cd76\n   :END:\n   Uses [[https://httpbin.org/][httpbin]] to test whether a proxy is still working and what type it is.\n\n** IP information lookup\n   :PROPERTIES:\n   :ID:       309ed608-69c2-4cb6-bff2-f489711fbdbc\n   :END:\n   Uses [[http://api.geoiplookup.net/][geoiplookup]] to look up information about an IP address.\n\n   Example:\n   #+BEGIN_SRC python :session ip-info :results output pp :exports both\n     from utils.ip_info import get_ip_info\n\n     print(get_ip_info('8.8.8.8'))\n   #+END_SRC\n\n   #+RESULTS:\n   : {u'countrycode': u'US', u'ip': u'8.8.8.8', u'isp': u'Google', u'longitude': u'-97.822', u'countryname': u'United States', u'host': u'8.8.8.8', u'latitude': u'37.751'}\n\n** Translation function\n   :PROPERTIES:\n   :ID:       81779fb7-c9a7-4be6-b34b-0be8bb03216c\n   :END:\n   Currently this is only a thin wrapper; the supported services are:\n   + Youdao Dictionary (有道词典)\n     #+BEGIN_SRC python :session translate :results output pp :exports both\n       from 
utils.translate import translate\n       import json\n\n       print(translate(u'努力工作', dict_name='youdao')['translateResult'][0][0]['tgt'])\n       print(translate(u'hard work', dict_name='youdao', lfrom='en', lto='zh-CHS')['translateResult'][0][0]['tgt'])\n     #+END_SRC\n\n     #+RESULTS:\n     : To work hard\n     : 努力工作\n\n   + Baidu Translate (百度翻译)\n     #+BEGIN_SRC python :session translate :results output pp :exports both\n       from utils.translate import translate\n\n       print(translate(u'努力工作', dict_name='baidu')[0]['dst'])\n       print(translate(u'hard work', dict_name='baidu', lfrom='en', lto='zh-CHS')[0]['dst'])\n     #+END_SRC\n\n     #+RESULTS:\n     : Work hard\n     : 艰苦的工作\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrantou%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrantou%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrantou%2Fcrawler/lists"}