https://github.com/zenoyang/webcrawler

crawler scrapy spider web-crawler


A collection of web crawler code

README

# WebCrawler

## 100offer_crawler
Collects job postings from 100offer.

## caike_crawler
Collects career information from Caike (才客网).

## Ganji_JN.py
Crawls rental listings for Jinan from Ganji: http://jn.ganji.com/fang1/

## Scrapy/xici
Uses Scrapy to crawl proxy IPs from Xici and store them in MongoDB. The IPs have not been validated yet. http://www.xicidaili.com/nn/

## Scrapy/zhihu
Uses Scrapy to crawl all Zhihu user profiles and store them in MongoDB. The crawler's IP was banned; this is still unresolved.

## Scrapy/doubanBook
Uses Scrapy to crawl Douban book information and save it in CSV format. https://book.douban.com/tag/%E5%8E%86%E5%8F%B2

## huaban
Crawls images from Huaban, whose pages load content asynchronously. http://huaban.com/

## shixiseng
Crawls Python internship listings from Shixiseng and saves them in XLS format. http://www.shixiseng.com/

## ss
Uses a crawler to get around internet access restrictions (科学上网). http://free.ishadow.online/ http://h6v6.com/

## 读写文档 (reading and writing documents)
Reading and writing files in CSV, DOC, PDF, and TXT formats.
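
As a rough illustration of the CSV and TXT cases only, the sketch below uses the Python standard library. The function names are hypothetical and not taken from the repository; DOC and PDF would need third-party libraries and are left out.

```python
import csv
from pathlib import Path


def write_rows(path, fieldnames, rows):
    """Write a list of dicts to a CSV file, with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read a CSV file back into a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def write_text(path, text):
    """Write a string to a plain-text file as UTF-8."""
    Path(path).write_text(text, encoding="utf-8")


def read_text(path):
    """Read a UTF-8 plain-text file into a string."""
    return Path(path).read_text(encoding="utf-8")
```

Passing `encoding="utf-8"` explicitly matters here because much of the crawled text is Chinese and the platform default encoding may differ.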

## send_qq_email
Sends email through QQ Mail with Python.
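
A minimal sketch of what sending mail through QQ Mail could look like with the standard library. The host `smtp.qq.com`, port 465 over SSL, and the use of an authorization code instead of the account password are assumptions about the QQ Mail setup, not details confirmed by this README; the function names are hypothetical.

```python
import smtplib
from email.message import EmailMessage

# Assumed QQ Mail SMTP endpoint (SSL); check your account settings.
SMTP_HOST = "smtp.qq.com"
SMTP_PORT = 465


def build_message(sender, recipient, subject, body):
    """Build a plain-text email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_message(msg, username, auth_code):
    """Send a message over SMTP with SSL.

    auth_code is assumed to be the authorization code issued in the
    mailbox settings rather than the login password.
    """
    with smtplib.SMTP_SSL(SMTP_HOST, SMTP_PORT, timeout=10) as server:
        server.login(username, auth_code)
        server.send_message(msg)
```

Building the message separately from sending it keeps the first half testable without network access or credentials.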

## toutiao
Analyzes Ajax requests to crawl street-photography (街拍) images from Toutiao. http://www.toutiao.com/

## jupyter
Installing and starting Jupyter.

## craw_bin_tdp
Crawls all TDPs and executables from the RoboCup 2D World Cups of recent years. http://chaosscripting.net/files/competitions/RoboCup/WorldCup/

## meizitu
Crawls all images from Meizitu. http://www.mzitu.com/

## baike_spider
Crawls 1,000 Baidu Baike entries. http://baike.baidu.com/view/21087.htm
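
Stopping a crawl at a fixed number of pages usually comes down to tracking which URLs are pending and which have been seen. The sketch below is one way to do that, not the repository's actual design; `UrlFrontier`, `crawl`, and the `fetch_links` callback are hypothetical names, and the callback stands in for the download-and-parse step.

```python
from collections import deque


class UrlFrontier:
    """Tracks URLs waiting to be crawled and URLs already seen."""

    def __init__(self, limit=1000):
        self.limit = limit        # maximum number of URLs to hand out
        self.handed_out = 0
        self._queue = deque()     # pending URLs, first in first out
        self._seen = set()        # every URL ever queued, to skip duplicates

    def add(self, url):
        if url and url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def add_many(self, urls):
        for url in urls:
            self.add(url)

    def has_next(self):
        return bool(self._queue) and self.handed_out < self.limit

    def next(self):
        self.handed_out += 1
        return self._queue.popleft()


def crawl(seed, fetch_links, limit=1000):
    """Breadth-first crawl from seed, visiting at most `limit` URLs.

    fetch_links(url) must return the links found on that page.
    """
    frontier = UrlFrontier(limit)
    frontier.add(seed)
    visited = []
    while frontier.has_next():
        url = frontier.next()
        visited.append(url)
        frontier.add_many(fetch_links(url))
    return visited
```

Because `fetch_links` is injected, the loop can be exercised against an in-memory link graph before pointing it at a real site.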

## login_weibo_cn
Logs in to the mobile version of Sina Weibo. https://weibo.cn/login/

## 静谧
Using cookies, basic usage of the urllib library, and handling URLError exceptions.
Crawling Baidu Tieba posts and Qiushibaike jokes.
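
The three urllib topics listed here fit together in a few lines. This is a sketch written for Python 3's `urllib.request` (the repository's own code may target a different Python version), and `make_opener` and `fetch` are hypothetical helper names.

```python
import urllib.error
import urllib.request
from http.cookiejar import CookieJar


def make_opener():
    """Build an opener that stores cookies from responses and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch(opener, url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {e.code} for {url}")
    except urllib.error.URLError as e:
        # Covers DNS failures, refused connections, unknown schemes, etc.
        print(f"Failed to reach {url}: {e.reason}")
    return None
```

Reusing the same opener across requests is what keeps a login session alive, since the cookie jar is attached to the opener rather than to a single request.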

## 爬虫隐藏 (disguising a crawler)
Several simple ways to make page requests look like they come from a real browser.
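
Two common techniques of this kind are sending browser-like request headers and pausing a random interval between requests. The sketch below assumes those two; the README does not say which methods the directory actually covers, and the User-Agent strings and function names are illustrative.

```python
import random
import time
import urllib.request

# Illustrative User-Agent strings; any current browser's value would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def browser_like_request(url, referer=None):
    """Build a Request carrying headers a browser would normally send."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    }
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def polite_pause(min_s=1.0, max_s=3.0):
    """Sleep for a random interval so requests are not evenly spaced."""
    time.sleep(random.uniform(min_s, max_s))
```

The returned `Request` can be passed straight to `urllib.request.urlopen` or to a custom opener.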

## 翻译脚本 (translation script)
A translation script built on Youdao. http://fanyi.youdao.com/

## 使用proxy (using a proxy)
Using and checking proxies.
http://www.whatismyip.com.tw
http://www.ip138.com
http://www.ip.cn/
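
Checking a proxy typically means routing a request to an IP-echo page such as the ones listed above and seeing whether it succeeds. A minimal sketch with `urllib.request.ProxyHandler` follows; the function names, the default test URL, and the "HTTP 200 means usable" rule are assumptions for illustration.

```python
import http.client
import urllib.request


def make_proxy_opener(proxy):
    """Build an opener that sends HTTP and HTTPS traffic through `proxy`.

    `proxy` is a "host:port" string such as "127.0.0.1:8080".
    """
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    return urllib.request.build_opener(handler)


def proxy_works(proxy, test_url="http://www.ip138.com", timeout=5):
    """Return True if test_url answers with HTTP 200 through the proxy."""
    opener = make_proxy_opener(proxy)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts derive from OSError;
        # malformed proxy replies raise HTTPException.
        return False
```

A stricter check would also parse the echoed IP from the response and compare it with the proxy's address, which depends on the format of the page used.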

## 数据库存储 (database storage)
Connecting to SQL Server and MySQL.

## 图片的存储 (image storage)
Downloading images.
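
One possible shape for an image download helper, using only the standard library. `filename_from_url` and `save_image` are hypothetical names, and the `images` output directory is an assumed default.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default="image.jpg"):
    """Use the last path segment of the URL as the file name."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def save_image(url, out_dir="images", timeout=10):
    """Download url into out_dir and return the saved file's path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename_from_url(url))
    with urllib.request.urlopen(url, timeout=timeout) as resp, \
            open(path, "wb") as f:
        # Copy in chunks so large images are not held fully in memory.
        shutil.copyfileobj(resp, f)
    return path
```

Images must be written in binary mode (`"wb"`); opening the file in text mode would corrupt the data.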

## 网页下载器 (web page downloader)
Using urllib.