https://github.com/bpazy/dbspider
Douban crawler
Douban crawler
- Host: GitHub
- URL: https://github.com/bpazy/dbspider
- Owner: Bpazy
- Created: 2016-12-05T16:51:34.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2016-12-07T05:29:44.000Z (about 8 years ago)
- Last Synced: 2024-10-30T01:39:05.349Z (2 months ago)
- Topics: douban, douban-movie, java, spider
- Language: Java
- Homepage:
- Size: 72.3 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
dbSpider (Douban crawler)
================
Uses Redis as a cache, which makes it possible to crawl with a cluster of instances. The crawled results are stored in MySQL.
### How the Douban crawler works
Each request carries a different Cookie whose content is `bid=[random 11-character string]`. For example:
```
HttpRequest
        .get("https://book.douban.com/subject/26864983/")
        .header("Cookie", "bid=aaaaaaaaaaa")
        .body();
```

### Development guide
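
As a starting point, the per-request `bid` cookie described in the previous section could be generated roughly as follows. This is a minimal sketch, not the project's own code: the class name `BidCookie`, the method `randomBidCookie`, and the alphanumeric alphabet are assumptions, since the README only says the value is a random string of length 11.

```java
import java.security.SecureRandom;

class BidCookie {
    // Assumed alphabet; the README only specifies "random, 11 characters long".
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final int BID_LENGTH = 11;
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns a Cookie header value of the form "bid=" followed by 11 random characters. */
    static String randomBidCookie() {
        StringBuilder sb = new StringBuilder("bid=");
        for (int i = 0; i < BID_LENGTH; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints a fresh value on every call, e.g. for use as the "Cookie" header.
        System.out.println(randomBidCookie());
    }
}
```

The returned string would take the place of the fixed `"bid=aaaaaaaaaaa"` value in the request example, so that every request presents a new `bid`.
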
Run the following in the `main` method:
```
Spider spider = new Spider("https://movie.douban.com/subject/26683290/") {
    @Override
    protected SpiderCore spiderCore(String target, QueueAndRedis queue) {
        return new MovieSpiderCore(target, queue);
    }
};
spider.start();
```
`MovieSpiderCore` extends `SpiderCore` and implements its key methods:
```
/**
 * Related URLs.
 *
 * @return a selector, e.g. "div#recommendations div dl dt a" for the related movies on a movie page
 */
protected abstract String relatedUrlSelect();

/**
 * Checks whether the model is valid.
 *
 * @param model the model
 * @return true if the model is valid
 */
protected abstract boolean isValid(T model);

/**
 * Builds the model from a Document object.
 *
 * @param doc the Document object
 * @return the model object
 */
protected abstract T getModel(Document doc);
```
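
To illustrate how the three methods might fit together, here is a self-contained sketch. It does not use the project's classes: `CoreSketch`, `PageDoc`, `Movie`, `MovieCoreSketch`, the `process` method, and the `"h1"` title selector are all hypothetical stand-ins, and the real `SpiderCore` takes a target and a queue and works on a `Document`. The flow shown (build the model, then keep it only if it is valid) is an assumption about how the methods are meant to be used.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a parsed page: maps a selector to the text it matches. */
class PageDoc {
    private final Map<String, String> textBySelector = new HashMap<>();

    PageDoc with(String selector, String text) {
        textBySelector.put(selector, text);
        return this;
    }

    String text(String selector) {
        return textBySelector.getOrDefault(selector, "");
    }
}

/** Hypothetical model holding only a title. */
class Movie {
    final String title;

    Movie(String title) {
        this.title = title;
    }
}

/** Simplified stand-in for SpiderCore exposing the same three abstract methods. */
abstract class CoreSketch<T> {
    protected abstract String relatedUrlSelect();

    protected abstract boolean isValid(T model);

    protected abstract T getModel(PageDoc doc);

    /** Assumed flow: build the model and return it only if it passes validation. */
    T process(PageDoc doc) {
        T model = getModel(doc);
        return isValid(model) ? model : null;
    }
}

class MovieCoreSketch extends CoreSketch<Movie> {
    @Override
    protected String relatedUrlSelect() {
        // Selector quoted in the README for related movies on a movie page.
        return "div#recommendations div dl dt a";
    }

    @Override
    protected boolean isValid(Movie model) {
        // A movie without a title is treated as invalid in this sketch.
        return model != null && !model.title.isEmpty();
    }

    @Override
    protected Movie getModel(PageDoc doc) {
        // "h1" is an assumed selector for the title.
        return new Movie(doc.text("h1").trim());
    }

    public static void main(String[] args) {
        Movie movie = new MovieCoreSketch().process(new PageDoc().with("h1", " Arrival "));
        System.out.println(movie.title);
    }
}
```

In the actual project, a subclass like `MovieSpiderCore` would be returned from the `spiderCore` override shown in the `main` method example.
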