Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/bpazy/dbspider

Douban crawler



Host: GitHub
URL: https://github.com/bpazy/dbspider
Owner: Bpazy
Created: 2016-12-05T16:51:34.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2016-12-07T05:29:44.000Z (about 8 years ago)
Last Synced: 2024-10-30T01:39:05.349Z (2 months ago)
Topics: douban, douban-movie, java, spider
Language: Java
Homepage:
Size: 72.3 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

dbSpider (Douban crawler)
================

Redis is used as a cache, which makes it possible to crawl data with a cluster of workers.

The crawl results are stored in MySQL.
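To illustrate why a shared cache enables cluster crawling, here is a minimal in-memory sketch of a crawl frontier. `CrawlFrontier` is a hypothetical name, not the project's `QueueAndRedis`, and the assumption is that in a cluster the pending queue and the seen-set would live in Redis (for example as a list and a set) so that every worker shares them.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** In-memory stand-in for a shared crawl frontier (hypothetical sketch). */
final class CrawlFrontier {
    // Assumption: in a cluster these two structures would be kept in Redis.
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Enqueues the URL only if it has not been seen before; returns true if enqueued. */
    synchronized boolean offer(String url) {
        if (!seen.add(url)) {
            return false; // already queued or crawled, skip it
        }
        pending.add(url);
        return true;
    }

    /** Returns the next URL to crawl, or null when nothing is pending. */
    synchronized String poll() {
        return pending.poll();
    }

    /** Number of URLs still waiting to be crawled. */
    synchronized int size() {
        return pending.size();
    }
}
```

Because `offer` rejects URLs that were already seen, several workers can feed related links into the same frontier without crawling a page twice.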

### How the Douban crawler works

Every request carries a different Cookie whose value is `bid=[random 11-character string]`. For example:

```
HttpRequest
        .get("https://book.douban.com/subject/26864983/")
        .header("Cookie", "bid=aaaaaaaaaaa")
        .body();
```
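The random `bid` value could be generated along these lines. This is a minimal sketch: the class name `BidCookie` and the alphanumeric alphabet are assumptions, since the README only specifies a random string of length 11.

```java
import java.security.SecureRandom;

/** Sketch: builds a "bid=<11 random characters>" Cookie header value. */
final class BidCookie {
    // Assumed alphabet; the README only says "random 11-character string".
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final int BID_LENGTH = 11;
    private static final SecureRandom RANDOM = new SecureRandom();

    private BidCookie() {}

    /** Returns a fresh random 11-character bid value. */
    static String randomBid() {
        StringBuilder sb = new StringBuilder(BID_LENGTH);
        for (int i = 0; i < BID_LENGTH; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    /** Returns the full Cookie header value in the form "bid=" plus 11 characters. */
    static String cookieHeader() {
        return "bid=" + randomBid();
    }

    public static void main(String[] args) {
        // Each call produces a new value to attach to the next request.
        System.out.println(cookieHeader());
    }
}
```

The result of `BidCookie.cookieHeader()` would take the place of the fixed `"bid=aaaaaaaaaaa"` string in the request above.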

### Development guide

Run the following in the `main` method:

```
Spider spider = new Spider("https://movie.douban.com/subject/26683290/") {
    @Override
    protected SpiderCore spiderCore(String target, QueueAndRedis queue) {
        return new MovieSpiderCore(target, queue);
    }
};
spider.start();
```

Here `MovieSpiderCore` extends `SpiderCore` and implements its key methods:

```
/**
 * Related URLs.
 *
 * @return a selector, e.g. "div#recommendations div dl dt a" for the related movies on a movie page
 */
protected abstract String relatedUrlSelect();

/**
 * Checks whether the model is valid.
 *
 * @param model the model
 * @return true if the model is valid
 */
protected abstract boolean isValid(T model);

/**
 * Builds a model from a Document object.
 *
 * @param doc the Document object
 * @return the model object
 */
protected abstract T getModel(Document doc);
```
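To show how the three hooks fit together, here is a simplified, self-contained sketch. Everything in it is hypothetical: `SketchCore` stands in for `SpiderCore`, a plain `Map` replaces the `Document` parameter so that no HTML parser is needed, and the `process` method is only an assumption about how a base class might call the hooks.

```java
import java.util.Map;

/** Simplified stand-in for SpiderCore: same three hooks, but pages are plain maps. */
abstract class SketchCore<T> {
    protected abstract String relatedUrlSelect();

    protected abstract boolean isValid(T model);

    protected abstract T getModel(Map<String, String> page);

    /** Assumed template method: build the model, keep it only if it is valid. */
    final T process(Map<String, String> page) {
        T model = getModel(page);
        return isValid(model) ? model : null;
    }
}

/** Hypothetical movie model that holds only a title. */
final class Movie {
    final String title;

    Movie(String title) {
        this.title = title;
    }
}

/** Example subclass in the spirit of MovieSpiderCore. */
final class MovieSketchCore extends SketchCore<Movie> {
    @Override
    protected String relatedUrlSelect() {
        // Selector quoted in the README for related movies on a movie page.
        return "div#recommendations div dl dt a";
    }

    @Override
    protected boolean isValid(Movie model) {
        // A model without a title is treated as invalid in this sketch.
        return model != null && model.title != null && !model.title.isEmpty();
    }

    @Override
    protected Movie getModel(Map<String, String> page) {
        return new Movie(page.get("title"));
    }
}
```

In this sketch, `new MovieSketchCore().process(Map.of("title", "Example"))` returns a `Movie`, while a page without a title yields `null`.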