https://github.com/hellokaton/elves

🎊 Design and implementation of a lightweight crawler framework.

# Elves

The design and implementation of a lightweight crawler framework. See the [blog post](https://blog.biezhi.me/2018/01/design-and-implement-a-crawler-framework.html) for an analysis.

[![](https://img.shields.io/travis/biezhi/elves.svg)](https://travis-ci.org/biezhi/elves)
[![](https://img.shields.io/maven-central/v/io.github.biezhi/elves.svg)](https://mvnrepository.com/artifact/io.github.biezhi/elves)
[![@biezhi on zhihu](https://img.shields.io/badge/zhihu-%40biezhi-red.svg)](https://www.zhihu.com/people/biezhi)
[![](https://img.shields.io/badge/license-MIT-FF0080.svg)](https://github.com/biezhi/elves/blob/master/LICENSE)
[![](https://img.shields.io/github/followers/biezhi.svg?style=social&label=Follow%20Me)](https://github.com/biezhi)

## Features

- Event-driven
- Easy to customize
- Multi-threaded execution
- `CSS` selector and `XPath` support

**Maven** coordinates

```xml
<dependency>
    <groupId>io.github.biezhi</groupId>
    <artifactId>elves</artifactId>
    <version>0.0.2</version>
</dependency>
```

If you want to run the project source locally, make sure you are on `Java8` and have the [lombok](https://projectlombok.org/) plugin installed.

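## Architecture diagram

## Call flow diagram

## Quick start

Building a crawler takes the following steps:

1. Write a crawler class that extends `Spider`
2. Set the list of URLs to crawl
3. Implement the `parse` method of `Spider`
4. Add a `Pipeline` to process the data filtered by `parse`

For example: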

```java
// Imports omitted for brevity: the Elves classes, Element and Elements (assumed to be
// the jsoup types), java.util.List, java.util.stream.Collectors, lombok.extern.slf4j.Slf4j.
@Slf4j // Lombok generates the `log` field used in onStart
public class DoubanSpider extends Spider {

    public DoubanSpider(String name) {
        super(name);
        // Seed URLs: one Douban tag page per genre
        this.startUrls(
            "https://movie.douban.com/tag/爱情",
            "https://movie.douban.com/tag/喜剧",
            "https://movie.douban.com/tag/动画",
            "https://movie.douban.com/tag/动作",
            "https://movie.douban.com/tag/史诗",
            "https://movie.douban.com/tag/犯罪");
    }

    @Override
    public void onStart(Config config) {
        // Register a pipeline that receives every item produced by parse
        this.addPipeline((Pipeline<List<String>>) (item, request) -> log.info("Saving to file: {}", item));
    }

    public Result parse(Response response) {
        Result<List<String>> result = new Result<>();
        Elements elements = response.body().css("#content table .pl2 a");

        // Collect the movie titles on the current page
        List<String> titles = elements.stream().map(Element::text).collect(Collectors.toList());
        result.setItem(titles);

        // Get the URL of the next page
        Elements nextEl = response.body().css("#content > div > div.article > div.paginator > span.next > a");
        if (null != nextEl && nextEl.size() > 0) {
            String nextPageUrl = nextEl.get(0).attr("href");
            // Schedule the next page and parse it with this same method
            Request nextReq = this.makeRequest(nextPageUrl, this::parse);
            result.addRequest(nextReq);
        }
        return result;
    }

}

// Entry point: place this method in any class
public static void main(String[] args) {
    DoubanSpider doubanSpider = new DoubanSpider("Douban Movies");
    Elves.me(doubanSpider, Config.me()).start();
}
```
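
The example above leans on one idea: `parse` returns both an extracted item and any follow-up requests, and the framework feeds items to pipelines while scheduling the new requests. The following is a minimal, self-contained sketch of that loop in plain Java. It is not the Elves engine: `ParseResult`, `MiniCrawler`, and `MiniCrawlerDemo` are hypothetical names, the "download" step is replaced by a function from URL to result, the loop is single-threaded, and the URL deduplication is an added assumption rather than a statement about Elves.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for a parse result: one item plus follow-up URLs.
final class ParseResult<T> {
    final T item;
    final List<String> nextUrls = new ArrayList<>();

    ParseResult(T item) {
        this.item = item;
    }

    ParseResult<T> addRequest(String url) {
        nextUrls.add(url);
        return this;
    }
}

// Hypothetical single-threaded crawl loop (illustration only).
final class MiniCrawler<T> {
    private final Function<String, ParseResult<T>> parser;
    private final List<Consumer<T>> pipelines = new ArrayList<>();

    MiniCrawler(Function<String, ParseResult<T>> parser) {
        this.parser = parser;
    }

    MiniCrawler<T> addPipeline(Consumer<T> pipeline) {
        pipelines.add(pipeline);
        return this;
    }

    // Processes URLs in FIFO order and returns them in visit order.
    List<String> start(String... startUrls) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        for (String url : startUrls) {
            if (seen.add(url)) queue.add(url);
        }
        while (!queue.isEmpty()) {
            String url = queue.poll();
            visited.add(url);
            ParseResult<T> result = parser.apply(url);
            // Hand the extracted item to every registered pipeline.
            if (result.item != null) {
                for (Consumer<T> pipeline : pipelines) pipeline.accept(result.item);
            }
            // Schedule follow-up URLs that have not been queued before.
            for (String next : result.nextUrls) {
                if (seen.add(next)) queue.add(next);
            }
        }
        return visited;
    }
}

final class MiniCrawlerDemo {
    public static void main(String[] args) {
        // Fake three-page site: each page links to the next one.
        MiniCrawler<String> crawler = new MiniCrawler<String>(url -> {
            ParseResult<String> result = new ParseResult<>("title of " + url);
            if (url.equals("page1")) result.addRequest("page2");
            if (url.equals("page2")) result.addRequest("page3");
            return result;
        }).addPipeline(item -> System.out.println("Saving: " + item));

        crawler.start("page1");
    }
}
```

In this sketch the pipeline is just a `Consumer`, which mirrors the two-step shape of the real example: `parse` decides what to keep and where to go next, and the pipeline decides what to do with the kept data.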

## Crawler examples

- [Douban Movies](https://github.com/biezhi/elves/blob/master/src/test/java/io/github/biezhi/elves/examples/DoubanExample.java)
- [NetEase News](https://github.com/biezhi/elves/blob/master/src/test/java/io/github/biezhi/elves/examples/News163Example.java)
- [Qiushibaike](https://github.com/biezhi/elves/blob/master/src/test/java/io/github/biezhi/elves/examples/QiubaiExample.java)
- [Meizitu](https://github.com/biezhi/elves/blob/master/src/test/java/io/github/biezhi/elves/examples/MeiziExample.java)

## License

[MIT](https://github.com/biezhi/elves/blob/master/LICENSE)