Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hellokaton/elves
🎊 Design and implementation of a lightweight crawler framework.
163news douban-movie elves scrapy spider
- Host: GitHub
- URL: https://github.com/hellokaton/elves
- Owner: hellokaton
- License: mit
- Created: 2018-01-11T13:41:16.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-01-24T09:22:37.000Z (almost 7 years ago)
- Last Synced: 2025-01-08T12:06:51.180Z (10 days ago)
- Topics: 163news, douban-movie, elves, scrapy, spider
- Language: Java
- Homepage:
- Size: 544 KB
- Stars: 317
- Watchers: 23
- Forks: 86
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Elves
The design and implementation of a lightweight crawler framework; see the [blog post analysis](https://blog.biezhi.me/2018/01/design-and-implement-a-crawler-framework.html).
[![](https://img.shields.io/travis/biezhi/elves.svg)](https://travis-ci.org/biezhi/elves)
[![](https://img.shields.io/maven-central/v/io.github.biezhi/elves.svg)](https://mvnrepository.com/artifact/io.github.biezhi/elves)
[![@biezhi on zhihu](https://img.shields.io/badge/zhihu-%40biezhi-red.svg)](https://www.zhihu.com/people/biezhi)
[![](https://img.shields.io/badge/license-MIT-FF0080.svg)](https://github.com/biezhi/elves/blob/master/LICENSE)
[![](https://img.shields.io/github/followers/biezhi.svg?style=social&label=Follow%20Me)](https://github.com/biezhi)

## Features
- Event-driven
- Easy to customize
- Multi-threaded execution
- `CSS` selector and `XPath` support

**Maven** coordinates
```xml
<dependency>
    <groupId>io.github.biezhi</groupId>
    <artifactId>elves</artifactId>
    <version>0.0.2</version>
</dependency>
```
If you want to run the project source locally, make sure you are on a `Java8` environment and have the [lombok](https://projectlombok.org/) plugin installed.
## Architecture Diagram
## Call Flow Diagram
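
As a rough orientation, the flow described by the steps in the next section (start URLs, a parse step that returns an item plus follow-up requests, and pipelines that consume the items) can be sketched in plain Java. This is a hypothetical, single-threaded illustration and not the Elves implementation: `CrawlFlowSketch`, `ParseResult`, `crawl`, and `runDemo` are names invented here, and an in-memory map stands in for real HTTP downloads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical sketch of a crawl loop; none of these types are Elves classes. */
final class CrawlFlowSketch {

    /** What one parse step produces: an extracted item plus URLs to visit next. */
    static final class ParseResult {
        final String item;
        final List<String> nextUrls;

        ParseResult(String item, List<String> nextUrls) {
            this.item = item;
            this.nextUrls = nextUrls;
        }
    }

    /**
     * Drains a FIFO queue of URLs: download each page, parse it, hand the item
     * to every pipeline, and enqueue follow-up URLs that were not seen before.
     */
    static void crawl(List<String> startUrls,
                      Function<String, String> downloader,
                      Function<String, ParseResult> parser,
                      List<Consumer<String>> pipelines) {
        Queue<String> pending = new ArrayDeque<>(startUrls);
        Set<String> seen = new HashSet<>(startUrls);
        while (!pending.isEmpty()) {
            String body = downloader.apply(pending.poll());
            if (body == null) {
                continue; // nothing was downloaded for this URL
            }
            ParseResult result = parser.apply(body);
            for (Consumer<String> pipeline : pipelines) {
                pipeline.accept(result.item);
            }
            for (String next : result.nextUrls) {
                if (seen.add(next)) {
                    pending.add(next);
                }
            }
        }
    }

    /** Runs the loop over three in-memory pages and returns what the pipeline collected. */
    static List<String> runDemo() {
        // Fake site: each body is "title|nextUrl" (nextUrl is empty on the last page).
        Map<String, String> site = new HashMap<>();
        site.put("/page1", "First|/page2");
        site.put("/page2", "Second|/page3");
        site.put("/page3", "Third|");

        List<String> collected = new ArrayList<>();
        List<Consumer<String>> pipelines = new ArrayList<>();
        pipelines.add(collected::add);

        Function<String, ParseResult> parser = body -> {
            String[] parts = body.split("\\|", -1);
            List<String> next = parts[1].isEmpty()
                    ? Collections.<String>emptyList()
                    : Collections.singletonList(parts[1]);
            return new ParseResult(parts[0], next);
        };

        crawl(Collections.singletonList("/page1"), site::get, parser, pipelines);
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [First, Second, Third]
    }
}
```

The sketch processes one URL at a time to keep the control flow readable; the framework itself lists multi-threaded execution among its features, which this example does not attempt to reproduce.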
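## Quick Start
Building a crawler takes the following steps:
1. Write a spider class that extends `Spider`
2. Set the list of URLs to crawl
3. Implement the `parse` method of `Spider`
4. Add a `Pipeline` to process the data filtered by `parse`

For example: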
```java
public class DoubanSpider extends Spider {

    public DoubanSpider(String name) {
        super(name);
        this.startUrls(
            "https://movie.douban.com/tag/爱情",
            "https://movie.douban.com/tag/喜剧",
            "https://movie.douban.com/tag/动画",
            "https://movie.douban.com/tag/动作",
            "https://movie.douban.com/tag/史诗",
            "https://movie.douban.com/tag/犯罪");
    }

    @Override
    public void onStart(Config config) {
        // `log` is assumed to be a logger available to this class (for example via lombok)
        this.addPipeline((Pipeline<List<String>>) (item, request) -> log.info("Saved to file: {}", item));
    }

    public Result parse(Response response) {
        Result<List<String>> result = new Result<>();
        Elements elements = response.body().css("#content table .pl2 a");

        // Collect the text of every matched link as a movie title
        List<String> titles = elements.stream().map(Element::text).collect(Collectors.toList());
        result.setItem(titles);

        // Get the next page URL
        Elements nextEl = response.body().css("#content > div > div.article > div.paginator > span.next > a");
        if (null != nextEl && nextEl.size() > 0) {
            String nextPageUrl = nextEl.get(0).attr("href");
            Request nextReq = this.makeRequest(nextPageUrl, this::parse);
            result.addRequest(nextReq);
        }
        return result;
    }

}

public static void main(String[] args) {
    DoubanSpider doubanSpider = new DoubanSpider("Douban Movies");
    Elves.me(doubanSpider, Config.me()).start();
}
```

## Spider Examples
- [Douban Movies](https://github.com/biezhi/elves/blob/master/src/test/java/io/github/biezhi/elves/examples/DoubanExample.java)
- [NetEase News](https://github.com/biezhi/elves/blob/master/src/test/java/io/github/biezhi/elves/examples/News163Example.java)
- [Qiushibaike](https://github.com/biezhi/elves/blob/master/src/test/java/io/github/biezhi/elves/examples/QiubaiExample.java)
- [Meizitu](https://github.com/biezhi/elves/blob/master/src/test/java/io/github/biezhi/elves/examples/MeiziExample.java)

## License
[MIT](https://github.com/biezhi/elves/blob/master/LICENSE)