https://github.com/coding-dream/aspider
A spider run on Android Platform
https://github.com/coding-dream/aspider
crawler jsoup spider
Last synced: 8 months ago
JSON representation
A spider run on Android Platform
- Host: GitHub
- URL: https://github.com/coding-dream/aspider
- Owner: coding-dream
- Created: 2017-11-28T07:29:03.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2018-03-16T03:15:13.000Z (almost 8 years ago)
- Last Synced: 2025-03-01T07:27:35.244Z (12 months ago)
- Topics: crawler, jsoup, spider
- Language: Java
- Homepage:
- Size: 9.74 MB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# ASpider
## 使用介绍
构建一个可运行在Android平台的爬虫框架,目前仍在完善。
特性:
1. 轻量级注解建表存储(类似Hibernate),c3p0连接池
2. 网页解析:Jsoup、Xpath、正则
3. 代理请求(普通代理池)
4. QueueScheduler、PriorityScheduler、BDBScheduler等方式的调度器
5. 支持GET、POST类型的爬虫
6. 请求参数伪造,逻辑稍复杂的爬虫可传递参数。
7. 内置下载器:HTTPConnection、Okhttp(可定制)
下面是一个爬取全网链接的基础爬虫案例:
```
public class SimpleSpider {
public static String TAG = SimpleSpider.class.getSimpleName();
public static final String REGEX_URL = "(http|https)+://[^\\s|\\?|&|'|\"]+(com|cn|org|net)+?";
public static final Pattern pattern = Pattern.compile(REGEX_URL);
public static void start(String startUrl) {
ASpider.create()
.pageProcessor(new PageProcessor() {
@Override
public void process(Page page) {
// 输出项(默认ConsolePipeline输出到终端)
page.putField("html", page.getRawText());
// 匹配网页链接
String newUrl = RegexUtils.get(REGEX_URL).selectSingle(html, 0);
// 添加新链接到队列
page.addTargetRequestsNoReferer(newUrl);
}
})
.thread(20)
.urls(startUrl)
.runAsync();
}
public static void main(String[] args) {
start("http://www.xx.com");
}
}
```
项目封装了一个方便存储到MySQL的工具类,只需继承自DatabaseHelper:
步骤一:编写实体Bean
```
@Table
public class JavaPdf {
@Column(value = "id", columnDefinition = "int primary key auto_increment")
private int id;
@Column
private String url;
@Column
private String title;
@Column(value = "panUrl", columnDefinition = "varchar(255) not null unique")
private String panUrl;
@Column(value = "passwd")
private String passwd;
public JavaPdf(String title, String panUrl, String passwd, String url) {
this.title = title;
this.panUrl = panUrl;
this.passwd = passwd;
this.url = url;
}
}
```
步骤二:数据库建表
```
JavaPdfSpider javaPdfSpider = new JavaPdfSpider();
javaPdfSpider.createTable(JavaPdf.class);
```
步骤三:存储数据
```
javaPdfSpider.insert(new JavaPdf(title, panUrl, passwd, page.getUrl()));
```
希望详细了解的请看源码,仅此粗略介绍!
## 协议
GPL(暂定)