https://github.com/0xcaffebabe/spider
A Java web crawler project
- Host: GitHub
- URL: https://github.com/0xcaffebabe/spider
- Owner: 0xcaffebabe
- Created: 2019-01-10T02:39:59.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-09-01T23:01:30.000Z (over 2 years ago)
- Last Synced: 2024-10-16T00:43:44.995Z (3 months ago)
- Language: Java
- Homepage:
- Size: 73.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# Spider
2019-1-10
Project started
#
Notes:
The project targets JDK 11 and uses some JDK 11 features, so running it on an older JDK may cause errors.
#
2019-1-15
A simple application:
```java
private static Spider spider = new Spider();
private static Map map = new HashMap<>();

public static void main(String[] args) {
    // Log the URL of any request that times out.
    spider.setConnectionTimeOutEvent((spider1, request) -> {
        System.out.println(request.getUrl() + " timed out");
    });
    Request request = new Request()
            .url("http://dytt8.net/")
            .method(RequestMethods.GET);
    request(request);
}

public static void request(Request request) {
    spider.request(request, response -> {
        // Print the links found inside the word-wrapped table cells.
        response.toTextResponse("gb2312")
                .css("td[style=WORD-WRAP: break-word] a")
                .forEach(e -> {
                    System.out.println(e.attr("href"));
                });
        // Follow every link on the page, prefixing relative paths with the site root.
        // Note: there is no visited-URL check, so this keeps following links indefinitely.
        response.toTextResponse("gb2312")
                .css("a")
                .forEach(e -> {
                    String url = null;
                    if (!e.attr("href").startsWith("http://")) {
                        url = "http://dytt8.net" + e.attr("href");
                    } else {
                        url = e.attr("href");
                    }
                    Request subRequest = new Request()
                            .url(url);
                    request(subRequest);
                });
    });
}
```
![avatar](https://ismy1.oss-cn-qingdao.aliyuncs.com/gif8.gif)
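The example above builds absolute URLs by string concatenation and follows every link it sees. One possible refinement, shown here only as a sketch, is to resolve links with `java.net.URI` and remember which URLs were already queued. The class `LinkFrontierSketch` and its `accept` method are hypothetical names for illustration and are not part of this project.

```java
import java.net.URI;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the Spider project.
class LinkFrontierSketch {
    // Thread-safe set of URLs that have already been accepted.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Resolves href against baseUrl and returns the absolute URL
    // only if it is http(s) and has not been seen before.
    Optional<String> accept(String baseUrl, String href) {
        URI resolved;
        try {
            resolved = URI.create(baseUrl).resolve(href.trim());
        } catch (IllegalArgumentException e) {
            // Malformed link: skip it instead of failing the crawl.
            return Optional.empty();
        }
        String scheme = resolved.getScheme();
        boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        if (!isHttp) {
            return Optional.empty();
        }
        String url = resolved.toString();
        // Set.add returns false when the URL was already present.
        return visited.add(url) ? Optional.of(url) : Optional.empty();
    }
}

class LinkFrontierDemo {
    public static void main(String[] args) {
        LinkFrontierSketch frontier = new LinkFrontierSketch();
        String base = "http://dytt8.net/";
        for (String href : new String[] {"/index.html", "/index.html", "javascript:void(0)"}) {
            System.out.println(href + " -> " + frontier.accept(base, href));
        }
    }
}
```

In the crawler loop, a sub-request would then be issued only when `accept` returns a value, which bounds the recursion to the set of distinct pages.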
#
Extension: you can implement this interface:
```java
public interface ResponseProcessChain {
    void process(Request request, Response response, Spider spider);
}
```
Register it in the Spider constructor:
```java
public Spider() {
    responseProcessor.registerProcessChain(new WebNotFoundProcessChain());
    responseProcessor.registerProcessChain(new MovedTemporarilyProcessChain());
}
```
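After a request made through `request` completes, the registered processors are invoked one by one in registration order.
Users can modify the request, response, and spider objects in each processor as needed.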
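To illustrate the registration-order behaviour described above, here is a minimal sketch of such a processor registry. All `Sketch*` types are hypothetical stand-ins for the project's `Request`, `Response`, `Spider`, and `responseProcessor`; the project's actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the project's Request/Response/Spider types.
class SketchRequest {
    String url;
    SketchRequest(String url) { this.url = url; }
}

class SketchResponse {
    int status;
    SketchResponse(int status) { this.status = status; }
}

class SketchSpider { }

// Same shape as ResponseProcessChain, renamed to avoid implying it is the project's type.
interface SketchProcessChain {
    void process(SketchRequest request, SketchResponse response, SketchSpider spider);
}

class SketchResponseProcessor {
    // An ArrayList preserves insertion order, which gives registration-order dispatch.
    private final List<SketchProcessChain> chains = new ArrayList<>();

    void registerProcessChain(SketchProcessChain chain) {
        chains.add(chain);
    }

    // Calls every registered processor in the order it was registered.
    void process(SketchRequest request, SketchResponse response, SketchSpider spider) {
        for (SketchProcessChain chain : chains) {
            chain.process(request, response, spider);
        }
    }
}

class ProcessChainDemo {
    public static void main(String[] args) {
        SketchResponseProcessor processor = new SketchResponseProcessor();
        // First processor: report missing pages.
        processor.registerProcessChain((req, res, spider) -> {
            if (res.status == 404) {
                System.out.println(req.url + " was not found");
            }
        });
        // Second processor: sees any changes made by the first one.
        processor.registerProcessChain((req, res, spider) ->
                System.out.println("finished processing " + req.url));
        processor.process(new SketchRequest("http://example.com/missing"),
                new SketchResponse(404), new SketchSpider());
    }
}
```

Because each processor receives the same mutable objects, a later processor observes whatever an earlier one changed, so the order of registration matters.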
2->
```java
spider.setConnectionTimeOutEvent((spider1, request) -> {
    System.out.println(request.getUrl() + " timed out");
});
```
An event implementing the following interface can be passed to this function:
```java
public interface ConnectionTimeOutEvent {
    void onTimeOut(Spider spider, Request request);
}
```
This event is invoked when a request times out.
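As a rough sketch of how such a callback could be wired up, the example below catches `java.net.SocketTimeoutException` and forwards the URL to the registered event. The names `TimeOutEventSketch`, `Fetcher`, and `TimeoutDispatchSketch` are hypothetical and simplified (the event receives only the URL); how the project actually detects timeouts is not shown in this README.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.SocketTimeoutException;

// Hypothetical, simplified counterpart of ConnectionTimeOutEvent.
interface TimeOutEventSketch {
    void onTimeOut(String url);
}

// Hypothetical abstraction over "perform the network call".
interface Fetcher {
    String fetch(String url) throws IOException;
}

class TimeoutDispatchSketch {
    // Default to a no-op so callers are not forced to register an event.
    private TimeOutEventSketch timeOutEvent = url -> { };

    void setConnectionTimeOutEvent(TimeOutEventSketch event) {
        this.timeOutEvent = event;
    }

    // Returns the fetched body, or null if the request timed out.
    String request(String url, Fetcher fetcher) {
        try {
            return fetcher.fetch(url);
        } catch (SocketTimeoutException e) {
            // A connect or read timeout surfaces as SocketTimeoutException.
            timeOutEvent.onTimeOut(url);
            return null;
        } catch (IOException e) {
            // Other I/O failures are not timeouts; propagate them unchanged in meaning.
            throw new UncheckedIOException(e);
        }
    }
}

class TimeoutDemo {
    public static void main(String[] args) {
        TimeoutDispatchSketch spider = new TimeoutDispatchSketch();
        spider.setConnectionTimeOutEvent(url -> System.out.println(url + " timed out"));
        // Simulate a timeout instead of touching the network.
        spider.request("http://example.com/", url -> {
            throw new SocketTimeoutException("simulated");
        });
    }
}
```

With a real `URLConnection`, the timeout limits themselves would be set through `setConnectTimeout` and `setReadTimeout` before reading the response.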
#
For large-scale crawling, the following code may become a performance bottleneck:
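```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Map;

// Opens a new connection for every request and copies the headers onto it.
public URLConnection send(String url, Map<String, String> headers) throws IOException {
    URLConnection connection = new URL(url).openConnection();
    for (Map.Entry<String, String> header : headers.entrySet()) {
        connection.setRequestProperty(header.getKey(), header.getValue());
    }
    return connection;
}
```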
Connection reuse will be considered later; for now, performance is not a concern.
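One possible direction for connection reuse, given that the project already requires JDK 11, is the standard `java.net.http.HttpClient`, which keeps connections open and reuses them across requests made through the same client instance. The sketch below is an assumption about how `send` might be adapted, not the author's plan; `ReusingSender` and `buildRequest` are hypothetical names.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;

// Hypothetical replacement for the per-request URLConnection approach.
class ReusingSender {
    // A single shared client lets the JDK reuse underlying connections.
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    // Builds a GET request carrying the given headers.
    // Note: HttpRequest rejects restricted headers such as "Host" or "Connection".
    static HttpRequest buildRequest(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        headers.forEach(builder::header);
        return builder.build();
    }

    // Returns the raw bytes so the caller can decode them with the page's charset (e.g. gb2312).
    HttpResponse<byte[]> send(String url, Map<String, String> headers)
            throws IOException, InterruptedException {
        return client.send(buildRequest(url, headers), HttpResponse.BodyHandlers.ofByteArray());
    }
}
```

Returning bytes rather than a decoded string keeps the charset decision with the caller, which matches how the earlier example passes `"gb2312"` to `toTextResponse`.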