{"id":15640635,"url":"https://github.com/onblog/aipa","last_synced_at":"2025-04-30T08:14:17.898Z","repository":{"id":43261720,"uuid":"150560134","full_name":"onblog/AiPa","owner":"onblog","description":"A compact, flexible Java multi-threaded crawler framework (Ai Pa), built-in Jsoup, zero-cost hands-on.一款小巧、灵活的Java多线程爬虫框架（AiPa）内嵌Jsoup 零成本上手（欢迎Star，🚫禁止Fork）","archived":false,"fork":false,"pushed_at":"2022-09-01T23:03:48.000Z","size":37,"stargazers_count":78,"open_issues_count":2,"forks_count":24,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-30T08:14:12.312Z","etag":null,"topics":["java-8","jsoup"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/onblog.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-27T09:12:54.000Z","updated_at":"2024-08-31T01:35:18.000Z","dependencies_parsed_at":"2022-07-15T00:45:58.295Z","dependency_job_id":null,"html_url":"https://github.com/onblog/AiPa","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onblog%2FAiPa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onblog%2FAiPa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onblog%2FAiPa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onblog%2FAiPa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/onblog","download_url":"https://codeload.github.com/onblog/AiPa/tar.gz/refs/heads/master","host":{"name":"GitHub","u
rl":"https://github.com","kind":"github","repositories_count":251666361,"owners_count":21624298,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java-8","jsoup"],"created_at":"2024-10-03T11:38:55.044Z","updated_at":"2025-04-30T08:14:17.877Z","avatar_url":"https://github.com/onblog.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AiPa (爱爬): a compact, flexible Java multi-threaded crawler framework\n\n## 1. Introduction\n\nAiPa is a compact, flexible, and highly extensible multi-threaded crawler framework.\n\nAiPa depends on Jsoup, arguably the simplest HTML parser available today.\n\nGiven a collection of URLs, AiPa crawls them automatically across multiple threads and handles some common exceptions along the way.\n\n## 2. Maven\nAdd the dependency directly:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.github.onblog\u003c/groupId\u003e\n    \u003cartifactId\u003eAiPa\u003c/artifactId\u003e\n    \u003cversion\u003e2.0.0.RELEASE\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## 3. Usage\n\nHere is a short but complete example program.\n\nThe interface you must implement:\n```java\npublic class MyAiPaWorker implements AiPaWorker {\n\n    @Override\n    public String run(Document doc, AiPaUtil util) {\n        // Parse the HTML with Jsoup to extract the div nodes and attributes you need\n        // Save them to a database or a local file\n        // The AiPaUtil helper can be used to request further URLs\n        return doc.title() + doc.body().text();\n    }\n\n    @Override\n    public Boolean fail(String link) {\n        // The task failed\n        // You can record the failed URL here\n        // and write a log entry\n        return false;\n    }\n}\n```\n\nThe main method:\n\n```java\n    public static void main(String[] args) throws InstantiationException, IllegalAccessException, ExecutionException, InterruptedException {\n        // Prepare the URL collection\n        List\u003cString\u003e linkList = new ArrayList\u003c\u003e();\n        linkList.add(\"http://xxx.com/123.html\");\n        
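linkList.add(\"http://xxx.com/456.html\");\n        linkList.add(\"http://xxx.com/789.html\");\n        // Step 1: create an AiPa instance\n        AiPaExecutor aiPaExecutor = AiPa.newInstance(new MyAiPaWorker()).setCharset(Charset.forName(\"GBK\"));\n        // Step 2: submit the tasks\n        for (int i = 0; i \u003c 10; i++) {\n            aiPaExecutor.submit(linkList);\n        }\n        // Step 3: read the return values\n        List\u003cFuture\u003e futureList = aiPaExecutor.getFutureList();\n        for (int i = 0; i \u003c futureList.size(); i++) {\n            // get() blocks the current thread until the result is available\n            System.out.println(futureList.get(i).get());\n        }\n        // Step 4: shut down the thread pool\n        aiPaExecutor.shutdown();\n    }\n```\n\n`AiPa.newInstance()` creates a new AiPa instance. It must be given an implementation of the AiPaWorker interface.\n\n### 3.1 The AiPaWorker interface\n\nAiPaWorker is the business-logic interface that users must implement.\n\nIts methods are:\n\n```java\npublic interface AiPaWorker\u003cT,S\u003e {\n    /**\n     * How should the crawled HTML document be parsed?\n     * @param doc the document provided by Jsoup\n     * @param util the crawler utility class\n     * @return the result of processing the document\n     */\n    T run(Document doc, AiPaUtil util);\n\n    /**\n     * Called when run() throws an exception.\n     * @param link the URL that failed\n     * @return the result of handling the failure\n     */\n    S fail(String link);\n}\n```\n\n`run()` is where you define how the crawled HTML is processed. Typically you parse it with Jsoup's Document class, extract nodes or attributes, and save them to a database or a local file. If the business logic needs to request another URL, use the AiPaUtil utility class.\n\n`fail()` is called when `run()` throws an exception, or when fetching the page fails and repeated retries do not help. Its argument is the URL that failed. Typically you log it here.\n\n### 3.2 Charset, maximum failure count, and request headers\n\nAfter obtaining an instance from AiPa, you can chain setters such as setCharset, setThreads, and setMaxFailCount. The table below explains each one:\n\n| Method              | Description                                                  |\n| ------------------- | ------------------------------------------------------------ |\n| **setThreads**      | Number of worker threads. Defaults to the CPU count + 1; you can also use CPU*2 and so on. |\n| **setMaxFailCount** | Maximum failure count, i.e. the total number of attempts made when crawling a page raises an exception. Defaults to 5. |\n| setCharset          | Page encoding; set this if you see garbled text. Defaults to UTF-8. |\n| setHeader           | Request headers; only accepts a Map\u003cString,String\u003e. Defaults to null. |\n| setMethod           | Request method. Defaults to Method.GET.                      |\n| 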
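setTimeout          | Timeout for the request and parsing. Defaults to 30 seconds. |\n| setUserAgent        | User-Agent for the request. Defaults to a desktop browser UA. |\n| setCookies        | Cookie collection. Defaults to null.                      |\n\nThese options are enough for most cases. If they are too limited for your needs, the next section offers a more flexible solution.\n\n### 3.3 Customizing the fetch method\n\nIn the example above, tasks are submitted with `submit()`, which by default fetches pages using Jsoup plus the non-bold settings from the table above. That is usually enough. Wrapping every Jsoup option one by one would be tedious, so instead the fetch method is exposed for users to override: you can fetch with whatever you like and set whatever options you need.\n\nA usage demo:\n\n```java\npublic class MyAiPaUtil extends AiPaUtil {\n\n    @Override\n    public Document getHtmlDocument(String link) throws IOException {\n        // You do not have to use Jsoup; any HTTP client works, as long as the result is converted to a Document\n        // You can also use Jsoup with custom settings\n        Connection connection = Jsoup.connect(link).method(Connection.Method.GET);\n        String body = connection.execute().charset(\"GBK\").body();\n        \n        return Jsoup.parse(body);\n    }\n\n}\n```\n\nThen call submit to submit the tasks, for example:\n\n```java\naiPaExecutor.submit(linkList, MyAiPaUtil.class);\n```\n\nNote: once you override the fetch method, the non-bold settings from section 3.2 no longer take effect.\n\n### 3.4 Reading return values and accessing the thread pool\n\nIf you want to read the return values to check whether tasks succeeded, see how the example program above does it.\n\n```java\npublic List\u003cFuture\u003e getFutureList()\n```\n\ngetFutureList() returns the collection of task results, each of which is a Future. Calling get() on a Future waits for the task to finish before returning its value, so it blocks the current thread. Future has other methods as well, such as get(long timeout, TimeUnit unit), which limits how long to wait.\n\n```java\npublic ExecutorService getExecutor()\n```\n\nThis method returns the ExecutorService thread pool that AiPa is currently using, so you can call any thread-pool methods you need on it directly.\n\n### 3.5 Handling exceptions while crawling\n\nExceptions during crawling are a real pain point. There are many possible causes: your own network may be unreliable, or the site's server may be. Some online advice suggests setting the Connection request header to close instead of keep-alive. In my experience crawling several hundred megabytes of data, that made no difference.\n\nSo I settled on a brute-force approach: retry. If one attempt fails, try a second; if that fails, try a third. As long as the page is capable of responding normally, this strategy rarely has problems. Of course, some page may simply never respond, so we set a maximum number of attempts. Once a URL exceeds it, give up, record it, and investigate later. The framework provides the fail() method for exactly this purpose.\n\n## 4. Test cases\n\nIn a plain Java SE test with no database, printing results to the console works fine.\n\nI also wrote a test case in Spring Boot that saves the crawled data to a database, and it runs without problems.\n\n```java\n@RunWith(SpringRunner.class)\n@SpringBootTest\npublic class InterApplicationTests {\n\n    @Autowired\n    private DemoResponse demoResponse;\n\n    @Test\n    public void context() throws 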
ExecutionException, InterruptedException {\n        AiPaExecutor executor = AiPa.newInstance(new AiPaWorker() {\n            @Override\n            public Boolean run(Document document, AiPaUtil util) {\n                String title = document.title();\n                demoResponse.save(new DemoEntity(title));\n                return true;\n            }\n\n            @Override\n            public Boolean fail(String s) {\n                demoResponse.save(new DemoEntity(s));\n                return false;\n            }\n        }).setCharset(Charset.forName(\"GBK\"));\n\n        List\u003cString\u003e linkList = new ArrayList\u003c\u003e();\n        linkList.add(\"http://xxx.com/2688.htm\");\n        linkList.add(\"http://xxx.com/2953.htm\");\n        linkList.add(\"http://xxx.com/2995.htm\");\n        linkList.add(\"http://xxx.com/2610.htm\");\n        linkList.add(\"http://xxx.com/3349.htm\");\n        executor.submit(linkList);\n\n        List\u003cFuture\u003e list = executor.getFutureList();\n        for (int i = 0; i \u003c list.size(); i++) {\n            // get() blocks the current thread until the result is available\n            System.out.println(list.get(i).get());\n        }\n        executor.shutdown();\n    }\n\n}\n```\n\nOutput:\n\n```\nHibernate: insert into demo (title) values (?)\nHibernate: insert into demo (title) values (?)\nHibernate: insert into demo (title) values (?)\nHibernate: insert into demo (title) values (?)\nHibernate: insert into demo (title) values (?)\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fonblog%2Faipa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fonblog%2Faipa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fonblog%2Faipa/lists"}