{"id":18885834,"url":"https://github.com/code4craft/codecraft","last_synced_at":"2026-02-23T09:30:18.868Z","repository":{"id":8154074,"uuid":"9574479","full_name":"code4craft/codecraft","owner":"code4craft","description":"codecraft repo","archived":false,"fork":false,"pushed_at":"2013-04-23T12:57:09.000Z","size":305,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-12-31T04:42:40.736Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/code4craft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-04-21T03:50:01.000Z","updated_at":"2017-04-21T08:14:28.000Z","dependencies_parsed_at":"2022-09-01T04:52:14.585Z","dependency_job_id":null,"html_url":"https://github.com/code4craft/codecraft","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code4craft%2Fcodecraft","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code4craft%2Fcodecraft/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code4craft%2Fcodecraft/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code4craft%2Fcodecraft/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/code4craft","download_url":"https://codeload.github.com/code4craft/codecraft/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239859011,"owners_count":19
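708857,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T07:22:24.267Z","updated_at":"2026-02-23T09:30:18.821Z","avatar_url":"https://github.com/code4craft.png","language":"Java","readme":"webmagic\n---------\n####*A web crawler toolkit*\n\nwebmagic grew out of needs at work. It aims to help developers build a vertical (site-specific) web crawler more easily. webmagic makes it easy to extract links and content with xpath and regular expressions; developers with a basic grounding in Java and in xpath or regular expressions can build a custom crawler with only a small amount of code.\n\n###Philosophy###\n\n* Write Less, Do more.\n\n\twebmagic is a toolkit for developers. Its goal is to let developers build a high-quality crawler with less code. webmagic also bundles some features commonly needed by vertical crawlers, such as the Readability technique for page body text, which can intelligently extract the main content of a page.\n\t\n\tThe following code crawls oschina blogs:\n\t\n\t\tSpider.me().processor(new SimplePageProcessor(\"http://my.oschina.net/\", \"http://my.oschina.net/*/blog/*\")).run();\n\n* Simple and usable\n\n\twebmagic covers the whole crawler lifecycle (link extraction, page download, content extraction, persistence) and is a complete crawler framework. Unlike other full-stack frameworks, however, webmagic introduces only a few conventions, and most features are used through simple API calls, with the aim of keeping the learning cost as low as possible. webmagic is distributed as a jar and does not depend on any framework, so it can be called from anywhere in a program.\n\n* Flexibility\n\n\tFollowing the design of scrapy, webmagic divides its extension points into four modules: processor, schedular, downloader, and pipeline. Extending these interfaces enables powerful customization. For example, multiple Spider instances can provide multithreaded crawling; extending schedular can provide resumable crawls and even distributed crawling; extending pipeline can provide persistence tailored to business needs.\n\t\n------\n\n###Get Started\n\t\nThe core of customizing webmagic is the PageProcessor interface. The simplest webmagic crawler looks like this:\n\n\tSpider.me().processor(new SimplePageProcessor(\"http://my.oschina.net/\", \"http://my.oschina.net/*/blog/*\")).run();\n\t\nSimplePageProcessor is implemented as follows:\n\n    public class SimplePageProcessor implements PageProcessor {\n\n        private String urlPattern;\n\n        private static final String UA = \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31\";\n\n        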
private Site site;\n\n        public SimplePageProcessor(String startUrl, String urlPattern) {\n            this.site = Site.me().setStartUrl(startUrl).\n                    setDomain(UrlUtils.getDomain(startUrl)).setUserAgent(UA);\n            this.urlPattern = \"(\"+urlPattern.replace(\".\",\"\\\\.\").replace(\"*\",\"[^\\\"'#]*\")+\")\";\n\n        }\n\n        @Override\n        public void process(Page page) {\n            List\u003cString\u003e requests = page.getHtml().as().rs(urlPattern).toStrings();\n            page.addTargetRequests(requests);\n            page.putField(\"title\", page.getHtml().x(\"//title\"));\n            page.putField(\"content\", page.getHtml().sc());\n        }\n\n        @Override\n        public Site getSite() {\n            return site;\n        }\n    }\n\n---\n\nTODO\n\n\n\t\tpublic class OschinaBlogPageProcesser implements PageProcessor {\n\n        @Override\n        public void process(Page page) {\n            List\u003cString\u003e strings = page.getHtml().rs(\"\u003ca[^\u003c\u003e]*href=[\\\"']{1}(http://my\\\\.oschina\\\\.net/\\\\w+/blog/\\\\d+)[\\\"']{1}\").toStrings();\n            page.addTargetRequests(strings);\n            page.putField(\"title\", page.getHtml().xs(\"//div[@class='BlogEntity']/div[@class='BlogTitle']/h1\"));\n            page.putField(\"content\", page.getHtml().sc());\n            page.putField(\"author\", page.getUrl().r(\"my\\\\.oschina\\\\.net/(\\\\w+)/blog/\\\\d+\"));\n        }\n\n        @Override\n        public Site getSite() {\n            return Site.me().setDomain(\"my.oschina.net\").setStartUrl(\"http://www.oschina.net/\").\n                    setUserAgent(\"Mozilla/5.0 (Macintosh; Chrome/26.0.1410.65 Safari/537.31\");\n        \t}\n    
\t}\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode4craft%2Fcodecraft","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcode4craft%2Fcodecraft","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode4craft%2Fcodecraft/lists"}